Blaisorblade programming thoughts<br />
<br />
Flexible Implicits (from Agda) for Scala (2013-01-19)<br />
[This is my first blog post about my new favorite programming language, Scala. The target audience for this post is advanced Scala users and programming language researchers familiar with the language.]<br />
<br />
Scala's implicit syntax has a few limitations. It's not just ugly - it reduces both elegance and expressiveness, especially in combination with path-dependent types. In this blog post, I will propose a general solution, taken from a language feature of Agda, a fully dependently-typed language. I even mention (very briefly) how my proposal allows extending <i>lightweight modular staging</i> (LMS) (Rompf and Odersky, 2010) to describe computations on different locations.<br />
<h2>
Problem #1: Bakery of doom</h2>
<div>
Let us consider this Scala code (a simplification of code which you end up writing when using macros - the real <span style="font-family: Courier New, Courier, monospace;">Context</span> type is <span style="font-family: Courier New, Courier, monospace;">scala.reflect.macros.Context</span>):</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">trait Context {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> type Tree</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">}</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">def treeTransf1(c: Context)(tree: c.Tree): c.Tree</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">def treeTransf2(c: Context)(tree: c.Tree): c.Tree</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">def treeTransf(c: Context)(tree: c.Tree): c.Tree =</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> treeTransf2(c)(treeTransf1(c)(tree))</span></div>
<div>
<br /></div>
<div>
In this code, what we really want to pass around is <span style="font-family: Courier New, Courier, monospace;">tree</span>, but since it has a path-dependent type, namely <span style="font-family: Courier New, Courier, monospace;">c.Tree</span>, we need to also pass around <span style="font-family: 'Courier New', Courier, monospace;">c: Context</span>. When this question was raised among Scala hackers, more than one proposed allowing this syntax:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">def treeTransf1(tree: c.Tree)(implicit c: Context): c.Tree</span></div>
<div>
<br /></div>
<div>
Adriaan Moors then explained that one could think of the above as:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">def </span><span style="font-family: 'Courier New', Courier, monospace;">treeTransf1</span><span style="font-family: Courier New, Courier, monospace;">[val c: Context](tree: c.Tree): c.Tree</span></div>
</div>
<div>
<br /></div>
<div>
because implicits are similar to type parameters, but did not propose actually using such a syntax. I'm proposing that syntax instead, or rather a variant:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">def </span><span style="font-family: 'Courier New', Courier, monospace;">treeTransf1</span><span style="font-family: Courier New, Courier, monospace;">[[c: Context]](tree: c.Tree): c.Tree</span></div>
<div>
<br />
<div>
Continuing the above example, I'd write:<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">def treeTransf2[[c: Context]](tree: c.Tree): c.Tree</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">def treeTransf[[c: Context]](tree: c.Tree): c.Tree =</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> treeTransf2(treeTransf1(tree))</span></div>
<div>
</div>
</div>
<div>
<br />
I will explain why it solves other problems as well. I am aware that the bakery of doom can be worked around in other ways; however, generalizing implicits is a more uniform and elegant solution.</div>
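For concreteness, the explicit-threading version above already works in today's Scala. Here is a runnable sketch; the `StrContext` object and its string-based `Tree` are toy stand-ins invented for this illustration, and the transformations are identities:

```scala
// Toy stand-in for scala.reflect.macros.Context; Tree stays abstract.
trait Context {
  type Tree
}

// Illustrative concrete context whose trees are plain strings.
object StrContext extends Context {
  type Tree = String
}

// Today's Scala: the context must be threaded explicitly, because
// `tree` has the path-dependent type c.Tree.
def treeTransf1(c: Context)(tree: c.Tree): c.Tree = tree // identity, for the sketch
def treeTransf2(c: Context)(tree: c.Tree): c.Tree = tree

def treeTransf(c: Context)(tree: c.Tree): c.Tree =
  treeTransf2(c)(treeTransf1(c)(tree))

// The compiler resolves StrContext.Tree to String:
val out: String = treeTransf(StrContext)("some tree")
```

This relies on dependent method types (standard since Scala 2.10): the result type of `treeTransf(StrContext)` mentions the term `StrContext`, which is exactly the machinery the proposed `[[c: Context]]` lists would reuse.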
<h2>
What's the new syntax?</h2>
<div>
<ul>
<li>It introduces a new-style implicit parameter list, signaled by double brackets <span style="font-family: 'Courier New', Courier, monospace;">[[paramName: ParamType]]</span>.</li>
<li>Such lists can be at any position among other parameter lists.</li>
<li>Analogously, I propose to allow lists of type parameters at any position. This way, type parameter bounds will be able to refer to parameters in previous parameter lists.</li>
<li>To specify explicitly an implicit argument for new-style implicit parameters, you'll need to bracket the argument with double brackets <span style="font-family: 'Courier New', Courier, monospace;">[[]]</span> again.</li>
<li>If you are worrying about source compatibility, you needn't: this is an alternative syntax, which can coexist with the existing one.</li>
<li>In all the examples given, such implicit parameters need to be inferred based on later parameters, using unification like type parameters. Unlike in Agda, only implicit values can be inferred in the current proposal — an alternative would be to allow values which are not marked as implicits if unification finds them as a unique solution.</li>
</ul>
<div>
<h2>
Why is it simpler?</h2>
</div>
</div>
<div>
<ul>
<li>It removes an arbitrary limitation, that is, having a single list of implicit parameters as the last one. Already today, I can work around this limitation by writing:<br /><span style="font-family: 'Courier New', Courier, monospace;">def bippy(arg1: Foo)(implicit arg2: Bar): Bar2 => Bippy</span><br />but such a function cannot be used conveniently; moreover, since it returns a function, it pays the overhead of partial function application. A call to the analogous:<br /><span style="font-family: 'Courier New', Courier, monospace;">def bippy(arg1: Foo)[[arg2: Bar]](arg3: Bar2): Bippy</span><br />would be erased to a standard method call.</li>
<li>In particular, this is simpler than the solution which was proposed:<br /><span style="font-family: 'Courier New', Courier, monospace;">def treeTransf1(tree: c.Tree)(implicit c: Context): c.Tree</span><br />which would be hard to support in Scalac, since parameters are used before being passed, and cannot be supported using the dependent function type constructor Pi, standard in dependently-typed lambda calculi.</li>
</ul>
</div>
<h2>
Problem #2: for what parameter is this argument?</h2>
<div>
Sometimes, for syntactic reasons, you need to specify implicit arguments explicitly - that's ugly and unnecessary.</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">def buildMap[K, V](data: Seq[(K, V)])(implicit ordK: Ordering[K]): Map[K, V]</span></div>
<div>
<br /></div>
<div>
Let's try to call this function and extract immediately an entry from this map:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">val m = buildMap(List(("Italy", "Euro"), ("US", "Dollar")))</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">m("Italy")</span></div>
<div>
But suppose we never use <span style="font-family: Courier New, Courier, monospace;">m</span> again. We'd like to rewrite the above code as:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">buildMap(List(("Italy", "Euro"), ("US", "Dollar")))</span><span style="font-family: 'Courier New', Courier, monospace;">("Italy")</span></div>
<div>
<br /></div>
<div>
but we can't - Scalac thinks "Italy" is an argument for the <span style="font-family: 'Courier New', Courier, monospace;">ordK </span>parameter, and complains because it has the wrong type. The reason is that at the call site implicit and explicit applications cannot be distinguished. With the extension I propose, we could write instead:</div>
<div>
<br class="Apple-interchange-newline" />
<span style="font-family: 'Courier New', Courier, monospace;">def buildMap[K, V](data: Seq[(K, V)])[[ordK: Ordering[K]]]: Map[K, V]</span></div>
<div>
<br /></div>
and call this as<br />
<div>
<br />
<div>
<span style="font-family: Courier New, Courier, monospace;">buildMap(List(("Italy", "Euro"), ("US", "Dollar")))</span><span style="font-family: 'Courier New', Courier, monospace;">("Italy")</span></div>
<div>
<br />
We can still specify ordK explicitly, but we need to use a different syntax for these implicit arguments.</div>
</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">buildMap(List(("Italy", "Euro"), ("US", "Dollar")))[[implicitly[Ordering[String]]]]</span><span style="font-family: 'Courier New', Courier, monospace;">("Italy")</span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
<div>
Here I just passed <span style="font-family: 'Courier New', Courier, monospace;">implicitly[Ordering[String]]</span> for simplicity, but other values are possible.</div>
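In today's Scala, the chained call only works if you spell out the implicit argument list explicitly, which is exactly the verbosity the proposed syntax avoids. A runnable sketch, where the `TreeMap`-based body of `buildMap` is my invention for illustration:

```scala
import scala.collection.immutable.TreeMap

// Illustrative implementation: a TreeMap genuinely needs the Ordering.
def buildMap[K, V](data: Seq[(K, V)])(implicit ordK: Ordering[K]): Map[K, V] =
  TreeMap(data: _*)(ordK)

// Naming the intermediate value always works:
val m = buildMap(List(("Italy", "Euro"), ("US", "Dollar")))
val euro = m("Italy")

// Direct chaining fails to compile, because "Italy" is taken
// as the ordK argument:
//   buildMap(List(("Italy", "Euro"), ("US", "Dollar")))("Italy")
// Today's workaround: pass the implicit argument list explicitly.
val euro2 =
  buildMap(List(("Italy", "Euro"), ("US", "Dollar")))(implicitly[Ordering[String]])("Italy")
```

With the proposed `[[ordK: Ordering[K]]]` list, the commented-out chained call would compile as-is, and the explicit form would use `[[...]]` instead.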
<h2>
Problem #3: tagging types to separate different worlds/regions/locations</h2>
<div>
Problem #1 is described as a consequence of using the cake pattern. However, the same problem can arise in different scenarios.<br />
Suppose you want to divide values of some type into different incompatible worlds and enforce statically that they cannot be mixed. While the need for this may not arise often, one example is in the reflection/macro API, where code being compiled (for instance in macros) should not be mixed with code being run (accessible through reflection) — see the <a href="http://docs.scala-lang.org/overviews/reflection/environment-universes-mirrors.html" target="_blank">documentation on reflection universes</a>. For instance, in the above code the type <span style="font-family: 'Courier New', Courier, monospace;">Tree</span> is parameterized on a value of type <span style="font-family: 'Courier New', Courier, monospace;">Context</span>. <span style="font-family: 'Courier New', Courier, monospace;">Tree</span>s from different <span style="font-family: 'Courier New', Courier, monospace;">Context</span>s should not be mixed together. The same applies to different <span style="font-family: 'Courier New', Courier, monospace;">Expr</span>s.<br />
<br />
Consider a method which, given two expressions, combines them together, producing an expression representing their sum:<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">def combine(c: Context)(tree1: c.Expr[Int], tree2: c.Expr[Int]): c.Expr[Int]</span><br />
<br />
(I'll ignore the possibility of using splicing, because splicing is specific to the reflection library, while the problem at hand is more general).<br />
The need to pass the context makes it less convenient to use. This would be more annoying if it were an infix operator. With our syntax, the problem disappears:<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">def combine[[c: Context]]</span><span style="font-family: 'Courier New', Courier, monospace;">(tree1: c.Expr[Int], tree2: c.Expr[Int]): c.Expr[Int]</span><br />
<br />
Now we can even make this a binary operator:<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">implicit class RichExpr</span><span style="font-family: 'Courier New', Courier, monospace;">[[c: Context]](tree1: c.Expr[Int]) {</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> def +</span><span style="font-family: 'Courier New', Courier, monospace;">(tree2: </span><span style="font-family: 'Courier New', Courier, monospace;">c.Expr[Int]</span><span style="font-family: 'Courier New', Courier, monospace;">): </span><span style="font-family: 'Courier New', Courier, monospace;">c.Expr[Int]</span><span style="font-family: 'Courier New', Courier, monospace;"><br />}</span><br />
<br />
Another example is handling of multiple access spaces. In various scenarios, you might want to distinguish between local memory and remote memory, which can be allocated in different locations. Converting between different locations requires copying the data and is thus expensive; hence such conversions should not happen implicitly, but should be inserted explicitly by the programmer, who should take care of having as few conversions as possible.<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">trait Location</span><span style="font-family: 'Courier New', Courier, monospace;"> {</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> type Rep[+T] //as in Lightweight modular staging.</span><span style="font-family: 'Courier New', Courier, monospace;"><br />}</span><br />
<div>
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;">trait Matrix[T: Numeric] {<br />}</span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;">implicit class RichMatrix</span><span style="font-family: 'Courier New', Courier, monospace;">[[l: Location]](m1: l.Rep[Matrix[Int]]) {</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> def *</span><span style="font-family: 'Courier New', Courier, monospace;">(m2: </span><span style="font-family: 'Courier New', Courier, monospace;">l.Rep[Matrix[Int]]</span><span style="font-family: 'Courier New', Courier, monospace;">): </span><span style="font-family: 'Courier New', Courier, monospace;">l.Rep[Matrix[Int]] =</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> ??? //omitted</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">}</span></div>
<div>
<br /></div>
<div>
Using such a technique, one can embed different type systems where a region separation is enforced — for instance phase types as described by Cohen et al. (2012), or (a generalization of) client/server locations, as used by Ibrahim et al. (2009) and Oleg Kiselyov (2010). In the latter case, using locations is necessary to track multiple hosts.</div>
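For comparison, here is how far today's Scala gets with explicit context threading: path-dependent types already rule out mixing expressions from different contexts, at the cost of passing the context everywhere. The contexts and the `lift` helper below are invented for this sketch:

```scala
// Explicit-context version that already works in today's Scala.
trait Context {
  type Expr[T]
  def lift[T](v: T): Expr[T] // invented helper to build expressions
}

// Combining requires both expressions to come from the same context c.
def combine(c: Context)(e1: c.Expr[Int], e2: c.Expr[Int]): c.Expr[Int] =
  e1 // a real implementation would build the sum expression

// Two distinct contexts; seen through the abstract member, their Expr
// types are incompatible even though the representation is the same.
val ctxA: Context = new Context {
  type Expr[T] = List[T]
  def lift[T](v: T): Expr[T] = List(v)
}
val ctxB: Context = new Context {
  type Expr[T] = List[T]
  def lift[T](v: T): Expr[T] = List(v)
}

val a1 = ctxA.lift(1)
val a2 = ctxA.lift(2)
val ok = combine(ctxA)(a1, a2)      // fine: both expressions come from ctxA
// combine(ctxA)(a1, ctxB.lift(3))  // rejected: ctxB.Expr[Int] is not ctxA.Expr[Int]
```

The proposed syntax keeps exactly this static separation while removing the explicit `c` argument at each call site.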
<div>
<h2>
Conclusion</h2>
<div>
I hope my examples suggest that this proposal, while simple and (in hindsight) pretty obvious, has several advantages over the current situation. What I find amazing is that while this is (or maybe only seems?) a matter of syntax, it affects the expressivity of the language, even though in programming language theory syntax is usually considered of little importance.</div>
<div>
Feedback is more than welcome.</div>
<h2>
Bibliography</h2>
<div class="p1">
Tiark Rompf and Martin Odersky. <i>Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs,</i> GPCE 2010.</div>
<div class="p1">
<br /></div>
<div class="p1">
Michael Cohen, Haitao Steve Zhu, Senem Ezgi Emgin, Yu David Liu, <i>Energy Types,</i> OOPSLA 2012.</div>
<div class="p1">
<br /></div>
<div class="p1">
Ali Ibrahim, Yang Jiao, Eli Tilevich, and William R. Cook. <i>Remote batch invocation for compositional object services</i>, ECOOP 2009.</div>
<div class="p1">
<br /></div>
<div class="p1">
Oleg Kiselyov, <i>Semi-implicit batched remote code execution as staging</i>. <a href="http://okmij.org/ftp/meta-future/">http://okmij.org/ftp/meta-future/</a>, 2010.</div>
</div>
</div>
"To submit your Computer Science research, please agree to submit no software" (2012-12-03)<br />
<br />
<div class="p1">
For services where you can post content, Terms of Service often require you to censor yourself, and that can make sense. But if they call for self-censorship in research, then I, as a researcher, am concerned, and other researchers should be, too.</div>
<div class="p2">
<br /></div>
<div class="p1">
When research papers are submitted to conferences for review and publication, they are often managed through some automatic <i>conference management system</i>, which researchers then <b>have to</b> use to submit a paper. If I want to use those automatic systems, I need to accept their terms of service (which already sounds potentially controversial, but I have much more serious issues to discuss). One of these systems, EasyChair, recently introduced some terms of service. Since the EasyChair website explicitly asks for feedback, I started skimming them and found the following problematic clause:</div>
<div class="p2">
<br /></div>
<div class="p1">
"5.10 You agree not to post any Content that</div>
<div class="p1">
[…]</div>
<div class="p1">
(c) contains invalid data, viruses, worms, or other software agents;</div>
<div class="p1">
[…]</div>
<div class="p1">
(e) contains unsolicited or unauthorised advertising, promotional</div>
<div class="p1">
materials, junk mail, spam, chain letters, pyramid schemes, or any</div>
<div class="p1">
other form of solicitation;</div>
<div class="p1">
(f) contains files or programs designed to interrupt, destroy or</div>
<div class="p1">
limit the functionality of any computer software or hardware or</div>
<div class="p1">
telecommunications equipment;</div>
<div class="p1">
[…]"<br />
<br />
Notice in particular point (c), from which comes the title of this post. So wait, I'm not supposed to post "invalid data" and that's discussed together with "viruses" and other "software agents"? Any Computer Science paper might include programs - in particular, source code for programs which are discussed in the paper; there is even research about "software agents". Also, PostScript documents are clearly software, and maybe PDF documents contain software agents as well; for instance, font faces are often software.</div>
<div class="p2">
<br /></div>
<div class="p1">
But OK, let's say that the point of these terms is just to forbid hackers from using the service, point (c) is really about <b>malicious</b> agents, and that they are just badly phrased. What would happen if I were a security researcher (which I am not)? Won't my research be possibly about viruses, and possibly contain actual viruses attached to the paper? I don't mean to harm the people accessing my content, but I'm still forbidden from attaching viruses even if that's relevant for my research and they're not harmful.</div>
<div class="p2">
<br /></div>
<div class="p1">
There were a couple of clauses I omitted, since I'm most confused about them:</div>
<div class="p1">
"5.10 You agree not to post any Content that</div>
<div class="p1">
(a) may constitute or contribute to a crime or tort;</div>
<div class="p1">
[…]</div>
<div class="p1">
(d) infringes any patent, trademark, trade secret, copyright or</div>
<div class="p1">
other proprietary rights of any party;"</div>
<div class="p1">
This seems to make sense, but last time I checked some crimes require computers. And those computers work only thanks to the work of many researchers. Now, are those researchers contributing to crime? If this sounds too philosophical, what if I publish a paper which breaks a cryptographic algorithm or reveals some vulnerability? It's clear that this "may contribute to a crime" when the crime is stealing encrypted data.</div>
<div class="p1">
I'm not sure about clause (d) on patents. It takes tons of money to check that my idea was not covered by some stupid patent with some strange phrasing, and I know no researcher who bothers doing that. Moreover, in Europe software patents are not valid by law, even the ones approved by the patent office, so why should I bother? Just because I need to make a company happy, which is not even publishing my research paper, just storing it for reviewers to see?</div>
<div class="p2">
<br /></div>
<div class="p1">
Disclaimer: I am not a lawyer, but as a researcher using this system I'm still supposed to read and accept these terms of service. If they're not a legal problem, how am I supposed to know? The company is supposed to make them as clear as possible so that I can accept them without problems.<br />
I also don't fear realistically being sued, should I ever violate these terms.<br />
On the other hand, I wouldn't be surprised if somebody attached a virus to his paper as "source material", some antivirus identified it and the author's account was terminated — by Murphy's law, right before a deadline.<br /><br />Finally, I am additionally contacting the EasyChair authors for feedback, but I still thought that making my message public would help anybody else concerned.</div>
<br />Automatic race detection (2010-07-28)<div>I wanted to recommend a cool Google tool, based on Valgrind, which performs automatic race detection:</div><a href="http://code.google.com/p/data-race-test/wiki/ThreadSanitizer">http://code.google.com/p/data-race-test/wiki/ThreadSanitizer</a><br />
<div>Also DRD and Helgrind exist (both part of Valgrind), but I've seen Helgrind considered "not very stable" more often than not (and it's still maybe a bit experimental). ThreadSanitizer, on the other hand, seems to be in active development and use, and it is backed by Google. And it even has Vim integration!<br />
I was reminded of this by a discussion about something similar to race detection, in a context halfway between Python and share-nothing concurrency: <a href="http://codespeak.net/pipermail/pypy-dev/2010q3/006037.html">http://codespeak.net/pipermail/pypy-dev/2010q3/006037.html</a><br />
I hope to get your comments on this tool.</div>Pretending you know what you're talking about - a book review (2010-07-27)<div style="margin: 0px;">Some books in Computer Science seem to do just that: pretending they know what they are talking about, when they in fact do not.<br />
<br />
<a href="http://norvig.com/java-iaq.html">A quote on this</a>, not restricted to Java books:</div><div style="margin: 0px;"><blockquote>Most Java books are written by people who couldn't get a job as a Java programmer (since programming almost always pays more than book writing; I know because I've done both). These books are full of errors, bad advice, and bad programs.</blockquote></div><div style="margin: 0px; text-align: right;">Peter Norvig, Director of Research at Google, Inc.</div><div style="margin: 0px;"><br />
</div><div style="margin: 0px;">I have seen such books myself. But they do not get great reviews at Amazon and they do not get <a href="http://developers.slashdot.org/article.pl?sid=02/06/25/1223234">slashdotted</a>.</div><div style="margin: 0px;">Instead, <a href="http://www.amazon.com/Virtual-Machine-Design-Implementation-Blunden/dp/1556229038/ref=cm_cr_pr_product_top"><b>Virtual Machine Design and Implementation in C/C++</b></a> does. Its author is great at pretending he knows what he is talking about. I started looking at it for my seminar on Virtual Machines, hoping to make some use of it, and I was deeply disappointed. Looking at the table of contents and index, <a href="http://www.amazon.com/Virtual-Machines-Versatile-Platforms-Architecture/dp/1558609105">Virtual Machines: Versatile Platforms for Systems and Processes</a> <i>seems</i> to be a better book on the topic, which also covers system virtualization (a different topic with the same name); however, I cannot really judge.<br />
<a href="http://www.amazon.com/Virtual-Machines-Iain-D-Craig/dp/1852339691">Virtual Machines, by Iain D. Craig</a>, seems instead devoted to semantic issues, and I am not qualified to judge that topic, only to say it is different.<br />
Back to the first book, after reading sample pages from Amazon preview (mostly the table of contents and the index), together with all Amazon reviews, I realized what is happening here.<br />
<br />
So, the rest of this post is a (meta-?)review against this book - which is much less interesting, unless you were actually considering to buy it.<br />
<br />
<hr /><br />
The author <b>does know</b> what he is talking about, and spent lots of time polishing it; he's just totally <b>unaware</b> that it is <b>completely unrelated</b> to the title of the book, since he has no experience in the field. Reading the introduction after reading the above quote is enlightening - the author mentions being poor (page xvii), and his experience (described on page iv), like writing CASE tools in Java, is totally unrelated - if he couldn't get a better job, I'd say Norvig's quote is exactly about him. And after reading this, the mention he makes about when "he used to disable paging and run on pure SDRAM" smells of a lamer wanting to show off (in other contexts, it could be just a joke, I know).<br />
<br />
The author is just trying to learn by himself how to implement an emulator, and writing a diary on this. If you care about real Virtual Machines (Parrot, the JVM, .NET, and so on), you need entirely different material. Say, JIT compilation. Other reviews mention some more points which are missing, but none of them had a real introduction to the field, so they are not aware of how much else is not in the book. Actually, maybe he knows how a VM looks from outside, but his attempts to move in that direction (like support for multithreading) look quite clumsy - he talks about that and then does not manage to implement it.<br />
<br />
Finally, the author seems to be an assembler programmer who is programming assembler in C++. As we remember, <a href="http://www.pbm.com/%7Elindahl/real.programmers.html">it is famously known that</a> "the determined Real Programmer can write Fortran programs in any language", and it is still true with Assembler. Things like manual loop unrolling on debug statements (mentioned in reviews) are quite funny.<br />
<br />
In the end, I'd recommend this book to nobody - it might contain some interesting stuff about the actual topic, as acknowledged by some reviews, but I would not buy a book on that hope alone. Especially not for anyone who cares about Virtual Machines.</div>JVM vs .NET CLR - a comparison between the VMs (2010-07-26)<br />
I was reading a Java vs. F# comparison on <a href="http://fsharpnews.blogspot.com/2010/05/java-vs-f.html">this post</a>, and it ended up comparing the JVM vs. the CLR. It also tries to counter some points of <a href="http://www.azulsystems.com/blog/cliff-click/2009-09-06-java-vs-c-performanceagain">a post by Cliff Click</a>, but it does so in a bad way. That's why I try here to improve on that comparison.<br />
<br />
The three real limitations of the JVM, compared to the CLR, are the lack of:<br />
<ul><li><b>tail calls</b> (being addressed), which are important for functional languages running on top of the JVM, like Clojure and Scala</li>
<li><b>value types</b>: I was happy to read in that post that there's interest in this, and I had already read about it in Guy Steele's <i><a href="https://docs.google.com/viewer?url=http%3A%2F%2Flabs.oracle.com%2Ffeatures%2Ftenyears%2Fvolcd%2Fpapers%2F14Steele.pdf">Growing a Language</a>.</i></li>
<li><b>proper generics</b>, without erasure: the problems were well known when generics were introduced, but backward compatibility with binary libraries forced the current solution, so this one is not going to be solved.</li>
</ul><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">As an additional point, it seems to me that since Java and the JVM are managed not by a single company but by the Java Community Process, the addition of new features like these is much slower (but hopefully they are better designed).<br />
<br />
</div><b>Combining generics with value types</b> would <b>allow</b> great memory (and thus performance) <b>savings</b>: one could define <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Pair&lt;K, V&gt;</span> as a value type and then use an <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">ArrayList&lt;Pair&lt;K, V&gt;&gt;</span> as the backing storage for a hash table and pay no space overhead.<br />
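The erasure point is easy to observe from Scala, which shares the JVM's erasure; the `Pair` class in this sketch is illustrative:

```scala
// Erasure in action: element types vanish at runtime, so the JVM sees
// a single class for every instantiation of a generic type.
val strings = List("a", "b")
val ints    = List(1, 2)
val sameRuntimeClass = strings.getClass == ints.getClass

// And a generic pair cannot be flattened into primitive fields:
// both components are boxed references at runtime.
case class Pair[A, B](a: A, b: B)
val p = Pair(1, 2L) // the Int and the Long are boxed
```

This is why value types alone would not be enough for the `Pair<K, V>` scenario above: without reified generics, the runtime has no per-instantiation layout to flatten.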
<br />
An additional point for <b>dynamic languages</b>: the lack of an <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b>invokedynamic</b></span> primitive creates significant performance problems - for instance, Jython (a JITted language) is 2x slower than CPython (an interpreted language with a slow interpreter). Anecdotal evidence suggests that lack of support for inline caching is an important reason: namely, a reimplementation of Python on top of the JVM for a course project, which allowed inline caching, was much faster.<br />
<br />
About <b>JVM </b>vs<b> CLR JITs</b>, discussing the quality of existing implementations: Cliff Click mentioned old anecdotal evidence of the CLR JIT being slower because .NET is geared towards static compilation and not so much effort has been put into it. I guess that Click refers to some optimizations which are missing from the CLR. For instance, at some point in the past replacing <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">List</span> (a class type) by <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">IList</span> (an interface type), in C#, caused a 10x slowdown that a good JIT compiler is able to remove, as discussed in Click's blog post. Dunno if this still holds.<br />
<br />
Anyway, this comparison is about the JVM vs. the <b>Microsoft</b> CLR implementation, <b>running only on Windows</b>. <b>Mono</b> is available for Linux, but it uses a partially conservative garbage collector based on Boehm's GC, and this gives really inferior results in some cases. The authors of Boehm's GC claim <a href="http://www.hpl.hp.com/personal/Hans_Boehm/gc/bounds.html">here</a> that such cases can be easily fixed by changing the source program, but this is valid only when the application is written for this GC, not when you want to support programs written for an accurate GC.<br />
Some evidence about this, where the described worst case is implemented, is available <a href="http://flyingfrogblog.blogspot.com/2009/01/mono-22-still-leaks-memory.html">here</a>.<br />
In the end, if you want to run on Linux, you currently have no real choice if you aim for the best and most robust performance (as also suggested by <a href="http://shootout.alioth.debian.org/fastest-programming-language.php?javasteady=on&java=on&csharp=on&calc=chart">this entry of the Language Shootout</a>, but read their notes about the flawedness of such comparisons). We have to hope for improvements in Mono.<br />
<br />
Teach Yourself Programming in Ten Years (2010-06-12)<br />
I came across this link by random browsing (Internet, you know?) and it's just wonderful.<br />
Just one quote, about the "Teach Yourself $Language_X in $small_number_of_days" books: "Bad programming is easy. Idiots can learn it in 21 days, even if they are dummies."<br />
<br />
<a href="http://norvig.com/21-days.html">http://norvig.com/21-days.html</a>Anonymoushttp://www.blogger.com/profile/04485097839438234853noreply@blogger.com0tag:blogger.com,1999:blog-7357044393624548611.post-79349112098352946952010-06-09T04:31:00.000+02:002010-06-09T22:58:15.382+02:00Stereotypes, averages and benchmarks<div>While reasoning on the idea of <a href="http://unerasmusparticolare.blogspot.com/2010/06/looking-for-german-germans.html">stereotypes and how useful they are to understand a different culture</a>, I realized that a stereotype, together with the problems in using it, is at least as bad as an average over a diverse set (here we'll ignore the fact that it's also a judgement passed by one culture on another, often without justification).</div><div><div><br /></div><div>And after realizing this, one sees connections with many issues in research and in Computer Science. For instance, averages of different benchmarks. It's often fine to benchmark two different algorithms on some data sets: if the algorithm domain is simple enough, the variation across different inputs is small, and the input can anyway be characterized by a small number of features. Think of sorting algorithms: the inputs you pass them can be random, Gaussian, ordered, reverse-ordered, and that's it more or less. OK, I'm oversimplifying a bit, I guess. But in other cases, the input space has a much higher number of dimensions.</div><div><br /></div><div>And then, even worse, benchmarks about programming languages. Not only do we have all these problems about sampling the input space, but in this case the input spaces of two language implementations, for different languages, are not the same, and they don't match in a meaningful way: they are not isomorphic, just like spoken languages. In other words, there is not a unique way of translating a program, and when you do, there's no way to make the pieces match one-by-one. 
There are infinite ways to solve a problem in any Turing-complete language (and a not-so-small number of reasonable alternatives), and an intrinsically different infinity of ways for another language. And maybe your translation is too slow because it's not appropriate for that language, and you should write the same program in a more idiomatic style.</div><div>The same concept is expressed <a href="http://shootout.alioth.debian.org/flawed-benchmarks.php">here</a> in a less abstract way.</div></div><div><br /></div><div>And in this situation, not only do we have actual benchmarks about programming languages, but even performance myths for or against Java, C#, Python, C, and so on. I.e., <i>stereotypes</i>, once again, about languages. And this time, mostly false.</div><div><br /></div><div>For instance, we could talk of people thinking that Java is slow and Python is more lightweight, while it's actually the other way around, as long as we speak of throughput. What those people think is only true about startup latency, and only partially about memory consumption (Python has prompter garbage collection because of refcounting, but its object layout could use major improvement). And now in this example, we see that not only is the input space high-dimensional, but even the performance space cannot be characterized by a single number. To compare memory consumption, we need a function giving the memory used for a given object set, for Java and for Python. And the object sets are, again, not isomorphic! We're done.</div><div><br /></div><div>Trying to sum up any result too much is going to give us lies. 
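As a tiny, concrete illustration of how summary numbers lie (with made-up timings): normalize two benchmark results to either language and take the arithmetic mean, and each side can be made to look slower than the other; the geometric mean at least gives a consistent answer under renormalization.

```python
# Made-up timings (seconds) for two benchmarks on languages A and B.
times_a = [1.0, 10.0]
times_b = [2.0, 5.0]

def arith_mean(xs):
    return sum(xs) / len(xs)

def geo_mean(xs):
    prod = 1.0
    for x in xs:
        prod *= x
    return prod ** (1.0 / len(xs))

# Per-benchmark time ratios, normalized each way.
b_over_a = [b / a for a, b in zip(times_a, times_b)]  # [2.0, 0.5]
a_over_b = [a / b for a, b in zip(times_a, times_b)]  # [0.5, 2.0]

# Arithmetic mean: each normalization claims the *other* side is slower.
print(arith_mean(b_over_a))  # 1.25 -> "B is 25% slower than A"
print(arith_mean(a_over_b))  # 1.25 -> "A is 25% slower than B"

# Geometric mean is consistent whichever side you normalize to.
print(geo_mean(b_over_a))    # 1.0
print(geo_mean(a_over_b))    # 1.0
```

This is a well-known pitfall of averaging normalized benchmark results; the numbers here are invented purely to show the effect.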
We can't help it; we should stop asking for simple answers to hard questions, and for <a href="http://en.wikipedia.org/wiki/No_Silver_Bullet">silver bullets</a>, and for a lot of other easy things, and instead work hard and enjoy the results.</div>Anonymoushttp://www.blogger.com/profile/04485097839438234853noreply@blogger.com0tag:blogger.com,1999:blog-7357044393624548611.post-61264902979646209722010-06-09T02:16:00.000+02:002010-10-09T00:03:53.875+02:00Why conservative garbage collection should be avoided<div>Sometimes one still sees conservative garbage collection around. I do not know how common that is, but <a href="http://mono-project.com/Compacting_GC">Mono</a> does use it, as did many JS implementations in the pre-V8 era (according to the V8 website). And that's really a pity!</div><div>A quick intro first: conservative GC means, by definition, that since you don't know which words are pointers and which are just random other stuff, you are <i>conservative</i> and just assume they might all be pointers.</div><div><div>There are three reasons why conservative GC is bad:</div></div><div><ul><li>First, and most obvious: you might have false positives, so you keep in memory more objects than you should. I'm not really sure how common this is, and how relevant. But it's not the really bad part. Two other problems come into play!</li>
<li>You cannot write a compacting collector, i.e. one that moves objects to be consecutive in memory at garbage collection time. Such a collector allows memory allocation to be much faster than with malloc()/free(): if all allocated objects form a contiguous sequence, so does the free memory, and allocation reduces to bumping a pointer.<br />
In contrast, malloc()/free() have to manage free lists of some kind, which track the available blocks of free memory, and search through them for a block of the appropriate size; the details vary with the algorithm used, but it's slower in any case. The fact that memory allocation is much faster with GC is one of its coolest features. Depending on the scenario, a program might even be faster with GC, and there are papers about that.</li>
<li>Another problem is that if you don't compact memory, you get memory fragmentation: free memory is scattered across many small blocks that may be of little use when allocating a bigger object, yet cannot be returned to any other program on the system. It's theoretically possible for a substantial fraction of your app's memory to be wasted by fragmentation, even if it's not so easy for me to find a realistic scenario where this happens.</li>
</ul></div>Anonymoushttp://www.blogger.com/profile/04485097839438234853noreply@blogger.com2tag:blogger.com,1999:blog-7357044393624548611.post-22539998999484443342010-05-04T19:07:00.000+02:002010-09-10T16:45:12.501+02:00Refcounting and weirdness (in Python and Go)<div>Reading Cliff Click's blog is always insightful. Thanks to his blog, I realized that a refcounted language would have a weird enough memory model, if it allowed real multithreading. Better yet: CPython folks still think there may be some way to remove the GIL and keep refcounting with its "nice properties", but here, we discover why they would get fundamentally braindead semantics.</div><div><br />
</div><div>Hey, this discussion could even be relevant to Google's new Go language, since they plan to use a (somewhat smart) form of reference counting, as shown in their <a href="http://golang.org/doc/devel/roadmap.html">roadmap</a>. The discussion below however still refers to CPython, since I never studied the details of deferred reference counting schemes. And luckily, it seems that Recycler (the algorithm they want to implement) does not suffer from this problem. Still, it is probably interesting to see the interplay of refcounting with other features. And this probably affects the reference counting scheme used in COM.</div><div><br />
</div><div>Let's suppose a version of CPython using refcounting for garbage collection, but without a Global Interpreter Lock, is produced somehow. I assume that a decently performing implementation would use atomic counters for refcounts, rather than protecting those counts with a lock. Let's call it "GIL-less refcounted CPython".</div><div>Now, clearing a reference in GIL-less refcounted CPython (i.e. setting it to null) would not be atomic. In fact, suppose T1 does this, in a situation where another thread T2 can access the pointer. T2 can read the pointer just before T1 clears it, and increment the refcount of the pointed-to object O1 just after T1 deleted O1 and its memory was reused for another object O2 (so no, you don't see a zero refcount, you just corrupt something unrelated). Possibly, T2 may even try to modify O1 and end up modifying O2.</div><div><br />
</div><div>Quoting from <a href="http://www.azulsystems.com/blog/cliff-click/2010-04-17-odds-n-ends-2">his post</a>:<br />
</div><div><span class="Apple-style-span" style="font-family: Helvetica,Arial,sans-serif; font-size: 11px;"></span><br />
<div style="font-size: 9pt; line-height: 13pt; margin: 0px; padding: 0px; text-align: left; text-decoration: none;"><span class="Apple-style-span" style="font-family: Helvetica,Arial,sans-serif; font-size: 11px;"><span style="font-size: 9pt; line-height: 13pt; margin: 0px; padding: 0px; text-align: left; text-decoration: none;"><b style="font-size: 9pt; line-height: 13pt; margin: 0px; padding: 0px; text-align: left; text-decoration: none;">"Why is ref-counting hard to use in managing concurrent* structures?"</b></span></span></div><div style="font-size: 9pt; line-height: 13pt; margin: 0px; padding: 0px; text-align: left; text-decoration: none;"><span class="Apple-style-span" style="font-family: Helvetica,Arial,sans-serif; font-size: 11px;"><br />
The usual mistake is to put the count in the structure being managed. The failing scenario is where one thread gets a pointer to the ref-counted structure at about the same time as the last owner is lowering the count and thus preparing to delete. Timeline: T1 gets a ptr to the ref-counted structure. T2 has the last read-lock on the structure and is done with it. T2 lowers the count to zero (T1 still holds a ptr to the structure but is stalled in the OS). T2 frees the structure (T1s pointer is now stale). Some other thread calls malloc and gets the just-freed memory holding the structure. T1 wakes up and increments where the ref-count used to be (but is now in some other threads memory).</span></div><span class="Apple-style-span" style="font-family: Helvetica,Arial,sans-serif; font-size: 11px;"></span></div><div><br />
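The timeline in the quote can be replayed deterministically. Below is a single-threaded Python sketch (all names invented for illustration) that models a tiny heap with embedded refcounts and interleaves T1's and T2's steps by hand; T2's late increment lands in memory that now belongs to an unrelated object.

```python
# Single-threaded simulation of the refcount race described above.
# A "heap cell" holds [refcount, payload]; freeing returns the cell
# to a free pool, from which a later allocation may reuse it.

free_pool = []

def alloc(payload):
    """Allocate a cell with refcount 1, reusing freed memory if any."""
    if free_pool:
        cell = free_pool.pop()
        cell[0], cell[1] = 1, payload
        return cell
    return [1, payload]

def decref(cell):
    cell[0] -= 1
    if cell[0] == 0:
        free_pool.append(cell)  # "free" the memory

# T1 (= T2 in the quote's numbering) holds the last reference to O1;
# the other thread grabs a pointer just before it is dropped.
o1 = alloc("O1")
t2_stale_ptr = o1          # T1 of the quote: reads the pointer, then stalls
decref(o1)                 # last decref; O1's cell is freed

# Another allocation reuses the very same memory for O2.
o2 = alloc("O2")
assert o2 is t2_stale_ptr  # same cell, new owner

# The stalled thread wakes up and increments "O1's" refcount:
# it corrupts O2 instead.
t2_stale_ptr[0] += 1
print(o2)  # [2, 'O2'] -- O2's refcount is now wrong
```

With real threads the interleaving is nondeterministic; this sketch just pins down the losing schedule.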
</div><div>Since in CPython refcounts are embedded in objects (and there is no other place to put them), this problem applies.</div><div>*I think that in the title he really means lockless structures, otherwise you can just use a lock to clear the pointer and free the object atomically wrt. readers. And that's the way you'd solve this in CPython: protect any pointer reachable by other threads with a lock.</div><div><br />
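The lock-based fix just mentioned can be sketched like this (a minimal illustration, not CPython's actual API): the thread dropping the reference and any reader take the same lock, so a reader can never observe the pointer between "last decref" and "freed".

```python
import threading

# Minimal sketch: a shared slot whose pointer is always read or
# cleared under one lock, making decref-and-free atomic wrt. readers.

class SharedSlot:
    def __init__(self, obj):
        self._lock = threading.Lock()
        self._obj = obj

    def clear(self, free):
        """Drop the reference and free the object in one critical section."""
        with self._lock:
            obj, self._obj = self._obj, None
            if obj is not None:
                free(obj)  # no reader can hold a stale pointer here

    def get(self):
        """Readers take the same lock, so they never see freed memory."""
        with self._lock:
            return self._obj

freed = []
slot = SharedSlot("O1")
slot.clear(freed.append)
print(slot.get(), freed)  # None ['O1']
```

The cost, of course, is exactly what the post complains about: every shared pointer access now pays for a lock.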
</div><div>While the above reasoning just says "hey, if you keep refcounting, removing the GIL will give weird semantics", there is more to notice. Namely, the Python program would probably segfault, or experience undefined behavior. Unlike in C, that's not supposed to happen in a managed language, no matter how buggy your script is. I.e. Python is supposed to be a safe language.</div><div><br />
</div><div>Contrast this with the Java Memory Model. The JMM guarantees that reference updates are atomic even on 64-bit platforms, so clearing a pointer is always atomic. Additionally, concurrency mistakes never cause totally undefined behavior.</div><div>This is inherited even by the Jython memory model, and in the absence of documentation, Jython users know they can rely on this both on Jython and on CPython. Our GIL-less refcounted CPython would break this.</div><div><br />
</div><div>The same scenario could also happen when making the reference point to a different object, but doing this without a lock is not always safe. For instance, to make sure that "this.o = new C();" is safe, the field 'o' must be volatile, or the access must be locked, to ensure that the writes initializing the new object are visible before the write to the field 'o'.</div>Anonymoushttp://www.blogger.com/profile/04485097839438234853noreply@blogger.com0tag:blogger.com,1999:blog-7357044393624548611.post-50267835049098917382009-01-07T23:54:00.000+01:002010-05-04T19:09:20.588+02:00Python VM: refcounting and the GILWhat was most surprising for me, when investigating Python's history, was that real multithreading is not supported. Better said, if you create multiple threads, they cannot run concurrently - the interpreter allows just one of them to run at any time, because a running thread must hold the Global Interpreter Lock.<br /><br />As discussed by <a href="http://www.python.org/doc/faq/library/#can-t-we-get-rid-of-the-global-interpreter-lock">an official FAQ entry</a>, removal of the GIL was attempted, but it gave a big slowdown (2x on Windows platforms, more on Linux, which was slower for locking at that time; NPTL probably fixed that). I don't remember any profiling results, but the discussion on the python-dev ML focused on what I think is the right problem: updating object refcounts in an atomic way. Luckily you don't need to acquire a lock for that, atomic operations suffice. But still, that's the biggest cost, and it has a huge impact. How can other languages (like Java) implement multithreading?<br /><br />The problem is that CPython is stuck with the slow technique of reference counting instead of fast, modern, real GCs, so removing the GIL for CPython makes it twice as slow. Fine-grained locking is slightly slower, but not by that factor. 
Atomic manipulation of reference counts is what killed the performance of GIL-free CPython (I can't provide benchmarks for this), and reference counting already slows down Python.<br /><br />Python developers still defend reference counting for its improved locality, which can improve performance; the reality is that the advantages of reduced memory usage and relative ease of implementation are outweighed by the slowdown it causes, as shown by years of research and by the practice of most current implementations.<br /><br />And even the ease-of-implementation issue, for extension developers, is debatable; I would like to compare the Python C extension API with that of OCaml, which has been described as powerful and easy to use by some of my colleagues, and is designed for use with a real garbage collector, but I haven't had time for that yet.<br /><br />Memory usage is the _only_ real drawback of garbage collectors: to get good performance (even better than C's explicit memory management) the heap size must be bigger, so that the collector is not invoked too often, and can reclaim lots of garbage on each scan.Anonymoushttp://www.blogger.com/profile/04485097839438234853noreply@blogger.com4tag:blogger.com,1999:blog-7357044393624548611.post-36361047032127024982009-01-07T06:37:00.000+01:002010-05-04T19:09:37.053+02:00Python VM simplicityIt has been quite fun, while studying for the Virtual Machines course with Lars Bak (the head designer of Google V8) at Aarhus University, and working on the project, to dig through the archives of python-dev about Python's history and its choices.<br /><br />For a VM developer, as for authors of optimizing compilers, the aim is to relieve developers from concerns about low-level optimizations, allowing them to concentrate on their programs and their high-level optimizations.<br /><br />It looks like Guido van Rossum has instead the concern to keep the Python interpreter (the main implementation, called CPython - other implementations exist) simple. 
So simple that a couple of groups in the course could prototype interpreters which are faster than the Python one. Obviously, support for some advanced features is missing, but at least my group tried to include everything with an impact on the fast path.<br />I'm not saying no optimization work is done on CPython, obviously, just complaining about the amount of it.<br /><br />And especially, I'm complaining about how braindead it is to suggest that developers who care about performance write performance-critical code in C:<br /><ul><li>That leads to more complexity for the users, because they have to give up simplicity or performance, or to switch to Jython (which is 2x slower than Python but supports threads) or to Java (which is anyway simpler than Python+C).</li><li>That leads to higher maintenance costs for the Python VM itself, because the amount of extension code written in C is much bigger.</li><li>Also, having lots of independent developers writing extensions in C makes the transition costs for API changes even higher.</li></ul>I want to elaborate on maintenance costs. One might think the tradeoff is just between the complexity of the Python VM core and that of extensions. Actually, it's far worse. It's a tradeoff of manpower.<br />There will likely be a few core developers working on the interpreter, and they're likely going to be the most experienced ones. And the core of an interpreter tends to stay small anyway.<br />Instead, authors of extensions (especially independent ones) will likely have a more varied skill set. And extensions can be many more, i.e. 
much more code.<br /><br />It's like this in the Linux kernel too: in the core, exceptionally complex optimizations are accepted (when they do improve performance) and people still tend to get them right, while there is much more driver code, much simpler, but often of poor quality, and causing most bugs.<br /><br />To show some numbers, in Python 2.7 SVN, I have 9.8M in the Modules/ folder (C module code) and 1.3M in the Python folder (C interpreter code).<br />Google V8, instead, writes as much as possible of the standard library in JavaScript, and saves the JIT-ed code to load it at startup.<br /><br />Finally, the implementation costs have been overestimated. Implementing a JIT, for instance, is not that complex. There are also ways to derive a JIT from the interpreter, basically by pasting together the pieces of the interpreter that handle the various opcodes (see for instance "context threaded JIT", mentioned in this post about Apple Safari's new JavaScript VM:<br />http://webkit.org/blog/214/introducing-squirrelfish-extreme/).<br /><br />A technique called "indirect threading" (which has nothing to do with multithreading) adds at least some of that performance, yet it still hasn't been implemented (though <a href="http://bugs.python.org/issue4753">this may be about to change</a>).<br /><br />And even a complete rewrite would not be as horrible as Python people think. Other big open-source projects do that all the time, or face extensive refactoring and changes of their core API quite often, like the Linux kernel; and the Samba code underwent a core rewrite some time ago.Anonymoushttp://www.blogger.com/profile/04485097839438234853noreply@blogger.com0
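To make the dispatch-technique discussion above concrete, here is a loose Python analogy (indirect threading proper needs C-level computed gotos; this toy just models the idea): instead of decoding each opcode with a switch on every execution, threaded code pre-resolves each opcode to its handler once, so the inner loop is only a series of indirect calls.

```python
# A toy bytecode: PUSH n, ADD. The same program run with two
# dispatch styles, to illustrate the idea behind threaded code.

PUSH, ADD = 0, 1
program = [(PUSH, 2), (PUSH, 3), (ADD, None)]

def run_switch(program):
    """Classic dispatch: decode every opcode with if/elif each time."""
    stack = []
    for op, arg in program:
        if op == PUSH:
            stack.append(arg)
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack[-1]

def op_push(stack, arg):
    stack.append(arg)

def op_add(stack, arg):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

HANDLERS = {PUSH: op_push, ADD: op_add}

def run_threaded(program):
    """"Threaded" style: resolve opcodes to handlers once, up front."""
    threaded = [(HANDLERS[op], arg) for op, arg in program]  # one-time decode
    stack = []
    for handler, arg in threaded:
        handler(stack, arg)  # inner loop: pure indirect calls
    return stack[-1]

print(run_switch(program), run_threaded(program))  # 5 5
```

In a real C interpreter the win comes from replacing a mispredicted central dispatch branch with per-opcode indirect jumps; the Python model only shows the structural difference, not the speedup.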