Wednesday, January 7, 2009

Python VM: refcounting and the GIL

What surprised me most, when investigating Python's history, was that real multithreading is not supported. More precisely, if you create multiple threads, they cannot run concurrently - the interpreter allows just one of them to run at a time, because a thread must hold the Global Interpreter Lock to execute Python code.

As discussed in an official FAQ entry, removing the GIL was attempted, but it caused a big slowdown (2x on Windows, more on Linux, whose locking primitives were slower at the time; NPTL has probably fixed that). I don't remember any profiling results, but the discussion on the python-dev mailing list focused on what I think is the right problem: updating object refcounts atomically. Luckily you don't need to acquire a lock for that; atomic operations suffice. Still, that's the biggest cost, and its impact is huge. How, then, can other languages (like Java) implement multithreading?

The problem is that CPython is stuck with reference counting, a slow technique compared to fast, modern, real GCs, so removing the GIL makes CPython twice as slow. Fine-grained locking itself is slightly slower, but not by that factor: it is the atomic manipulation of reference counts that killed the performance of GIL-free CPython (I can't provide benchmarks for this), and reference counting already slows Python down.

Python developers still defend reference counting for its improved locality, which can improve performance; in reality, the advantages of reduced memory usage and relative ease of implementation are outweighed by the slowdown it causes, as shown by years of research and by the practice of most current implementations.

And even the ease-of-implementation argument, for extension developers, is debatable; I would like to compare the Python C extension API with OCaml's, which some of my colleagues have described as powerful and easy to use, and which is designed around a real garbage collector, but I haven't had time for that yet.

Memory usage is the _only_ real drawback of garbage collectors: to get good performance (even better than explicit memory management in C), the heap must be bigger, so that the collector is not invoked too often and can reclaim lots of garbage on each scan.

Python VM simplicity

It has been quite fun, while studying for the Virtual Machines course taught by Lars Bak (the lead designer of Google's V8) at Aarhus University, and working on the course project, to dig through the python-dev archives for Python's history and design choices.

For a VM developer, as for authors of optimizing compilers, the aim is to relieve developers of concerns about low-level optimizations, letting them concentrate on their programs and on high-level optimizations.

It looks like Guido van Rossum is instead concerned with keeping the Python interpreter (the main implementation, called CPython - other implementations exist) simple. So simple that a couple of groups in the course could prototype interpreters faster than the Python one. Obviously, support for some advanced features is missing, but my group at least tried to include everything with an impact on the fast path.
I'm not saying that no optimization work is done on CPython, obviously; I'm just complaining about the amount of it.

And in particular, I'm complaining about how braindead it is to tell developers who care about performance to write their performance-critical code in C:
  • It leads to more complexity for users, because they have to give up either simplicity or performance, or switch to Jython (which is 2x slower than Python but supports threads) or to Java (which is anyway simpler than Python+C).
  • It leads to higher maintenance costs for the Python VM itself, because the amount of extension code written in C becomes much bigger.
  • Also, having lots of independent developers writing extensions in C makes the transition cost of API changes even higher.
I want to elaborate on maintenance costs. One could think the tradeoff is just between the complexity of the Python VM core and that of the extensions. Actually, it's far worse: it's a tradeoff of manpower.
There will likely be a few core developers working on the interpreter, and they're likely going to be the most experienced ones. And the core of an interpreter tends to stay small anyway.
Instead, authors of extensions (especially independent ones) will likely have more varied skill sets. And there can be many more extensions - that is, much more code.

It's like this in the Linux kernel too: while exceptionally complex optimizations are accepted in the core (when they do improve performance) and people still tend to get them right, there is much more driver code - much simpler, but often of poor quality, and the source of most bugs.

To give some numbers: in the Python 2.7 SVN tree, there are 9.8M of code in the Modules/ folder (C module code) and 1.3M in the Python/ folder (C interpreter code).
Google V8, instead, writes as much of the standard library as possible in JavaScript, and saves the compiled code in a snapshot that is loaded at startup.

Finally, the implementation cost has been overestimated. Implementing a JIT, for instance, is not that complex. It is even possible to derive a JIT from the interpreter, by basically pasting together the pieces of the interpreter that handle the various opcodes (see for instance the "context threaded JIT" mentioned in this post about Apple Safari's new JavaScript VM:
http://webkit.org/blog/214/introducing-squirrelfish-extreme/).

A technique called "indirect threading" (which has nothing to do with multithreading) recovers at least some of that performance, and it still hasn't been implemented (though this may be about to change).

And even a complete rewrite would not be as horrible as Python people think. Other big open-source projects do this all the time, or at least face extensive refactorings and frequent changes to their core APIs, like the Linux kernel; the Samba code also underwent a core rewrite some time ago.