Wednesday, January 7, 2009

Python VM: refcounting and the GIL

What has been more surprising for me, when investigating about Python history, was that real multithreading is not supported. Better said, if you create multiple thread, they cannot run concurrently - the interpreter allows just one of them to run most of the time, because all the threads hold the Global Interpreter Lock most of the time.

As discussed by an official FAQ entry, removal of the GIL was attempted, but it gave a big slowdown (2x on Windows platforms, more on Linux, which was slower for locking at that time; NPTL probably fixed that). I don't remember any profiling results, but the discussion on the python-dev ML focused on what I think is the right problem: updating object refcounts in an atomic way. Luckily you don't need to acquire a lock for that, atomic operations suffice. But still, that's the biggest cost, and it has such a big impact. How can other languages (like Java) implement multithreading?

The problem is that CPython is stuck with using the slow technique of reference counting instead of fast, modern, real GCs, so removing the GIL for CPython makes it twice as slow. Fine-grained threading is slightly slower, but not by that factor. Atomic manipulation of reference counts is what killed performance of GIL-free CPython (I can't provide benchmarks for this), and reference counting already slows down Python.

Python developers still defend reference counting for its improved locality, which can improve performance; the reality is that the advantages of reduced memory usage and relative ease of implementation are outweighed by the caused slowdown, as shown by years of research and by practice of most current implementations.

And even the ease of implementation issue, for extension developers, is debatable; I would like to compare Python C extension API with the one of OcaML, which has been described as powerful and easy to use by some of my colleagues, and is thought for usage of real garbage collector, but I didn't have time for that yet.

Memory usage is the _only_ real drawback of garbage collectors: to get good performance (even better than C explicit memory management) the heap size must be bigger, so that the collector is not invoked too often, and can reclaim lots of garbage for each scan.


  1. great post.

    unfortunatelly Unladden Swallow isn't yet addressing this problem:

    I tried Jython (as for multiprocessing though), but it doesn't have neither multiprocessing module nor supports pprocess...

    and the problem is still seen as applied optimisation domain ;)

  2. @vak: first, thanks for the comment (you're the first to comment my blog).

    About Jython, your statement seems contradictory: Jython has no GIL so you can use standard multithreading, while multiprocessing uses multiple processes, so it works happily in CPython. So, either you use CPython and multiprocessing, or you use Jython and threading; the docs claim they have a similar API, so a performance comparison should be easy; note that Jython is normally slower, it could have a better scalability, so I'd use Jython only for a high thread number, or if the working set of the problem is huge (forking does not mean copying the Python heap, but reference count updates cause that, as explained in the first thread you link).

    I fully agree with the reasons for which Unladen Swallow doesn't plan yet to remove the GIL.
    Also, what I said above does not address all performance concerns of GIL removal; each object in Python is a dictionary, and locking that dictionary would be slow if done naively. I have ideas for that, but I have yet to develop them.

  3. CPython and multiprocessing ain't a great love because of ref counters -- memory tends to explode where it is usually shared in other language implementations.

    P.S. do you know already about PyPy 1.2 that appeared today?

  4. That's why I suggested using Jython for huge working sets.
    About PyPy, yes, I'm subscribed to their list, but I was quite disappointed by their reception of my ideas on GIL removal, or on some other issues. Their effort is herculean, and now I hope they will succeed, but it seems that in some aspects they're reinventing the wheel of JIT compilation. They seemed a bit ignorant of history after the course with Lars Bak.

    For instance, they mentioned hash tables for method lookup as a big improvement for them, while in our course it was the slowest and most basic optimization in this respect. In that course, some of us wrote proof-of-concept interpreters much faster than CPython (and it's not the first time, nor that hard with the right choices), and hope that in the interpreter of my group I will be able to prove my ideas about GIL removal.