What has been more surprising for me, when investigating about Python history, was that real multithreading is not supported. Better said, if you create multiple thread, they cannot run concurrently - the interpreter allows just one of them to run most of the time, because all the threads hold the Global Interpreter Lock most of the time.
As discussed by an official FAQ entry, removal of the GIL was attempted, but it gave a big slowdown (2x on Windows platforms, more on Linux, which was slower for locking at that time; NPTL probably fixed that). I don't remember any profiling results, but the discussion on the python-dev ML focused on what I think is the right problem: updating object refcounts in an atomic way. Luckily you don't need to acquire a lock for that, atomic operations suffice. But still, that's the biggest cost, and it has such a big impact. How can other languages (like Java) implement multithreading?
The problem is that CPython is stuck with using the slow technique of reference counting instead of fast, modern, real GCs, so removing the GIL for CPython makes it twice as slow. Fine-grained threading is slightly slower, but not by that factor. Atomic manipulation of reference counts is what killed performance of GIL-free CPython (I can't provide benchmarks for this), and reference counting already slows down Python.
Python developers still defend reference counting for its improved locality, which can improve performance; the reality is that the advantages of reduced memory usage and relative ease of implementation are outweighed by the caused slowdown, as shown by years of research and by practice of most current implementations.
And even the ease of implementation issue, for extension developers, is debatable; I would like to compare Python C extension API with the one of OcaML, which has been described as powerful and easy to use by some of my colleagues, and is thought for usage of real garbage collector, but I didn't have time for that yet.
Memory usage is the _only_ real drawback of garbage collectors: to get good performance (even better than C explicit memory management) the heap size must be bigger, so that the collector is not invoked too often, and can reclaim lots of garbage for each scan.