Blaisorblade programming thoughts: 2010

Wednesday, July 28, 2010

Automatic race detection

I wanted to recommend a cool Google tool, based on Valgrind, which performs automatic race detection:

http://code.google.com/p/data-race-test/wiki/ThreadSanitizer

Also DRD and Helgrind exist (both part of Valgrind), but I've seen Helgrind considered "not very stable" more often than not (and it's still maybe a bit experimental). While ThreadSanitizer seems to be in active development and use, and it is supported by Google. And it even has Vim integration!
I remembered of this during this discussion about something similar to race detection, in a context halfway between Python and share-nothing concurrency: http://codespeak.net/pipermail/pypy-dev/2010q3/006037.html
I hope to get your comments on this tool.

Tuesday, July 27, 2010

Pretending you know what you're talking about - a book review

Some books in Computer Science seem to do just that: pretending they know what they are talking about, when they in fact do not.

A quote on this, not restricted to Java books:

Most Java books are written by people who couldn't get a job as a Java programmer (since programming almost always pays more than book writing; I know because I've done both). These books are full of errors, bad advice, and bad programs.

Peter Norvig, Director of Research at Google, Inc.

I have seen such books myself. But they do not get great reviews at Amazon and they do not get slashdotted.

Instead, Virtual Machine Design and Implementation in C/C++ does. Its author is great at pretending he knows what he is talking about. I started looking at it for my seminar on Virtual Machines, hoping to make some use of it, and I was deeply disappointed. Looking at the table of contents and index, Virtual Machines: Versatile Platforms for Systems and Processes seems to be a better book on the topic , which covers also system virtualization (a different topic having the same name), however I cannot really judge.
Virtual Machines, by Iain D. Craig, seems instead devoted to semantic issues, and I am not qualified to judge that topic, only to say it is different.
Back to the first book, after reading sample pages from Amazon preview (mostly the table of contents and the index), together with all Amazon reviews, I realized what is happening here.

So, the rest of this post is a (meta-?)review against this book - which is much less interesting, unless you were actually considering to buy it.

The author does know what he is talking about, and spent lots of time polishing it, he's just totally unaware that it is completely unrelated to the title of the book, since he has no experience in the field. Reading the introduction after reading the above quote is enlightening - the author mentions being poor (page xvii), and his experience (described at page iv), like writing CASE tools in Java, is totally unrelated - if he couldn't get a better job, I'd say Norvig's quote is exactly about him. And after reading this, the mention he makes about when "he used to disable paging and run on pure SDRAM" smells of a lamer wanting to show off (in other contexts, it could be just a joke, I know).

The author is just trying to learn by himself how to implement an emulator, and writing a diary on this. If you care about real Virtual Machines (Parrot, the JVM, .NET, and so on), you need entirely different material. Say, JIT compilation. Other reviews mention some more points which are missing, but none of them had a real introduction to the field, so they are not aware of how much else is not in the book. Actually, maybe he knows how a VM looks from outside, but his attempts to move in that direction (like support for multithreading) look quite clumsy - he talks about that and then does not manage to implement it.

Finally, the author seems to be an assembler programmer who is programming assembler in C++. As we remember, it is famously known that "the determined Real Programmer can write Fortran programs in any language", and it is still true with Assembler. Things like manual loop unrolling on debug statements (mentioned in reviews) are quite funny.

In the end, I'd recommend this book to nobody - it might contain some interesting stuff about the actual topic, asa acknowledged by some reviews, but I would not buy a book because of this hope. Especially, not for who cares about Virtual Machines.

Monday, July 26, 2010

JVM vs .NET CLR - a comparison between the VMs

I was reading a Java vs. F# comparison on this post, and it ended up comparing the JVM vs. the CLR. It also tries to counter some points of a post by Cliff Click, but it does so in a bad way. That's why I try here to improve on that comparison.

The three real limitations of the JVM, compared to the CLR, are the lack of:

tail calls (being addressed), which are important for functional languages running on top of the JVM, like Clojure and Scala
value types: I was happy to read on that post that there's interest on that, and I already read this on Guy Steele's Growing a Language.
proper generics, without erasure: problems were well-known when generics were introduced, but back-compatibility with binary libraries forced the current solution, so this one is not going to be solved.

As an additional, it seems to me that since Java and the JVM is not managed by a single company, but by the Java Community Process, addition of new features like these is much slower (but hopefully they are better designed).

Combining generics with value types would allow great memory (and thus performance) savings: one could define Pair as a value type and then use an ArrayList> as the backing storage for an hash table and pay no space overhead.

An additional point for dynamic languages, the lack of an invokedynamic primitive creates significant performance problems - for instance, Jython (a JITted language) is 2x slower than CPython (an interpreted language with a slow interpreter). Anecdotal evidence suggests that lack of support for inline caching is an important reason: namely, a reimplementation of Python on top of the JVM for a course project, which allowed inline caching, was much faster.

About JVM vs CLR JITs, discussing the quality of existing implementations: Cliff Click mentioned old anecdotal evidence of the CLR JIT being slower because .NET is geared towards static compilation and not so much effort has been put into it. I guess that Click refers to some optimizations which are missing from the CLR. For instance, at some point in the past replacing List (a class type) by IList (an interface type), in C#, caused a 10x slowdown that a good JIT compiler is able to remove, as discussed in Click's blog post. Dunno if this still holds.

Anyway, this comparison is about the JVM vs. the Microsoft CLR implementation, running only on Windows. Mono is available for Linux, but it uses a partially conservative garbage collector based on Boehm's GC, and this gives really inferior results in some cases. The Boehm's authors claim here that such cases can be easily fixed by changing the source program, but this is valid only when the application is written for this GC, not when you want to support programs written for an accurate GC.
Some evidence about this, where the described worst case is implemented, is available here.
In the end, if you want to run over Linux, you have no real choice if you aim for the best and most robust performance, currently (as also suggested by this entry of the Language Shootout, but read their notes about the flawedness of such comparisons). We have to hope for improvements in Mono.

Saturday, June 12, 2010

Teach Yourself Programming in Ten Years

I came across this link by random browsing (Internet, you know?) and it's just wonderful.
Just one quote, about the "Teach Yourself $Language_X in $small_number_of_days" books: "Bad programming is easy. Idiots can learn it in 21 days, even if they are dummies."

http://norvig.com/21-days.html

Wednesday, June 9, 2010

Stereotypes, averages and benchmarks

While reasoning on the idea of stereotypes and how useful they are to understand a different culture, I realized that a stereotype, together with problems in using it, is at least as as bad as an average over a diverse set (here we'll ignore the fact that it's also a judgement by a culture on another culture, often presuming without reason that). And as we know,

And after realizing this, one sees connections with many issues in research and in Computer Science. For instance, averages of different benchmarks. It's often fine to benchmark two different algorithms on some data sets: if the algorithm domain is simple enough, the variation on different inputs is small, and the input can anyway be characterized by a small number of features. Think of sorting algorithms: the inputs you pass them can be random, Gaussian, ordered, reverse-ordered, and that's it more or less. OK, I'm oversimplifying a bit, I guess. But in other cases, the input space has a much higher number of dimensions.

And then, even worse, benchmarks about programming languages. Not only we have all these problems about sampling the input space, but in this case the input spaces of two language implementations, for different languages, are not the same, and they don't match in a meaningful way: they are not isomorphic, just like spoken languages. In other words, there is not a unique way of translating a program, and when you do, there's no way to make the pieces match one-by-one. There are infinite ways to solve a problem in any Turing-complete language (and a not-so-small number of reasonable alternative), and an intrinsically different infinity of ways for another language. And maybe, your translation is too slow because it's not appropriate for that language, and you should write the same program in a more idiomatic style.

The same concept is expressed here in a less abstract way.

And in this situation, not only we have actual benchmarks about programming languages, but even performance myths for or against Java, C#, Python, C, and so on. I.e., stereotypes, once again, about languages. And this time, mostly false.

For instance, we could talk of people thinking that Java is slow and Python is more lightweight, while it's actually the other way around, as long as we speak of throughput. What those people think is only true about startup latency, and only partially about memory consumption (Python has prompter garbage collection because of refcounting, but its object layout could benefit some huge improvement). And now in this example, we see that not only the input space is high-dimensional, but that even the performance space cannot be characterized by a single number. To compare memory consumption, we need to give a function of the used memory for a given object set, for Java and for Python. And the object sets are, again, not isomorphic! We're over.

Trying to sum up too much any result is going to give us lies. We can't help it; we should stop asking for simple answers to hard questions, and for silver bullets, and for a lot of other easy things, and go instead to work hard and enjoy the results.

Why conservative garbage collection should be avoided

Sometimes one still sees around conservative garbage collection. I do not know how common that is, but Mono does use it, as did many JS implementation in the pre-V8 era (according to V8 website). And that's really a pity!

A quick intro first: conservative GC means, by definition, that since you don't know which words are pointers and which are just random other stuff, you are conservative and just assume they might all be pointers.

There are two reasons why conservative GC is bad:

First, and most obvious: you might have false positives, so you keep in memory more objects than you should. I'm not really sure how common this is, and how relevant. But it's not the really bad thing. Two other problems come into play!
You may not write a compacting collector, i.e. one that moves objects to be consecutive in memory at garbage collection time. Such a collector allows memory allocation to be much faster than with malloc()/free(): if all allocated objects form a sequence, also free memory does. Then you know where you can get free memory.
In contrast, malloc()/free() have to manage free lists of some kinds, which list the available blocks of free memory, and perform a search through them for a block of the appropriate size; the actual details of this vary, depending on the used algorithm, but it's anyway slower. And the fact that memory allocation is much faster with GC is one of the coolest things about it. Depending on the scenario, a program might even be faster with GC, and there are papers about that.
Another problem is that if you don't compact memory, you get memory fragmentation: free memory is fragmented in more small blocks that may be of little use when allocating a bigger object, and still may not be used by any other program on the system. It's theoretically possible that a substantial fraction of the used memory of your app is just wasted by fragmentation, even if it's not so easy for me to find a realistic scenario where this happens.

Tuesday, May 4, 2010

Refcounting and weirdness (in Python and Go)

Reading Cliff Click's blog is always insightful. Thanks to his blog, I realized that a refcounted language would have a weird enough memory model, if it allowed real multithreading. Better yet: CPython folks still think there may be some way to remove the GIL and keep refcounting with its "nice properties", but here, we discover why they would get fundamentally braindead semantics.

Hey, this discussion could even be relevant about Google's new Go language, since they plan to use a (somehow smart) reference counting, as shown in their roadmap. The below discussion however still refers to CPython, since I never studied the details of deferred reference counting schemes. And it seems that Recycler (the algorithm they want to implement) does not suffer of this problem, luckily. Still, it is probably still interesting to see the interplay of refcounting with other features. And this probably affects the reference counting scheme used in COM.

Let's suppose a version of CPython using refcounting for garbage collection, but without a Global Interpreter Lock, is produced somehow. I assume that a decently performing implementation would use atomic counters for refcounts, rather than protecting those counts with a lock. Let's call it "GIL-less refcounted CPython".

Now, clearing a reference in GIL-less refcounted CPython (i.e. setting it to null) would not be atomic. In fact, suppose T1 does this, in a situation where another thread T2 can access the pointer. T2 can read the pointer just before you clear it, and increment the refcount of the pointed-to object O1 just after T1 deleted O1 and its memory was reused for another object O2 (so no, you don't see a zero refcount, you just corrupt something unrelated). Possibly, those other threads may even try to modify O1 and end up modifying O2.

Quoting from its post:

"Why is ref-counting hard to use in managing concurrent* structures?"

The usual mistake is to put the count in the structure being managed. The failing scenario is where one thread gets a pointer to the ref-counted structure at about the same time as the last owner is lowering the count and thus preparing to delete. Timeline: T1 gets a ptr to the ref-counted structure. T2 has the last read-lock on the structure and is done with it. T2 lowers the count to zero (T1 still holds a ptr to the structure but is stalled in the OS). T2 frees the structure (T1s pointer is now stale). Some other thread calls malloc and gets the just-freed memory holding the structure. T1 wakes up and increments where the ref-count used to be (but is now in some other threads memory).

Since in CPython refcounts are embedded in objects (and it can't be done in other ways), this problem applies.

*I think he really means, in the title, lockless structures, otherwise you can just use a lock to clear the pointer and free the object atomically wrt. readers. And that's the way you'd solve this in CPython: protect any pointer reachable by other threads with a lock.

While the above reasoning just says "hey, if you keep refcounting, removing the GIL will give weird semantics", there is more to notice. Namely, the Python program would probably segfault, or experience undefined behavior. And that's not supposed to happen in a managed language, no matter how buggy is your script, unlike in C. I.e. Python is supposed to be a safe language.

Contrast this with the Java Memory Model. The JMM guarantees that reference updates are atomic even on 64bit platforms, so clearing a pointer is always atomic. Additionally, concurrency mistakes never cause totally undefined behavior.

This is inherited even by the Jython memory model, and in absence of documentation, Jython users know they can rely on this both on Jython and on CPython. Our GIL-less refcounted CPython would break this.

The same scenario could also happens when making the reference point to a different object, but doing this without a lock is not always safe. For instance, to make sure that "this.o = new C();" is safe, the field 'o' must be volatile, or the access must be locked, to ensure that the writes initializing the new object are visible before the write to the field 'o'.