Blaisorblade programming thoughts: June 2010

Saturday, June 12, 2010

Teach Yourself Programming in Ten Years

I came across this link by random browsing (Internet, you know?) and it's just wonderful.
Just one quote, about the "Teach Yourself $Language_X in $small_number_of_days" books: "Bad programming is easy. Idiots can learn it in 21 days, even if they are dummies."

http://norvig.com/21-days.html

Wednesday, June 9, 2010

Stereotypes, averages and benchmarks

While reasoning on the idea of stereotypes and how useful they are to understand a different culture, I realized that a stereotype, together with problems in using it, is at least as as bad as an average over a diverse set (here we'll ignore the fact that it's also a judgement by a culture on another culture, often presuming without reason that). And as we know,

And after realizing this, one sees connections with many issues in research and in Computer Science. For instance, averages of different benchmarks. It's often fine to benchmark two different algorithms on some data sets: if the algorithm domain is simple enough, the variation on different inputs is small, and the input can anyway be characterized by a small number of features. Think of sorting algorithms: the inputs you pass them can be random, Gaussian, ordered, reverse-ordered, and that's it more or less. OK, I'm oversimplifying a bit, I guess. But in other cases, the input space has a much higher number of dimensions.

And then, even worse, benchmarks about programming languages. Not only we have all these problems about sampling the input space, but in this case the input spaces of two language implementations, for different languages, are not the same, and they don't match in a meaningful way: they are not isomorphic, just like spoken languages. In other words, there is not a unique way of translating a program, and when you do, there's no way to make the pieces match one-by-one. There are infinite ways to solve a problem in any Turing-complete language (and a not-so-small number of reasonable alternative), and an intrinsically different infinity of ways for another language. And maybe, your translation is too slow because it's not appropriate for that language, and you should write the same program in a more idiomatic style.

The same concept is expressed here in a less abstract way.

And in this situation, not only we have actual benchmarks about programming languages, but even performance myths for or against Java, C#, Python, C, and so on. I.e., stereotypes, once again, about languages. And this time, mostly false.

For instance, we could talk of people thinking that Java is slow and Python is more lightweight, while it's actually the other way around, as long as we speak of throughput. What those people think is only true about startup latency, and only partially about memory consumption (Python has prompter garbage collection because of refcounting, but its object layout could benefit some huge improvement). And now in this example, we see that not only the input space is high-dimensional, but that even the performance space cannot be characterized by a single number. To compare memory consumption, we need to give a function of the used memory for a given object set, for Java and for Python. And the object sets are, again, not isomorphic! We're over.

Trying to sum up too much any result is going to give us lies. We can't help it; we should stop asking for simple answers to hard questions, and for silver bullets, and for a lot of other easy things, and go instead to work hard and enjoy the results.

Why conservative garbage collection should be avoided

Sometimes one still sees around conservative garbage collection. I do not know how common that is, but Mono does use it, as did many JS implementation in the pre-V8 era (according to V8 website). And that's really a pity!

A quick intro first: conservative GC means, by definition, that since you don't know which words are pointers and which are just random other stuff, you are conservative and just assume they might all be pointers.

There are two reasons why conservative GC is bad:

First, and most obvious: you might have false positives, so you keep in memory more objects than you should. I'm not really sure how common this is, and how relevant. But it's not the really bad thing. Two other problems come into play!
You may not write a compacting collector, i.e. one that moves objects to be consecutive in memory at garbage collection time. Such a collector allows memory allocation to be much faster than with malloc()/free(): if all allocated objects form a sequence, also free memory does. Then you know where you can get free memory.
In contrast, malloc()/free() have to manage free lists of some kinds, which list the available blocks of free memory, and perform a search through them for a block of the appropriate size; the actual details of this vary, depending on the used algorithm, but it's anyway slower. And the fact that memory allocation is much faster with GC is one of the coolest things about it. Depending on the scenario, a program might even be faster with GC, and there are papers about that.
Another problem is that if you don't compact memory, you get memory fragmentation: free memory is fragmented in more small blocks that may be of little use when allocating a bigger object, and still may not be used by any other program on the system. It's theoretically possible that a substantial fraction of the used memory of your app is just wasted by fragmentation, even if it's not so easy for me to find a realistic scenario where this happens.