Saturday, January 19, 2013

Flexible Implicits (from Agda) for Scala

[This is my first blog post about my new favorite programming language, Scala. The target audience is advanced Scala users and programming language researchers familiar with the language.]

Scala's implicit syntax has a few limitations. It's not just ugly - it reduces both elegance and expressiveness, especially in combination with path-dependent types. In this blog post, I propose a general solution, taken from a language feature of Agda, a fully dependently-typed language. I even mention (very briefly) how my proposal allows extending lightweight modular staging (LMS) (Rompf and Odersky, 2010) to describe computations on different locations.

Problem #1: Bakery of doom

Let us consider this Scala code (a simplification of the code you end up writing when using macros - the real Context type is scala.reflect.macros.Context):

trait Context {
  type Tree
}
def treeTransf1(c: Context)(tree: c.Tree): c.Tree
def treeTransf2(c: Context)(tree: c.Tree): c.Tree
def treeTransf(c: Context)(tree: c.Tree): c.Tree =
  treeTransf2(c)(treeTransf1(c)(tree))

In this code, what we really want to pass around is tree, but since it has a path-dependent type, namely c.Tree, we also need to pass around c: Context. When this question was raised among Scala hackers, more than one proposed allowing this syntax:

def treeTransf1(tree: c.Tree)(implicit c: Context): c.Tree

Adriaan Moors then explained that one could think of the above as:

def treeTransf1[val c: Context](tree: c.Tree): c.Tree

because implicits are similar to type parameters, but did not propose actually using such a syntax. I'm proposing that syntax instead, or rather a variant:

def treeTransf1[[c: Context]](tree: c.Tree): c.Tree

Continuing the above example, I'd write:

def treeTransf2[[c: Context]](tree: c.Tree): c.Tree
def treeTransf[[c: Context]](tree: c.Tree): c.Tree =
  treeTransf2(treeTransf1(tree))

I will explain why it solves other problems as well. I am aware that the bakery of doom can be worked around in other ways - one such workaround is sketched below - but generalizing implicits is both more general and more elegant.
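For concreteness, here is one such workaround as a minimal runnable sketch (the identity bodies are stand-ins for real transformations): fixing the context once in a wrapper class lets all its methods share the same path-dependent c.Tree.

trait Context {
  type Tree
}

// Today's workaround: fix the context once, in a class, so that c no
// longer needs to be threaded through every call.
class TreeTransforms(val c: Context) {
  def treeTransf1(tree: c.Tree): c.Tree = tree // stub: identity
  def treeTransf2(tree: c.Tree): c.Tree = tree // stub: identity
  def treeTransf(tree: c.Tree): c.Tree =
    treeTransf2(treeTransf1(tree))
}

The price is an extra object and the loss of the lightweight method syntax, which is what the proposal below recovers.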

What's the new syntax?

  • It introduces a new-style implicit parameter list, signaled by double brackets [[paramName: ParamType]].
  • Such lists can be at any position among other parameter lists.
  • Analogously, I propose to allow lists of type parameters at any position. This way, type parameter bounds will be able to refer to parameters in previous parameter lists.
  • To explicitly specify an argument for a new-style implicit parameter, you'll need to bracket the argument with double brackets [[...]] again.
  • If you are worrying about source compatibility, you needn't: this is an alternative syntax, which can coexist with the existing one.
  • In all the examples given, such implicit parameters need to be inferred from later parameters, using unification, like type parameters. Unlike in Agda, only implicit values can be inferred in the current proposal; an alternative would be to also allow values not marked as implicit, whenever unification finds a unique solution.

Why is it simpler?

  • It removes an arbitrary limitation, namely that a method may have only a single list of implicit parameters, and only as the last one. Already today, I can work around this limitation by writing:
    def bippy(arg1: Foo)(implicit arg2: Bar): Bar2 => Bippy
    but such a function cannot be used conveniently; moreover, since it returns a function, it pays the overhead of partial function application (see the runnable sketch after this list). A call to the analogous:
    def bippy(arg1: Foo)[[arg2: Bar]](arg3: Bar2): Bippy
    would be erased to a standard method call.
  • In particular, this is simpler than the previously proposed syntax:
    def treeTransf1(tree: c.Tree)(implicit c: Context): c.Tree
    which would be hard to support in Scalac, since the parameter c is used before it is declared; such a signature cannot be expressed through the dependent function type constructor Pi, standard in dependently-typed lambda calculi.
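To make the bippy comparison concrete, here is the curried workaround in runnable form (Foo, Bar, Bar2, Bippy are placeholder types, and the body is a stub):

trait Foo; trait Bar; trait Bar2; trait Bippy

// Today's encoding: the parameter after the implicit list must be taken
// through a returned function, allocating a closure on every call.
def bippy(arg1: Foo)(implicit arg2: Bar): Bar2 => Bippy =
  arg3 => ??? // stub body

// Call sites are awkward, too: bippy(foo)(bar2) does not compile, because
// bar2 is taken as the implicit argument (this is Problem #2 below). One
// must write bippy(foo).apply(bar2), or fill the implicit list explicitly:
//   bippy(foo)(implicitly[Bar])(bar2)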

Problem #2: which parameter does this argument belong to?

Sometimes, for syntactic reasons, you need to specify implicit arguments explicitly - that's ugly and unnecessary.

def buildMap[K, V](data: Seq[(K, V)])(implicit ordK: Ordering[K]): Map[K, V]

Let's try to call this function and extract immediately an entry from this map:

val m = buildMap(List(("Italy", "Euro"), ("US", "Dollar")))
m("Italy")
But suppose we never use m again.  We'd like to rewrite the above code as:

buildMap(List(("Italy", "Euro"), ("US", "Dollar")))("Italy")

but we can't - Scalac thinks "Italy" is an argument for the ordK parameter, and complains because it has the wrong type. The reason is that at the call site implicit and explicit applications cannot be distinguished. With the extension I propose, we could write instead:

def buildMap[K, V](data: Seq[(K, V)])[[ordK: Ordering[K]]]: Map[K, V]

and call this as

buildMap(List(("Italy", "Euro"), ("US", "Dollar")))("Italy")

We can still specify ordK explicitly, but we need to use a different syntax for such explicit implicit arguments:

buildMap(List(("Italy", "Euro"), ("US", "Dollar")))[[implicitly[Ordering[String]]]]("Italy")

Here I passed implicitly[Ordering[String]] just for simplicity; other values are possible.
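For contrast, here is the status quo as a runnable sketch (the body is a stub that ignores ordK):

def buildMap[K, V](data: Seq[(K, V)])(implicit ordK: Ordering[K]): Map[K, V] =
  data.toMap // stub: a real implementation might build a sorted map via ordK

val m = buildMap(List(("Italy", "Euro"), ("US", "Dollar")))
m("Italy") // fine as two statements

// buildMap(List(("Italy", "Euro"), ("US", "Dollar")))("Italy")
//   rejected: "Italy" is taken as the ordK argument

buildMap(List(("Italy", "Euro"), ("US", "Dollar")))(implicitly[Ordering[String]])("Italy")
//   accepted: the implicit list is filled explicitly, then the Map is applied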

Problem #3: tagging types to separate different worlds/regions/locations

Problem #1 is usually described as a consequence of using the cake pattern. However, the same problem can arise in different scenarios.
Suppose you want to divide values of some type into different, incompatible worlds and statically enforce that they cannot be mixed. While the need for this may not arise often, one example is the reflection/macro API, where code being compiled (for instance in macros) should not be mixed with code being run (accessible through reflection) - see the documentation on reflection universes. For instance, in the above code the type Tree is parameterized on a value of type Context. Trees from different Contexts should not be mixed together. The same applies to different Exprs.

Consider a method which, given two expressions, combines them together, producing an expression representing their sum:

def combine(c: Context)(tree1: c.Expr[Int], tree2: c.Expr[Int]): c.Expr[Int]

(I'll ignore the possibility of using splicing, because splicing is specific to the reflection library, while the problem at hand is more general.)
The need to pass the context makes the method less convenient to use, and this would be more annoying still if it were an infix operator. With our syntax, the problem disappears:

def combine[[c: Context]](tree1: c.Expr[Int], tree2: c.Expr[Int]): c.Expr[Int]

Now we can even make this a binary operator:

implicit class RichExpr[[c: Context]](tree1: c.Expr[Int]) {
  def +(tree2: c.Expr[Int]): c.Expr[Int] = ??? // omitted
}
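For contrast, a runnable sketch of what one typically does in today's Scala: nest the enrichment inside the Context cake itself, so that the context is the enclosing this (the sum primitive is an assumption made for illustration):

trait Context {
  type Expr[T]
  def sum(e1: Expr[Int], e2: Expr[Int]): Expr[Int] // assumed primitive

  // inside the cake, Expr means this.Expr, so no context threading is needed
  implicit class RichExpr(tree1: Expr[Int]) {
    def +(tree2: Expr[Int]): Expr[Int] = sum(tree1, tree2)
  }
}

The cost is that every such operation must live inside the cake - exactly the coupling that the proposed syntax removes.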


Another example is the handling of multiple address spaces. In various scenarios, you might want to distinguish between local memory and remote memory, which can be allocated in different locations. Converting between different locations requires copying the data and is thus expensive; hence such conversions should not happen implicitly, but should be inserted explicitly by the programmer, who should take care to use as few conversions as possible.

trait Location {
  type Rep[+T] // as in Lightweight Modular Staging
}

class Matrix[T: Numeric] // a trait cannot have a context bound, so Matrix is a class

implicit class RichMatrix[[l: Location]](m1: l.Rep[Matrix[Int]]) {
  def *(m2: l.Rep[Matrix[Int]]): l.Rep[Matrix[Int]] =
    ??? // omitted
}
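Since the [[l: Location]] syntax does not exist yet, here is a runnable approximation of the same discipline in today's Scala (names are illustrative): the location is threaded explicitly, as the context was in Problem #1, and crossing locations requires a visible copy.

trait Location {
  type Rep[+T] // as in Lightweight Modular Staging

  // crossing locations is explicit, so its cost stays visible to the programmer
  def copyFrom[T](other: Location)(x: other.Rep[T]): Rep[T]
}

class Matrix[T: Numeric] // as above; body omitted

def multiply(l: Location)(m1: l.Rep[Matrix[Int]],
                          m2: l.Rep[Matrix[Int]]): l.Rep[Matrix[Int]] =
  ??? // omitted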

Using such a technique, one can embed type systems which enforce a region separation - for instance phase types, as described by Cohen et al. (2012), or (a generalization of) client/server locations, as used by Ibrahim et al. (2009) and Kiselyov (2010). In the latter case, using locations is necessary to track multiple hosts.

Conclusion

I hope my examples suggest that this proposal, while simple and (in hindsight) pretty obvious, has several advantages over the current situation. What I find amazing is that, while this is (or maybe only seems to be?) a matter of syntax, it affects the expressivity of the language - even though, in programming language theory, syntax is usually considered of little importance.
Feedback is more than welcome.

Bibliography:

Tiark Rompf and Martin Odersky. Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs. GPCE 2010.

Michael Cohen, Haitao Steve Zhu, Senem Ezgi Emgin, and Yu David Liu. Energy types. OOPSLA 2012.

Ali Ibrahim, Yang Jiao, Eli Tilevich, and William R. Cook. Remote batch invocation for compositional object services. ECOOP 2009.

Oleg Kiselyov. Semi-implicit batched remote code execution as staging. http://okmij.org/ftp/meta-future/, 2010.

Monday, December 3, 2012

"To submit your Computer Science research, please agree to submit no software"



For services where you can post content, Terms of Service often require you to censor yourself, and that can make sense. But if they call for self-censorship in research, then I, as a researcher, am concerned, and other researchers should be, too.

When research papers are submitted to conferences for review and publication, they are often managed through some automatic conference management system, which researchers then have to use to submit their papers. If I want to use those systems, I need to accept their terms of service (which already sounds potentially controversial, but I have much more serious issues to discuss). One of these systems, EasyChair, recently introduced new terms of service. Since the EasyChair website explicitly asks for feedback, I started skimming them and found the following problematic clause:

"5.10 You agree not to post any Content that
[…]
  (c) contains invalid data, viruses, worms, or other software agents;
[…]
  (e) contains unsolicited or unauthorised advertising, promotional
  materials, junk mail, spam, chain letters, pyramid schemes, or any
  other form of solicitation;
  (f) contains files or programs designed to interrupt, destroy or
  limit the functionality of any computer software or hardware or
  telecommunications equipment;
[…]"

Notice in particular point (c), from which comes the title of this post. So wait, I'm not supposed to post "invalid data", and that's discussed together with "viruses" and other "software agents"? Any Computer Science paper might include programs - in particular, source code for the programs discussed in the paper; there is even research about "software agents". Also, PostScript documents are clearly software, and maybe PDF documents contain software as well; for instance, font faces are often software.

But OK, let's say that the point of these terms is just to forbid hackers from using the service, that point (c) is really about malicious agents, and that it is just badly phrased. What would happen if I were a security researcher (which I am not)? Wouldn't my research possibly be about viruses, and possibly have actual viruses attached to the paper? I wouldn't mean to harm the people accessing my content, but I'd still be forbidden from attaching viruses, even when that's relevant for my research and they can do no harm.

There are a couple of clauses I omitted, since I'm most confused about them:
"5.10 You agree not to post any Content that
  (a) may constitute or contribute to a crime or tort;
[…]
  (d) infringes any patent, trademark, trade secret, copyright or
  other proprietary rights of any party;"
This seems to make sense, but last time I checked, some crimes required computers. And those computers work only thanks to the work of many researchers. Now, are those researchers contributing to crime? If this sounds too philosophical: what if I publish a paper which breaks a cryptographic algorithm or reveals some vulnerability? It's clear that this "may contribute to a crime" when the crime is stealing encrypted data.
I'm also not sure about clause (d) on patents. It takes tons of money to check that my idea is not covered by some stupid patent with strange phrasing, and I know no researcher who bothers doing that. Moreover, in Europe software patents are not valid by law, even the ones approved by the patent office, so why should I bother? Just to make a company happy - one which is not even publishing my research paper, just storing it for reviewers to see?

Disclaimer: I am not a lawyer, but as a researcher using this system I'm still supposed to read and accept these terms of service. If these clauses are not actually a legal problem, how am I supposed to know? The company is supposed to make them as clear as possible, so that I can accept them without worries.
I also don't fear realistically being sued, should I ever violate these terms.
On the other hand, I wouldn't be surprised if somebody attached a virus to their paper as "source material", some antivirus identified it, and the author's account was terminated - by Murphy's law, right before a deadline.

Finally, I am also contacting the EasyChair authors directly with feedback, but I thought that making my message public would help anybody else concerned.

Wednesday, July 28, 2010

Automatic race detection

I wanted to recommend a cool Google tool, based on Valgrind, which performs automatic race detection:
http://code.google.com/p/data-race-test/wiki/ThreadSanitizer
DRD and Helgrind (both part of Valgrind) also exist, but I've seen Helgrind called "not very stable" more often than not (and it's still maybe a bit experimental), while ThreadSanitizer seems to be in active development and use, and is supported by Google. It even has Vim integration!
I was reminded of this during a discussion about something similar to race detection, in a context halfway between Python and share-nothing concurrency: http://codespeak.net/pipermail/pypy-dev/2010q3/006037.html
I hope to get your comments on this tool.

Tuesday, July 27, 2010

Pretending you know what you're talking about - a book review

Some books in Computer Science seem to do just that: pretending they know what they are talking about, when they in fact do not.

A quote on this, not restricted to Java books:
Most Java books are written by people who couldn't get a job as a Java programmer (since programming almost always pays more than book writing; I know because I've done both). These books are full of errors, bad advice, and bad programs.
Peter Norvig, Director of Research at Google, Inc.

I have seen such books myself. But they do not get great reviews at Amazon and they do not get slashdotted.
Instead, Virtual Machine Design and Implementation in C/C++ does. Its author is great at pretending he knows what he is talking about. I started looking at it for my seminar on Virtual Machines, hoping to make some use of it, and I was deeply disappointed. Judging from the table of contents and index, Virtual Machines: Versatile Platforms for Systems and Processes seems to be a better book on the topic, one which also covers system virtualization (a different topic having the same name), though I cannot really judge.
Virtual Machines, by Iain D. Craig, seems instead devoted to semantic issues, and I am not qualified to judge that topic, only to say it is different.
Back to the first book: after reading sample pages from the Amazon preview (mostly the table of contents and the index), together with all the Amazon reviews, I realized what is happening here.

So, the rest of this post is a (meta-?)review of this book - which is much less interesting, unless you were actually considering buying it.



The author does know what he is talking about, and he spent lots of time polishing it; he's just totally unaware that it is completely unrelated to the title of the book, since he has no experience in the field. Reading the introduction after reading the above quote is enlightening - the author mentions being poor (page xvii), and his experience (described on page iv), like writing CASE tools in Java, is totally unrelated. If he couldn't get a better job, I'd say Norvig's quote is exactly about him. And after reading this, his mention of how "he used to disable paging and run on pure SDRAM" smells of a lamer wanting to show off (in other contexts, it could be just a joke, I know).

The author is just trying to teach himself how to implement an emulator, and writing a diary about it. If you care about real Virtual Machines (Parrot, the JVM, .NET, and so on), you need entirely different material - say, JIT compilation. Other reviews mention some more points which are missing, but none of the reviewers had a real introduction to the field, so they are not aware of how much else is not in the book. Actually, maybe he knows how a VM looks from the outside, but his attempts to move in that direction (like support for multithreading) look quite clumsy - he talks about it and then does not manage to implement it.

Finally, the author seems to be an assembly programmer who is writing assembly in C++. As we remember, "the determined Real Programmer can write Fortran programs in any language", and that is still true with assembly. Things like manual loop unrolling of debug statements (mentioned in reviews) are quite funny.

In the end, I'd recommend this book to nobody. It might contain some interesting material on its actual topic, as acknowledged by some reviews, but I would not buy a book on that hope alone - especially not if you care about Virtual Machines.

Monday, July 26, 2010

JVM vs .NET CLR - a comparison between the VMs

I was reading a Java vs. F# comparison in this post, and it ended up comparing the JVM vs. the CLR. It also tries to counter some points of a post by Cliff Click, but it does so badly. That's why I try here to improve on that comparison.

The three real limitations of the JVM, compared to the CLR, are the lack of:
  • tail calls (being addressed), which are important for functional languages running on top of the JVM, like Clojure and Scala
  • value types: I was happy to read in that post that there's interest in them; I had already read about the idea in Guy Steele's Growing a Language.
  • proper generics, without erasure: the problems were well known when generics were introduced, but backward compatibility with binary libraries forced the current solution, so this one is not going to be solved.
Additionally, it seems to me that since Java and the JVM are managed not by a single company but by the Java Community Process, the addition of new features like these is much slower (but hopefully they are better designed).

Combining generics with value types would allow great memory (and thus performance) savings: one could define Pair as a value type and then use an ArrayList<Pair<K, V>> as the backing storage for a hash table and pay no space overhead.
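As a small illustration in Scala, which shares the JVM's erasure: in the sketch below, each logical entry costs a heap-allocated Pair object plus a boxed Integer; reified generics over value types would let the runtime lay the entries out flat instead.

import scala.collection.mutable.ArrayBuffer

final case class Pair[K, V](key: K, value: V)

// With erasure, the buffer stores pointers to Pair objects, and the Int
// component is additionally boxed to java.lang.Integer.
val backing = new ArrayBuffer[Pair[Int, String]]
backing += Pair(1, "one")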

An additional point, for dynamic languages: the lack of an invokedynamic primitive creates significant performance problems - for instance, Jython (JIT-compiled) is 2x slower than CPython (an interpreter, and a slow one at that). Anecdotal evidence suggests that the lack of support for inline caching is an important reason: a reimplementation of Python on top of the JVM for a course project, which allowed inline caching, was much faster.
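To clarify what inline caching buys, here is a toy monomorphic inline cache in Scala (all names invented for illustration): each call site remembers the last receiver "shape" and the method it resolved to, so repeated calls on same-shaped receivers skip the dictionary lookup.

// a dynamic object's "shape": its method dictionary
class Shape(val methods: Map[String, Any => Any])

class CallSite(name: String) {
  private var cachedShape: Shape = null
  private var cachedMethod: Any => Any = null

  def invoke(shape: Shape, arg: Any): Any = {
    if (shape ne cachedShape) {   // miss: fall back to the slow lookup
      cachedShape = shape
      cachedMethod = shape.methods(name)
    }
    cachedMethod(arg)             // hit: jump straight to the cached target
  }
}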

About JVM vs. CLR JITs, discussing the quality of the existing implementations: Cliff Click mentioned old anecdotal evidence that the CLR JIT is slower, because .NET is geared towards static compilation and not much effort has been put into its JIT. I guess that Click refers to some optimizations which are missing from the CLR. For instance, at some point in the past, replacing List (a class type) by IList (an interface type) in C# caused a 10x slowdown, of a kind that a good JIT compiler is able to remove, as discussed in Click's blog post. Dunno if this still holds.

Anyway, this comparison is about the JVM vs. Microsoft's CLR implementation, which runs only on Windows. Mono is available for Linux, but it uses a partially conservative garbage collector based on Boehm's GC, and this gives markedly inferior results in some cases. Boehm's authors claim here that such cases can be easily fixed by changing the source program, but that is valid only when the application is written for this GC, not when you want to support programs written for an accurate GC.
Some evidence about this, where the described worst case is implemented, is available here.
In the end, if you currently want to run on Linux and aim for the best and most robust performance, you have no real choice (as also suggested by this entry of the Language Shootout, but read their notes about how flawed such comparisons are). We have to hope for improvements in Mono.

Saturday, June 12, 2010

Teach Yourself Programming in Ten Years

I came across this link by random browsing (Internet, you know?) and it's just wonderful.
Just one quote, about the "Teach Yourself $Language_X in $small_number_of_days" books: "Bad programming is easy. Idiots can learn it in 21 days, even if they are dummies."

http://norvig.com/21-days.html

Wednesday, June 9, 2010

Stereotypes, averages and benchmarks

While reasoning about stereotypes and how useful they are for understanding a different culture, I realized that a stereotype, with all the problems in using it, is at least as bad as an average over a diverse set (here we'll ignore the fact that it's also a judgement passed by one culture on another, often presumed without reason). And as we know, averages over diverse sets can be deeply misleading.

And after realizing this, one sees connections with many issues in research and in Computer Science - for instance, averages over different benchmarks. It's often fine to benchmark two different algorithms on some data sets: if the algorithm's domain is simple enough, the variation across different inputs is small, and the input can be characterized by a small number of features. Think of sorting algorithms: the inputs you pass them can be random, Gaussian, ordered, reverse-ordered, and that's more or less it. OK, I'm oversimplifying a bit, I guess. But in other cases, the input space has a much higher number of dimensions.

And then, even worse, benchmarks comparing programming languages. Not only do we have all these problems about sampling the input space, but here the input spaces of two language implementations, for different languages, are not the same, and they don't match in any meaningful way: they are not isomorphic, just like spoken languages. In other words, there is no unique way of translating a program, and when you do translate one, there's no way to make the pieces match one-by-one. There are infinitely many ways to solve a problem in any Turing-complete language (and a not-so-small number of reasonable alternatives), and an intrinsically different infinity of ways in another language. And maybe your translation is too slow simply because it's not appropriate for that language, and you should write the same program in a more idiomatic style.
The same concept is expressed here in a less abstract way.

And in this situation, not only do we have actual benchmarks comparing programming languages, but even performance myths for or against Java, C#, Python, C, and so on - i.e., stereotypes, once again, about languages. And this time, mostly false ones.

For instance, consider people thinking that Java is slow and Python is more lightweight, while it's actually the other way around, as long as we speak of throughput. What those people think is only true of startup latency, and only partially of memory consumption (Python reclaims garbage more promptly because of reference counting, but its object layout could use huge improvements). And in this example we see that not only is the input space high-dimensional, but even the performance space cannot be characterized by a single number. To compare memory consumption, we would need to give the memory used as a function of the object set, for Java and for Python. And the object sets are, again, not isomorphic! We're done.

Trying to sum up too much any result is going to give us lies. We can't help it; we should stop asking for simple answers to hard questions, and for silver bullets, and for a lot of other easy things, and go instead to work hard and enjoy the results.