Category Archives: Code Generation

Code generation is a lot like compilation, except that the source isn’t a real programming language, and the target itself tends to be a high-level language.

So You Feel Lucky?

I’ve just finished reading Stephen Jay Gould’s excellent book Wonderful Life (again) and it got me thinking about random trees.

In case you haven’t read it, Wonderful Life is about the fossil bed known as the Burgess Shales which contains extrordinarily well preserved fossils of soft and hard-bodied animals from a period just after the so-called Cambrian Explosion. The Cambrian Explosion marked the period when the seas first “exploded” with an enormous range of large, multicellular animals with hard shells that preserve easily. In the 1980s a detailed re-evaluation of the fossils found in the Burgess Shales provoked a scientific revolution in paleontology, because it turns out that only a small percentage of those fossils have any direct living descendants, and many of them represent previously unknown phyla (basic types of animals.) This did not fit comfortably with the established notion of evolution as ordered progress, with the basic groups of animals established early on and forming a predictable lineage all the way from microbe to man at the pinnacle. Rather it paints the picture of extinction being the norm, and the survival of one group or another very much in the hands of chance and historical contingency. The book is not an argument against Darwinism but rather a re-evaluation of some of its finer points. Crudely put, it’s not arguing against the existance of a Tree of Life, just questioning what shape the tree is.

Anyway with that in mind, and the somewhat vague hand-drawn trees in the book leaving my curiosity piqued, I started wondering what any real evolutionary tree might look like. Of course it’s impossible to ever produce an algorithm that will accurately represent a real evolutionary sequence, so I thought to keep it very simple.

We start with a “first progenitor“. It has two choices: form two new species or die out.

Each new species has the same option at the next toss of the coin. That’s it. In perl it would look something like this:


So there’s a 1/2 probability that the thing will never get started, and you’re left with a stump rather than a tree. But with 2 children, there’s only a ¼ chance that they will both die out, and if they both survive then there are 4 grandchildren, and so on. This code has a definite probability of running forever.

It turns out that if you run this a large number of times, and add up the  number of each depth reached, you get a curve that asymptotically approaches zero at infinity:


The graph is normalized so the trees of depth zero come out at 0.5. The little kick at the right is those that reached the maximum depth in my test.

So what do these trees look like? I’ve given the game away by using a picture of one of them as the featured image for this post. As for generating the images, the excellent GraphViz comes to our rescue. With a little jiggery-pokery we can get the above perl code  to produce a .dot  file that we can feed to GraphViz and get a picture. I’ve extended the code to color nodes and vertices red if they are “survivors” (have descendents at the limiting depth) and black if they represent a species with no descendants. I’ve also changed the code to try again repeatedly until it generates a tree that reaches a limiting depth. Here’s a representative:


The limit was set at 60, so assuming 2 million years to create a species (I remember that figure from somewhere, I have a bad habit of throwing up unverified facts) this represents about 120,000,000 years of evolution from a single common ancestor. The interesting thing here I think is that the majority of branches don’t make it. Extinction is the norm, even for apparently large and flourishing branches. Apparently insignificant branches can suddenly flourish, and equally suddenly die out. I think this is close to Gould’s vision in general, if not in detail.

The other interesting thing is the huge variety of shapes. Some trees are wide, others are narrow, for example:


In this case all of the survivors share a common ancestor only a few generations (speciations) ago. This could easily be a model for the very earliest life, since the common ancestor of all current life on earth, who’s closest living relative is likely one of the Archaea, is far too complex to be a “first progenitor”.

I don’t know where I’m going with this from here, probably nowhere, but I think it’s interesting.

To finish off, here’s the full implementation of the tree generating code in case you want to try it yourself.  You can pick up GraphViz from and run it from the command-line (the commands are called dot , neato , circo  etc.) or via a gui.


A Slightly Non-Standard Application

Here’s an application of code generation that doesn’t fit the standard Enterprise Application mold: here I’m using code generation to build a class heirarchy for abstract syntax trees.

The idea is pretty simple. An abstract syntax tree is a tree containing the results of parsing a programming language. Nodes of the tree could be function calls, operator application or control flow constructs, and the leaves of the tree would be comstants and variables. Of course these trees cannot be constructed arbitrarily, it doesn’t often make sense to add an integer to a record for example, so the constructors for the nodes need to declare (or, in an untyped language, type-check) their arguments. Each node will furthermore need accessors and probably an accept() method for the Visitor pattern, so that is a substantial amount of boiler-plate code for something that is easily describeable at a very high level.

Here’s a snapshot of the AST definition from F♮:

This is a high level but complete description of the required classes, all of the details of implementation are merely repetitive, which makes it a perfect candidate for code generation.

So, given this description, and a couple of command-line options to cover class prefixes etc., my treebuilder (tbd) will generate all of the classes required. For example, here’s the generated class definition for Expr If(Expr test, Sequence consequent, Sequence alternative):

The point of all this is to be able to use these constructors directly in a yapp or similar grammar file. Here’s a snippet of these generated classes in action:

One of the visitors over the resulting tree is a rewriting visitor that converts some abstract syntax to simpler forms so that there are fewer constructs that need to be dealt with later on. Here’s the visitor routine that converts If abstract syntax to a function call: