Saturday, July 19, 2008

Beautiful Code: Computer-Generated Code

This post is part of the Beautiful Code series.

I saw two really wonderful examples of computer-generated code in Beautiful Code. First was a chapter by Charles Petzold—yes, that Charles Petzold—who was discussing image processing. Petzold’s example is fairly complex, so I’ll try to give a simpler example, to illustrate the general concept.

Suppose you have an array of integers, and for each integer, you want to add 1 to the number if it’s less than 10. A ridiculously contrived example, but it is simple enough to explain the concept. Now suppose your array includes 5 items, with the following values:

If you knew that ahead of time, then you could write code like this:
function someRidiculousFunction() {
int anArray[10] = {1,16,72,2,4};

We know that the first, fourth, and fifth elements in the array are less than 10, so they need to be incremented.

In real life, however, chances are our code wouldn’t know which elements are less than 10, and which aren’t. The code might not even know how many elements are in the array. So the code is more likely to look more like this:
function moreRealisticFunction() {
int anArray[] = generateArray();

for(int i = 0; i < anArray.length; i++) {
if(anArray[i] < 10)
We need a loop, to cycle through the elements of the array, and we need an if statement, to determine if each element in the array is less than 10, before incrementing it. The first code sample just isn’t realistic; there are very few instances in real life when you’d be able to write code like that. Not only that, but for an array of any size, the second function will be much more readable and maintainable by future programmers. Having a function with a hundred lines after each other, like in the first example (if the array were bigger) would numb the mind.

But the thing is, someRidiculousFunction() would be more efficient than the moreRealisticFunction() would! If only we did know ahead of time, when we were writing the program, how many elements there would be in the array, and which (if any) needed to be incremented, then we wouldn’t need the loop, or the if statements.

Petzold presents a much more complex situation, having to do with processing images. Because an image (such as a JPG or GIF file) will potentially have millions of bytes, representing many, many pixels, processing each pixel to do things like adding blur effects can be a time consuming process. Anything that can be done to streamline the process would present an amazing time savings to the program—but the issue is that you need all of those if statements and loops, to go through the image.

His solution is to use .NET’s ability to generate code on the fly. When you’re writing your program for manipulating images, you can’t necessarily write code like you could for someRidiculousFunction(), but you can have your program generate code that works that way, in which case you can avoid the tedious if statements and looping, and just generate mountains and mountains of procedural code. Since it is being generated at runtime, there is no worry about maintaining the code afterward, so you don’t need to worry about how pretty or understandable (or beautiful) that code is; you only need to worry about maintaining the code that generates that code. (Too meta for you? Read through it again. Or, better yet, buy the book, and read Petzold’s chapter.)

That’s at runtime, but what about at compile time? Most programmers are familiar with the C preprocessor, and the ability to create macros that will expand within the code at compile time, but most programmers are also aware of the limitations of the C preprocessor’s abilities. But that doesn’t mean that generating code at (or near) compile time is a bad idea. In a chapter by Diomidis Spinellis I came across a novel idea: use awk (or a similar text processing language) as your code preprocessor! (Note: Spinellis kept italicizing “awk,” so I am too. I don’t know if it really has to be italicized.)

Suppose you have highly repetitive pieces of code, repeating throughout your program. Instead of typing it out over and over again, you can include a simple notation instead, which can be processed by a language like awk, and expanded into real code, which would then be compiled. That is: write your source code, potentially involving a special notation; save your source code files; run those files through a program written in awk, which would process the special notations within your source code and generate more source code; save the new code into a new file; run the new file through the compiler, to create the program.

Doing this in awk instead of the C preprocessor has two advantages that I can think of:
  1. Using the C preprocessor limits you to the syntax available; it’s its own little language. However, with awk, or similar languages, you have the power of regular expressions at your fingertips, and can do some very complex text manipulation, if necessary.
  2. The C preprocessor is available to you if you’re writing code in C, or C++, and there are some similar things being added to later versions of Java for performing certain tasks, but using something like awk would work for any programming language you choose.
Let’s look at an example from Spinellis.

The handling of locking assertions deserves more explanation. For each argument, the code lists the state of its lock for three instances: when the function is entered, when the function exits successfully, and when the function exits with an error—an elegantly clear separation of concerns. …. The following code excerpt indicates that the rename call arguments fdvp and fvp are always unlocked, but the argument tdvp has a process-exclusive lock when the routine is called. All arguments should be unlocked when the function terminates.

#% rename fdvp U U U
#% rename fvp U U U
#% rename tdvp E U U

The locking specification is used to instrument the C code with assertions at the function’s entry, the function’s normal exit, and the function’s error exit. For example, the code at the entry point of the rename function contains the following assertions:

Now, the code starting with # characters isn’t actually C code. In fact, the C compiler would throw errors at those lines. But that doesn’t matter because awk is going to replace those lines with real C code, which the compiler will like. In fact, for each line of #-prefixed code above, 3 lines of code would be inserted into the code—tedious, error-prone code, which has nothing to do with the actual logic the programmer was trying to accomplish. According to Spinellis, “In the FreeBSD version 6.1 implementation of the vnode call interface, all in all, 588 lines of domain-specific code expand into 4,339 lines of C code and declarations.”

If your code has a lot of repetitious, error-prone code in it, and if you have the ability to insert awk or a similar tool into your compilation process, this technique can save you not only time, but debugging effort, too. Then again, with modern IDEs—especially ones that compile your code as you work, but even ones that just warn you about syntax errors—you’ll also need to contend with code that your IDE keeps warning you about. One solution might be to include your special notations in comments, that the IDE will ignore, and have awk process that. For example, instead of
#% rename fdvp U U U
#% rename fvp U U U
#% rename tdvp E U U
you might do something like this, instead:
//#% rename fdvp U U U
//#% rename fvp U U U
//#% rename tdvp E U U
You just need to alter the regex’s in your awk script to use //# instead of # when looking for lines to process.


Anonymous said...

Regular expression is a good example here, too. The RE library will generate code at runtime that matches the pattern

David Hunter said...

Many apologies to Diomidis Spinellis, for misspelling his name in this post! It has now been corrected—hopefully before anyone noticed, or was offended.

Robert said...

This is a lovely concept, isn't it? I used to do a very limited version of this in an ancient procedural database called Superbase; it had a database language that had an 'execute' function that would run arbitrary text strings as commands to the database language. You could write modules that would then write customized mini-programs and run them, depending on program variables at runtime.

It made the language extremely flexible.

Anonymous said...

My apologies if this is a silly question -- I'm really not that good of a programmer -- but I fail to see the value in the first example. If the program is given an arbitrary image to process, the code generator will have to go through all of those same if statements and loops in order to generate someRidiculousFunction(), right? I mean, I can see how this could be useful if it could use the same process to generate multiple someRidiculousFunction()s that all used the same original loop (possibly with case statements instead of ifs), but the loop and ifs would still be there somewhere, or else the work couldn't get done. You're basically just moving the loops and ifs to an earlier point in the process.

I don't doubt that in some situations it is more efficient to create throwaway computer-generated code than to do it in a more traditional way. However, the example given does not seem like a good one.

David Hunter said...

Your point is correct; if all you’re doing is moving your loop to an earlier point in the process, you’re probably not buying yourself anything. For this technique to be useful, the amount of work to be done by the computer would have to be less, and/or you’d have to have good benchmarks to prove that you’re saving the computer time, not just moving instructions from place to place.

Generating code at runtime, in addition to being very strange to the eyes—meaning that you’ll have to document it well for future developers—would also only work in very isolated situations. The example that Petzold gave in the book, for the image processing, was detailed enough that I didn’t completely follow it, but he also had some very good numbers to back it up. (In fact, he documents that it’s a technique that goes all the way back to the BitBlt code in the early versions of Windows; instead of generating .NET code, of course, BitBlt was generating assembly. I don’t have the book in front of me, so I can’t give details about that, but Petzold shows how much of a performance gain it was for BitBlt.)

Unknown said...

The C pre-processor is actually available to other languages. I've used #defines to write macros to reduce the tedium of writing tight crypto code in Java. After running through gcc's preprocessor, I get legal Java as the output, and gcc doesn't complain that it's not valid C.

I believe m4 has been designed to provide macros for arbitrary languages, but I didn't feel like learning m4 macros.