Saturday, July 19, 2008

Beautiful Code: Computer-Generated Code

This post is part of the Beautiful Code series.

I saw two really wonderful examples of computer-generated code in Beautiful Code. First was a chapter by Charles Petzold—yes, that Charles Petzold—who was discussing image processing. Petzold’s example is fairly complex, so I’ll try to give a simpler example, to illustrate the general concept.

Suppose you have an array of integers, and for each integer, you want to add 1 to the number if it’s less than 10. A ridiculously contrived example, but it is simple enough to explain the concept. Now suppose your array includes 5 items, with the following values:

If you knew that ahead of time, then you could write code like this:
function someRidiculousFunction() {
int anArray[10] = {1,16,72,2,4};

We know that the first, fourth, and fifth elements in the array are less than 10, so they need to be incremented.

In real life, however, chances are our code wouldn’t know which elements are less than 10, and which aren’t. The code might not even know how many elements are in the array. So the code is more likely to look more like this:
function moreRealisticFunction() {
int anArray[] = generateArray();

for(int i = 0; i < anArray.length; i++) {
if(anArray[i] < 10)
We need a loop, to cycle through the elements of the array, and we need an if statement, to determine if each element in the array is less than 10, before incrementing it. The first code sample just isn’t realistic; there are very few instances in real life when you’d be able to write code like that. Not only that, but for an array of any size, the second function will be much more readable and maintainable by future programmers. Having a function with a hundred lines after each other, like in the first example (if the array were bigger) would numb the mind.

But the thing is, someRidiculousFunction() would be more efficient than the moreRealisticFunction() would! If only we did know ahead of time, when we were writing the program, how many elements there would be in the array, and which (if any) needed to be incremented, then we wouldn’t need the loop, or the if statements.

Petzold presents a much more complex situation, having to do with processing images. Because an image (such as a JPG or GIF file) will potentially have millions of bytes, representing many, many pixels, processing each pixel to do things like adding blur effects can be a time consuming process. Anything that can be done to streamline the process would present an amazing time savings to the program—but the issue is that you need all of those if statements and loops, to go through the image.

His solution is to use .NET’s ability to generate code on the fly. When you’re writing your program for manipulating images, you can’t necessarily write code like you could for someRidiculousFunction(), but you can have your program generate code that works that way, in which case you can avoid the tedious if statements and looping, and just generate mountains and mountains of procedural code. Since it is being generated at runtime, there is no worry about maintaining the code afterward, so you don’t need to worry about how pretty or understandable (or beautiful) that code is; you only need to worry about maintaining the code that generates that code. (Too meta for you? Read through it again. Or, better yet, buy the book, and read Petzold’s chapter.)

That’s at runtime, but what about at compile time? Most programmers are familiar with the C preprocessor, and the ability to create macros that will expand within the code at compile time, but most programmers are also aware of the limitations of the C preprocessor’s abilities. But that doesn’t mean that generating code at (or near) compile time is a bad idea. In a chapter by Diomidis Spinellis I came across a novel idea: use awk (or a similar text processing language) as your code preprocessor! (Note: Spinellis kept italicizing “awk,” so I am too. I don’t know if it really has to be italicized.)

Suppose you have highly repetitive pieces of code, repeating throughout your program. Instead of typing it out over and over again, you can include a simple notation instead, which can be processed by a language like awk, and expanded into real code, which would then be compiled. That is: write your source code, potentially involving a special notation; save your source code files; run those files through a program written in awk, which would process the special notations within your source code and generate more source code; save the new code into a new file; run the new file through the compiler, to create the program.

Doing this in awk instead of the C preprocessor has two advantages that I can think of:
  1. Using the C preprocessor limits you to the syntax available; it’s its own little language. However, with awk, or similar languages, you have the power of regular expressions at your fingertips, and can do some very complex text manipulation, if necessary.
  2. The C preprocessor is available to you if you’re writing code in C, or C++, and there are some similar things being added to later versions of Java for performing certain tasks, but using something like awk would work for any programming language you choose.
Let’s look at an example from Spinellis.

The handling of locking assertions deserves more explanation. For each argument, the code lists the state of its lock for three instances: when the function is entered, when the function exits successfully, and when the function exits with an error—an elegantly clear separation of concerns. …. The following code excerpt indicates that the rename call arguments fdvp and fvp are always unlocked, but the argument tdvp has a process-exclusive lock when the routine is called. All arguments should be unlocked when the function terminates.

#% rename fdvp U U U
#% rename fvp U U U
#% rename tdvp E U U

The locking specification is used to instrument the C code with assertions at the function’s entry, the function’s normal exit, and the function’s error exit. For example, the code at the entry point of the rename function contains the following assertions:

Now, the code starting with # characters isn’t actually C code. In fact, the C compiler would throw errors at those lines. But that doesn’t matter because awk is going to replace those lines with real C code, which the compiler will like. In fact, for each line of #-prefixed code above, 3 lines of code would be inserted into the code—tedious, error-prone code, which has nothing to do with the actual logic the programmer was trying to accomplish. According to Spinellis, “In the FreeBSD version 6.1 implementation of the vnode call interface, all in all, 588 lines of domain-specific code expand into 4,339 lines of C code and declarations.”

If your code has a lot of repetitious, error-prone code in it, and if you have the ability to insert awk or a similar tool into your compilation process, this technique can save you not only time, but debugging effort, too. Then again, with modern IDEs—especially ones that compile your code as you work, but even ones that just warn you about syntax errors—you’ll also need to contend with code that your IDE keeps warning you about. One solution might be to include your special notations in comments, that the IDE will ignore, and have awk process that. For example, instead of
#% rename fdvp U U U
#% rename fvp U U U
#% rename tdvp E U U
you might do something like this, instead:
//#% rename fdvp U U U
//#% rename fvp U U U
//#% rename tdvp E U U
You just need to alter the regex’s in your awk script to use //# instead of # when looking for lines to process.