Tuesday, July 29, 2008

Stinking Paper Towel Dispensers

I wrote quite a while ago about the fact that they’d replaced the paper towel dispensers at my office with automated ones. At the time, I wasn’t happy about it.

Now that I’ve been living with them for a while, I’m even less happy. For some reason, they tend to run out of paper towels a lot more often than the manual ones used to. (Is it because they dispense more paper towel than is necessary? I don’t know.) So, many times I’ll go into the washroom, wash my hands—and maybe even splash some water on my face, which I do from time to time—only to find that I have nothing with which to dry myself.

This is a very petty complaint, I realize that. I’m not expecting pity.

Exercise—I’m still doing it!

I posted a while ago that I’d started exercising—crunches and push-ups—and I wondered how long I’d keep it up. It turns out I’ve kept it up for the last couple of months, which is better than I thought I’d do.

But I’ve now added something new to my regimen: the escalators in the office where I work are out of commission for the next twelve weeks, and my desk is on the fifth floor. Since the elevators are so slow, I’ve decided to start taking the stairs instead. It turns out that I’m very much out of shape, but I’m hoping that after a week or two of taking the stairs, I won’t be so winded after the climb anymore…

Cuil

A new search engine has been launched, named Cuil (pronounced “cool”), which is touting itself as a Google killer. Am I one of the first to write about it? Naw, I doubt it. Maybe one of the first ten thousand. Anyway, they claim to have a bigger index, and, therefore, a better ability to find search results.

I first read about it on Wired, in an article that seemed very positive, but Wired isn’t always that prescient. I thought I’d try it out; I’ve been writing a plugin for gedit, in Python, that I just can’t quite get working. (I don’t know if it’s because I don’t know Python—which I don’t—or because I don’t know the gedit/gtk programming model, which is really poorly documented.) I’ve done lots of searches in Google to try to find help on how to debug a plugin, and I’m seeing lots of example plugins and other stuff, but nothing that helps with my troubleshooting. So I pulled up Cuil, and started searching for things like “help writing a Python plugin for gedit” and similar phrases. And, just for fun, I also searched for “David Hunter”.

What I got was a lot of error messages, and very few search results. In fact, in my gedit-plugin-related searches, I never got any results. In my “David Hunter” searches, I did get a page of results, although I never found myself, and the page seemed to indicate that it was still loading, but never actually finished.

Now, I wasn’t worried about the error messages. It’ll take them a while to get their infrastructure firmed up, and get the kinks worked out. (One might have hoped that they’d have done that before launching, but hey, I’m a forgiving guy.) The thing is, even if the search results were great, even if I’d managed to find everything I was looking for—especially the gedit plugin stuff, since I wasn’t able to find a lot of help through Google—they’d still have an uphill battle trying to unseat Google. Google is more than search. I would go on about it further, but this post from another blog did a better job.

The fact is, Google is pretty darned good, and even if Cuil is better, it is probably only marginally better. If I couldn’t find what I was looking for, regarding my gedit plugin, it’s probably because what I’m looking for doesn’t exist.

Sunday, July 20, 2008

Beautiful Code: False Optimization Redux

This post is part of the Beautiful Code series.

Here we’re going to look at the concept that you should know your usage patterns, before you try to optimize for them. (This is pretty much just an addendum to the False Optimizations post.)

Andrew Kuchling wrote a chapter about Python’s dictionary implementation. Why is Python’s dictionary implementation beautiful? One reason is the way that it determines its size: the dictionary always keeps a number of spots open, for new insertions, and any time the dictionary gets too full (that is, too few empty spots), it quadruples in size, to make more room. And what happens when you remove an item from the dictionary? Well… nothing. A Python dictionary expands when necessary, but never bothers to contract.

And why is this? Because of the way that Python dictionaries are used:

This means that dictionaries are never resized on deletion. If you build a large dictionary and then delete many keys from it, the dictionary’s hash table may be larger than if you’d constructed the smaller dictionary directly. This usage pattern is quite infrequent, though. Keys are almost never deleted from the many small dictionaries used for objects and for passing function arguments. Many Python programs will build a dictionary, work with it for a while, and then discard the whole dictionary. Therefore, very few Python programs will encounter high memory usage because of the no-resize-on-deletion policy.

Honestly, I would have put this in the False Optimizations post, except that I’d already posted it by the time I started writing about this chapter. But the concept is the same: it seems like the right thing to do would be to have the dictionary expand and contract as necessary. But since most Python programs don’t spend a lot of time removing items from dictionaries, it’s not worth the effort to build that functionality into the Python dictionary implementation. In exchange for taking up a bit more memory, deleting from a dictionary is quicker (since no shrinking of the hash table needs to happen), and the dictionary implementation itself is easier to maintain for the people who build Python, because less code means easier-to-maintain code.
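You can actually watch this behaviour from ordinary Python code. Here’s a quick sketch (the exact byte counts depend on your Python version and platform, so don’t read anything into the specific numbers; the point is just that the size doesn’t go back down after the deletions):

import sys

d = {}
for i in range(10000):
    d[i] = i                     # force the hash table to expand several times
grown = sys.getsizeof(d)

for i in range(9990):
    del d[i]                     # now remove almost everything...
after_deletes = sys.getsizeof(d)

# ...and the table is still the same size: no resize on deletion
print(len(d), grown, after_deletes)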

Saturday, July 19, 2008

Beautiful Code: Runtime Type Checking

This post is part of the Beautiful Code series.

I have to admit, the title of this post might be misleading. Scratch that: It is misleading. This post is actually about a lack of runtime type checking. This quote is from Greg Kroah-Hartman, in a chapter about the Linux kernel driver model. He has explained some concepts, and given some code samples, and then he says this:

If you notice, there is no runtime type checking to ensure that the pointer that was originally passed as a struct device really was of the struct usb_interface type. Traditionally, most systems that do this kind of pointer manipulation also have a field in the base structure that defines the type of the pointer being manipulated, to catch sloppy programming errors. It also allows for code to be written to dynamically determine the type of the pointer and do different things with it based on its type.

The Linux kernel developers made the decision to do none of this checking or type definition. These types of checks can catch basic programming errors at the initial time of development, but allow programmers to create hacks that can have much more subtle problems later on that can’t easily be caught.

Whoa. In the Linux kernel driver model, in this instance, they don’t do runtime type checking. But what if a developer passes the wrong kind of struct? Well, if I may paraphrase Kroah-Hartman—perhaps incorrectly—if people are passing the wrong kind of struct, then they have bigger problems. Frankly, if the code did do runtime type checking, developers would probably just set the appropriate type field in the struct until they found a value that worked. They’d still have the problem; they’d just be masking it, and now it would be harder to track down. So why waste the kernel’s time doing runtime type checking, when it’s not really going to buy anything anyway?

Is this a lesson that we can apply in other areas? Maybe, although you’d have to be very careful about it. We’re not all writing Linux, folks. And, to be clear, the parts of the Linux kernel being discussed in this chapter aren’t really worked on by thousands, hundreds, or even dozens of people. Only a limited subset of people work on the Linux kernel driver model. That makes a difference too.

Beautiful Code: Computer-Generated Code

This post is part of the Beautiful Code series.

I saw two really wonderful examples of computer-generated code in Beautiful Code. First was a chapter by Charles Petzold—yes, that Charles Petzold—who was discussing image processing. Petzold’s example is fairly complex, so I’ll try to give a simpler example, to illustrate the general concept.

Suppose you have an array of integers, and for each integer, you want to add 1 to the number if it’s less than 10. A ridiculously contrived example, but it is simple enough to explain the concept. Now suppose your array includes 5 items, with the following values:

{1,16,72,2,4}
If you knew that ahead of time, then you could write code like this:
void someRidiculousFunction() {
    int[] anArray = {1, 16, 72, 2, 4};

    anArray[0]++;
    anArray[3]++;
    anArray[4]++;
}
We know that the first, fourth, and fifth elements in the array are less than 10, so they need to be incremented.

In real life, however, chances are our code wouldn’t know which elements are less than 10, and which aren’t. The code might not even know how many elements are in the array. So the code is more likely to look more like this:
void moreRealisticFunction() {
    int[] anArray = generateArray();

    for (int i = 0; i < anArray.length; i++) {
        if (anArray[i] < 10)
            anArray[i]++;
    }
}
We need a loop, to cycle through the elements of the array, and we need an if statement, to determine whether each element in the array is less than 10, before incrementing it. The first code sample just isn’t realistic; there are very few instances in real life when you’d be able to write code like that. Not only that, but for an array of any size, the second function will be much more readable and maintainable by future programmers. Having a function with a hundred lines like that, one after the other (as in the first example, if the array were bigger), would numb the mind.

But the thing is, someRidiculousFunction() would be more efficient than moreRealisticFunction()! If only we did know ahead of time, when we were writing the program, how many elements there would be in the array, and which (if any) needed to be incremented, then we wouldn’t need the loop, or the if statements.

Petzold presents a much more complex situation, having to do with processing images. Because an image (such as a JPG or GIF file) will potentially have millions of bytes, representing many, many pixels, processing each pixel to do things like adding blur effects can be a time-consuming process. Anything that can be done to streamline the process would yield impressive time savings for the program—but the issue is that you need all of those if statements and loops, to go through the image.

His solution is to use .NET’s ability to generate code on the fly. When you’re writing your program for manipulating images, you can’t necessarily write code like you could for someRidiculousFunction(), but you can have your program generate code that works that way, in which case you can avoid the tedious if statements and looping, and just generate mountains and mountains of procedural code. Since it is being generated at runtime, there is no worry about maintaining the code afterward, so you don’t need to worry about how pretty or understandable (or beautiful) that code is; you only need to worry about maintaining the code that generates that code. (Too meta for you? Read through it again. Or, better yet, buy the book, and read Petzold’s chapter.)
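Petzold does all of this in .NET, which I won’t try to reproduce here; instead, here’s a rough sketch of the same idea in Python (the function names and the toy array example are mine, not his). At runtime, once you know the data, you emit straight-line code with no loop and no if statements, compile it once, and then call it as many times as you like:

def build_incrementer(values):
    # Emit one "+= 1" statement per element that is less than 10.
    lines = ["def increment(arr):"]
    lines += ["    arr[%d] += 1" % i for i, v in enumerate(values) if v < 10]
    if len(lines) == 1:
        lines.append("    pass")         # nothing to do, but keep it valid Python
    source = "\n".join(lines)

    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["increment"]

data = [1, 16, 72, 2, 4]
increment = build_incrementer(data)      # generate the specialized function once
increment(data)                          # ...then run it as often as you like
print(data)                              # prints [2, 16, 72, 3, 5]

For five integers this is obviously silly; the point is the shape of the technique. The generated function has no per-element branching, which is exactly what Petzold’s image-processing version is trying to avoid.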

That’s at runtime, but what about at compile time? Most programmers are familiar with the C preprocessor and its ability to create macros that expand within the code at compile time, but most are also aware of its limitations. But that doesn’t mean that generating code at (or near) compile time is a bad idea. In a chapter by Diomidis Spinellis I came across a novel idea: use awk (or a similar text processing language) as your code preprocessor! (Note: Spinellis kept italicizing “awk,” so I am too. I don’t know if it really has to be italicized.)

Suppose you have highly repetitive pieces of code, repeating throughout your program. Instead of typing it out over and over again, you can include a simple notation instead, which can be processed by a language like awk, and expanded into real code, which would then be compiled. That is: write your source code, potentially involving a special notation; save your source code files; run those files through a program written in awk, which would process the special notations within your source code and generate more source code; save the new code into a new file; run the new file through the compiler, to create the program.

Doing this in awk instead of the C preprocessor has two advantages that I can think of:
  1. Using the C preprocessor limits you to the syntax available; it’s its own little language. However, with awk, or similar languages, you have the power of regular expressions at your fingertips, and can do some very complex text manipulation, if necessary.
  2. The C preprocessor is available to you if you’re writing code in C, or C++, and there are some similar things being added to later versions of Java for performing certain tasks, but using something like awk would work for any programming language you choose.
Let’s look at an example from Spinellis.

The handling of locking assertions deserves more explanation. For each argument, the code lists the state of its lock for three instances: when the function is entered, when the function exits successfully, and when the function exits with an error—an elegantly clear separation of concerns. …. The following code excerpt indicates that the rename call arguments fdvp and fvp are always unlocked, but the argument tdvp has a process-exclusive lock when the routine is called. All arguments should be unlocked when the function terminates.

#
#% rename fdvp U U U
#% rename fvp U U U
#% rename tdvp E U U
#

The locking specification is used to instrument the C code with assertions at the function’s entry, the function’s normal exit, and the function’s error exit. For example, the code at the entry point of the rename function contains the following assertions:

ASSERT_VOP_UNLOCKED(a->a_fdvp, "VOP_RENAME");
ASSERT_VOP_UNLOCKED(a->a_fvp, "VOP_RENAME");
ASSERT_VOP_ELOCKED(a->a_tdvp, "VOP_RENAME");
Now, the code starting with # characters isn’t actually C code. In fact, the C compiler would throw errors at those lines. But that doesn’t matter, because awk is going to replace those lines with real C code, which the compiler will like. In fact, for each line of #-prefixed code above, three lines of code would be inserted—tedious, error-prone code which has nothing to do with the actual logic the programmer was trying to accomplish. According to Spinellis, “In the FreeBSD version 6.1 implementation of the vnode call interface, all in all, 588 lines of domain-specific code expand into 4,339 lines of C code and declarations.”

If your code has a lot of repetitious, error-prone code in it, and if you have the ability to insert awk or a similar tool into your compilation process, this technique can save you not only time, but debugging effort, too. Then again, with modern IDEs—especially ones that compile your code as you work, but even ones that just warn you about syntax errors—you’ll also have to contend with the IDE constantly flagging the special notation as an error. One solution might be to put your special notations in comments, which the IDE will ignore, and have awk process those. For example, instead of
#
#% rename fdvp U U U
#% rename fvp U U U
#% rename tdvp E U U
#
you might do something like this, instead:
//#
//#% rename fdvp U U U
//#% rename fvp U U U
//#% rename tdvp E U U
//#
You just need to alter the regexes in your awk script to use //# instead of # when looking for lines to process.
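To make the pipeline a little more concrete, here’s a rough sketch of such a preprocessor, written in Python rather than awk, and handling only the entry-point assertion; the lock-state mapping and the file names are my own simplification for illustration, not Spinellis’s actual tool:

import re
import sys

# "U" means the argument must be unlocked on entry, "E" means it must hold
# an exclusive lock; these map onto the assertion macros shown above.
ASSERT = {"U": "ASSERT_VOP_UNLOCKED", "E": "ASSERT_VOP_ELOCKED"}

# Matches lines like:  //#% rename tdvp E U U
NOTATION = re.compile(r"^\s*//#%\s+(\w+)\s+(\w+)\s+([UE])\s+[UE]\s+[UE]")

def expand(lines):
    for line in lines:
        m = NOTATION.match(line)
        if m:
            func, arg, on_entry = m.groups()
            # Replace the notation with the entry assertion; the real tool
            # also instruments the normal and error exit paths.
            yield '%s(a->a_%s, "VOP_%s");\n' % (ASSERT[on_entry], arg, func.upper())
        else:
            yield line

if __name__ == "__main__":
    # e.g.  python expand.py < vnode_subr.src > vnode_subr.c
    # (these file names are made up for the example)
    sys.stdout.writelines(expand(sys.stdin))

Run as a step before the compiler, it reads source containing the //#% notation on standard input and writes plain C, with the notation lines replaced by assertions, to standard output.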

Beautiful Code: False Optimizations

This post is part of the Beautiful Code series.

One of the first concepts that jumped out at me from Beautiful Code concerned false optimizations—that is, something added to the code which should have been an optimization, but as it turns out, actually causes the code to run slower.

As an example, we’ll look at some code that implements a binary search, from a chapter written by Tim Bray:

package Binary;

public class Finder {
    public static int find(String[] keys, String target) {
        int high = keys.length;
        int low = -1;
        while (high - low > 1) {
            int probe = (low + high) >>> 1;
            if (keys[probe].compareTo(target) > 0)
                high = probe;
            else
                low = probe;
        }
        if (low == -1 || keys[low].compareTo(target) != 0)
            return -1;
        else
            return low;
    }
}
As mentioned, this function simply searches through a sorted array of strings, using the binary search algorithm, and sees if there is a match. I just quickly read through the code, and didn’t realize—until Bray pointed it out—that there is no explicit check within the loop to see whether the match has been made. One would expect him to add that in, as an optimization, so that when the target is found, the loop could exit, and not bother continuing with its checks. But as it turns out, that would be a false optimization; mathematically, the match is likely to be found late enough in the loop that doing the extra if statement on every iteration would be more of a performance hit than simply letting the loop run a few more times. Or, as Tim puts it:

Some look at my binary-search algorithm and ask why the loop always runs to the end without checking whether it’s found the target. In fact, this is the correct behavior; the math is beyond the scope of this chapter, but with a little work, you should be able to get an intuitive feeling for it—and this is the kind of intuition I’ve observed in some of the great programmers I’ve worked with.

Let’s think about the progress of the loop. Suppose you have n elements in the array, where n is some really large number. The chance of finding the target the first time through is 1/n, a really small number. The next iteration (after you divide the search set in half) is 1/(n/2)—still small—and so on. In fact, the chance of hitting the target becomes significant only when you’re down to 10 or 20 elements, which is to say maybe the last four times through the loop. And in the case where the search fails (which is common in many applications), those extra tests are pure overhead.

You could do the math to figure out when the probability of hitting the target approaches 50 percent, but qualitatively, ask yourself: does it make sense to add extra complexity to each step of an O(log2 N) algorithm when the chances are it will only save a small number of steps at the end?

The take-away lesson is that binary search, done properly, is a two-step process. First, write an efficient loop that positions your low and high bounds properly, then add a simple check to see whether you hit or missed.
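Bray’s code above is in Java; just to make the comparison concrete, here is my own paraphrase of the two variants in Python (this is not code from the book). Keep in mind that even for a million sorted keys the loop only runs about twenty times, so the equality test in the first version buys you, at most, a handful of skipped iterations:

def find_with_early_exit(keys, target):
    # The "obvious" optimization: check for a hit on every iteration.
    low, high = -1, len(keys)
    while high - low > 1:
        probe = (low + high) // 2
        if keys[probe] == target:      # an extra comparison every time around
            return probe
        elif keys[probe] > target:
            high = probe
        else:
            low = probe
    return -1

def find(keys, target):
    # Bray's approach: a tight loop, then one hit-or-miss check at the end.
    low, high = -1, len(keys)
    while high - low > 1:
        probe = (low + high) // 2
        if keys[probe] > target:
            high = probe
        else:
            low = probe
    if low == -1 or keys[low] != target:
        return -1
    return low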

Let’s look at another example. This is a snippet of code from Elliotte Rusty Harold’s chapter on XOM, used in verifying XML documents. After an XML namespace (which is a URI) has been verified, it is cached; when the parser encounters a namespace later on, it can look in the cache to see if that namespace has already been verified and, if so, not bother verifying it again. This class is the cache Elliotte uses.
private final static class URICache {

    private final static int LOAD = 6;
    private String[] cache = new String[LOAD];
    private int position = 0;

    synchronized boolean contains(String s) {

        for (int i = 0; i < LOAD; i++) {
            // Here I'm assuming the namespace URIs are interned.
            // This is commonly but not always true. This won't
            // break if they haven't been. Using equals() instead
            // of == is faster when the namespace URIs haven't been
            // interned but slower if they have.
            if (s == cache[i]) {
                return true;
            }
        }
        return false;
    }

    synchronized void put(String s) {
        cache[position] = s;
        position++;
        if (position == LOAD) position = 0;
    }
}
This code is fairly simple. It’s a helper class used to maintain a list of namespace URIs that have been verified, and a method which will tell you if a URI is in the list or not. The surprising thing is that the helper class uses a fixed-size array with exactly six slots, and if you try to add a seventh item, it simply overwrites the first one (and so on). Why not use a hash map, or a table? Something dynamic, which could grow as large as necessary? Why a statically-sized array?

Again, the answer is simple: performance. Elliotte Rusty Harold has seen enough XML documents to know that rarely will any document have more than six namespaces defined in it. So by using a statically-sized array, instead of something more dynamic, he gets a boost in performance in the majority of cases. In the case where a document does have more than six namespace URIs, validation will simply take a little longer for that document, because some namespace URIs may have to be verified twice. (If the parser comes across a URI which has already been verified, but has since been overwritten in the cache, then it would have to be re-verified. Of course, if the URI hasn’t been overwritten, it still won’t have to be re-verified.)

This was counter-intuitive for me, at first; it seems like the “Right Thing” would be to always make the size of your list dynamic, to deal with as many elements as are necessary. In fact, at first glance, it almost seems like a rookie developer’s mistake, to use a statically-sized list. But in this case, it was done on purpose, and for very good reason.

So again, he’s not doing something that he doesn’t have to do, even though, at first glance, it would seem like an optimization. Without thinking about it, I’d have made the size of the array dynamic, so that I’d never have to verify a URI that had already been verified, but Elliotte thought more deeply about it, and realized that making the code more complex in that manner would not have been an improvement. And he’s right; that array will be accessed much more often than it’s added to, so he was able to optimize for the majority of cases, with a small potential performance hit in the minority of cases.

Beautiful Code

As mentioned on the serna Book Blog, I recently read Beautiful Code, written by… well, written by a whole bunch of programmers who are smarter than I am. (Or better programmers, anyway.)

So I have devoted a series of posts to some of the concepts in the book that I found the most interesting. Here are the posts I included in this “series”:
  • Beautiful Code: False Optimizations
  • Beautiful Code: Computer-Generated Code
  • Beautiful Code: Runtime Type Checking
  • Beautiful Code: False Optimization Redux

And I’ll finish off with a quote from Brian Hayes that I found especially poignant—it describes my reaction to many of the concepts I read in this book:

In cartoons, the moment of discovery is depicted as a light bulb turning on in a thought bubble. In my experience, that sudden flash of understanding feels more like being thumped in the back of the head with a two-by-four. When you wake up afterwards, you’ve learned something, but by then your new insight is so blindingly obvious that you can’t quite believe you didn’t know it all along. After a few days more, you begin to suspect that maybe you did know it; you must have known it; you just needed reminding. And when you pass the discovery along to the next person, you’ll begin, “As everyone knows….”

Monday, July 14, 2008

Camping

Hmm, how long has it been since I posted? Really? That long? Ouch. I’ve been remiss. Me me.

I went “camping” this past weekend. There’s not much to say about it, but here are the highlights:

  • I’ve probably lost weight, because of the amount of blood removed from my body by mosquitos.
  • I got to know my god-daughter Alexis a bit better. Which was nice, because I don’t get to see her that often.
  • I sat on a bag of chips.
And that’s about it. I had a very nice, pleasant weekend, but Andrea should be glad she didn’t come; it wasn’t real camping, but it still would have been torture for her.

Friday, July 04, 2008

Karma?

I was glancing at my Ubuntu wiki this morning, and noticed a little icon next to my Wikidot “avatar” (which is part of my profile) that looks like some kind of rating. I clicked it, and found out that my “karma” is very high for some reason.

karma

So I wanted to know what this “karma” thing is. It turns out that Wikidot has started calculating this karma thing based on your activity on the site. According to their explanation page, ways you can raise your karma are:
  • Creating (and editing) a wiki, which I have done
  • Participating in forums, which I have not done
  • Participating in “community portals” (whatever they are), which I have not done
  • Being a wiki admin, which I have done—but it’s part and parcel of having created a wiki
  • Inviting others to join Wikidot, which I have not done
  • Having contacts and friends on Wikidot, which I have not done
  • Using AdSense on a wiki, which I have not done (although I did briefly toy with the idea)
So, of all of the ways one can raise one’s karma, all I’ve done is create and maintain a wiki, and yet they say my karma is very high. So one of three things is happening:
  1. I’ve done a lot of work on my wiki—so much so that my karma has gone through the roof
  2. There are few active wikis on Wikidot, meaning that my wiki seems more active by comparison. And keep in mind that I don’t maintain my wiki on a regular basis, so the other wikis would have to be very dormant.
  3. Wikidot calculates karma incorrectly, and is giving me more karma than I actually deserve

Wednesday, July 02, 2008

Back from vacation

As the title of this post mentions, I’m back from my vacation. It wasn’t too exciting, but it was a rest from work, which was sorely needed. Here are some highlights—actually, not even “some” highlights, this is pretty much everything that happened:

I started off with a quick trip back home, to go out to lunch with my mom (to celebrate her birthday), and then to go to my god-daughter’s 2nd birthday party. Unfortunately, Andrea wasn’t able to make it with me, because she got roped into doing a bit of work, at the beginning of her vacation.

Later on we went to Niagara Falls for a night, but our timing was bad, so we left Toronto during rush hour traffic, and came back to Toronto in rush hour traffic, too. (Luckily, we were going against the traffic for our return trip, so it wasn’t as bad.) We went to see the band A is A, who were playing at Fallsview Casino, but also just to spend a night in Niagara Falls. Unfortunately, we forgot to bring the camera, so I wasn’t able to get any really good pictures of the falls. Just these two from my cell phone (which are almost identical, and were taken from a window in my hotel):

IMAGE_023

IMAGE_024

However, I also managed to get this picture, which blows the lid off the whole waterfall racket:

IMAGE_025

See? The waterfall is a scam! If you go to Niagara Falls, don’t bother wasting your film taking pictures. It’s all just special effects or something.

Our next trip was to the States, to visit Andrea’s relatives. This involved more driving—ten hours’ worth, including a bunch of driving during Toronto’s rush hour—but I don’t really mind driving, and it was beautiful scenery, so that wasn’t bad. And of course Andrea’s relatives are great cooks, so that makes for a great trip, too. Unfortunately, we forgot to bring the camera again, so I have no pictures.

After this, Andrea went to Montreal for the Jazz Festival, and I stayed home in Toronto. Unfortunately, I don’t sleep very well anymore, when Andrea’s not with me, so I didn’t catch up on my sleep, as I’d been hoping to.

Finally, when she got back from Montreal, we took another trip, to go back to my parents’ place again.

And that’s it. It wasn’t really a restful vacation, per se, but I enjoyed all of the individual things we did.