Saturday, July 19, 2008

Beautiful Code: False Optimizations

This post is from the Beautiful Code series.

One of the first concepts that jumped out at me from Beautiful Code concerned false optimizations—that is, something added to the code which should have been an optimization, but as it turns out, actually causes the code to run slower.

As an example, we’ll look at some code that implements a binary search, from a chapter written by Tim Bray:

package Binary;

public class Finder {
public static int find(String[] keys, String target) {
int high = keys.length;
int low = -1;
while (high - low > 1) {
int probe = (low + high) >>> 1;
if (keys[probe].compareTo(target) > 0)
high = probe;
else
low = probe;
}
if (low == -1 || keys[low].compareTo(target) != 0)
return -1;
else
return low;
}
}
As mentioned, this function simply searches through some strings, using the binary search algorithm, and sees if there is a match. I just quickly read through the code, and didn’t realize—until Bray pointed it out—that there is no explicit check within the loop to see if the match has been made. One would expect him to add that in, as an optimization, so that when the target is found, the loop could exit, and not bother continuing with its checks. But as it turns out, that would be a false optimization; mathematically, the match is likely to be found late enough in the loop that the extra if statements during every execution of the loop would be more of a performance hit than simply letting the loop execute a few more times. Or, as Tim puts it:

Some look at my binary-search algorithm and ask why the loop always runs to the end without checking whether it’s found the target. In fact, this is the correct behavior; the math is beyond the scope of this chapter, but with a little work, you should be able to get an intuitive feeling for it—and this is the kind of intuition I’ve observed in some of the great programmers I’ve worked with.

Let’s think about the progress of the loop. Suppose you have n elements in the array, where n is some really large number. The chance of finding the target the first time through is 1/n, a really small number. The next iteration (after you divide the search set in half) is 1/(n/2)—still small—and so on. In fact, the chance of hitting the target becomes significant only when you’re down to 10 or 20 elements, which is to say maybe the last four times through the loop. And in the case where the search fails (which is common in many applications), those extra tests are pure overhead.

You could do the math to figure out when the probability of hitting the target approaches 50 percent, but qualitatively, ask yourself: does it make sense to add extra complexity to each step of an 0(log2 N) algorithm when the chances are it will only save a small number of steps at the end?

The take-away lesson is that binary search, done properly, is a two-step process. First, write an efficient loop that positions your low and high bounds properly, then add a simple check to see whether you hit or missed.

Let’s look at another example. This is a snippet of code from Elliotte Rusty Harold, from XOM, used in verifying XML documents. After verifying an XML namespace (which is a URI), it is cached; when a namespace is encountered, the parser can look in the cache, to see if it has already been verified, and, if so, not bother verifying it again. This class is the cache Elliotte uses.
private final static class URICache {

private final static int LOAD = 6;
private String[] cache = new String[LOAD];
private int position = 0;

synchronized boolean contains(String s) {

for (int i = 0; i < LOAD; i++) {
// Here I'm assuming the namespace URIs are interned.
// This is commonly but not always true. This won't
// break if they haven't been. Using equals() instead
// of == is faster when the namespace URIs haven't been
// interned but slower if they have.
if (s == cache[i]) {
return true;
}
}
return false;
}

synchronized void put(String s) {
cache[position] = s;
position++;
if (position == LOAD) position = 0;
}
}
This code is fairly simple. It’s a helper class used to maintain a list of namespace URIs that have been captured, and a method which will tell you if a URI is in the list or not. The surprising thing is that the helper class uses a static array, with exactly six items, and if you try to add a seventh, it simply overwrites the first one (and so on). Why not use a hash map, or a table? Something dynamic, which could grow as large as necessary? Why a statically-sized array?

Again, the answer is simple performance, because Elliotte Rusty Harold has seen enough XML documents to know that rarely will any document have more than six namespaces defined in it. So by using a statically-sized array, instead of something more dynamic, he gets a boost in performance in the majority of cases. In the case where a document does have more than six namespace URIs, validation will simply take a little longer for that document, because some namespace URIs may have to be verified twice. (If the parser comes across a URI which has already been verified, but has been overwritten in the cache, then it would have to be re-verified. Of course, if the URI hasn’t been overwritten, it still won’t have to be re-verified.)

This was counter-intuitive for me, at first; it seems like the “Right Thing” would be to always make the size of your list dynamic, to deal with as many elements as are necessary. In fact, at first glance, it almost seems like a rookie developer’s mistake, to use a statically-sized list. But in this case, it was done on purpose, and for very good reason.

So again, he’s not doing something that he doesn’t have to do, even though, at first glance, it would seem like an optimization. Without thinking about it, I’d have made the size of the array dynamic, so that I’d never have to verify a URI that had already been verified, but Elliotte thought more deeply about it, and realized that making the code more complex in that manner would not have been an improvement. And he’s right; that array will be accessed much more often than it’s added to, so he was able to optimize for the majority of cases, with a small potential performance hit in the minority of cases.

21 comments:

lorg said...

Regarding your second example, I'm not sure I agree about calling 'making the list dynamic' an added complexity.

Given a language like Python, using dynamic allocation is the default. Furthermore, using a set() instead of a list() is effortless.

Adam Hupp said...

@lorg: also remember that linear search is faster than set lookup until some surprisingly large N. At least in python it is.

l0b0 said...

Interesting article with good examples, but I can't help thinking that I'd mess up completely if I started doing this - After all, the code ends up being more counter-intuitive, even to an experienced programmer like yourself. For me it would probably be premature optimization in most cases, and it's especially hard since I'd have to do a lot of tests to see if there's really anything to be gained. For example, I'm not sure what is meant by "verifying an XML namespace", but if it takes a significant portion of a millisecond, I'm pretty sure it would be better to have either a much bigger cache (space is cheap and plentiful) or a dynamic structure. A possible optimization to the latter would be to use a dynamic array with geometric expansion. In any case, hard coding the cache size seems a poor way to utilize future hardware.

Anonymous said...

The real point is to put a detailed comment in the code to explain why this is done like that.

lorg said...

Adam Hupp:
I didn't argue that it isn't. In most cases, I would choose the best fitting concept (in this case, a set).

After discovering the code is need of optimization, and finding out that adding to the set and/or checking membership in it are costly, I will also consider other data structures.

Anonymous said...

The XML parsing example, I disagree with. The problem is that while the code is potentially faster in the typical case (not obvious, since a dynamically growing array should be the same speed), performance falls like a rock, unpredictably, on some datasets. This can lead to security issues (denial of service attacks) and other bugs. This would, in most cases, be bad code. You would either want less optimized code (use a hash table) if the performance didn't matter that much, or properly optimized code if the performance did matter (growing array, or small array with a hash table fallback).

sernaferna said...

Regarding the XML example, many of you have missed the point—which is not unusual, because many would prefer to comment than to think.

There are few in the world who would know XML documents better than Elliotte Rusty Harold, and even for those of us who don't see them as regularly as he does, the insight is a good one: There will be very few documents in the world that will have more than six namespaces in it. I'm sure you can come up with examples of ones that do, but the vast, vast majority of documents will have less than six.

And again, even if a document does have more than six, the potential hit in performance is still not going to break the bank; it just means a [potential] extra validation.

I agree with l0b0, in the sense that you never want to optimize anything unless you know what you're doing. As Elliotte Rusty Harold did, when he wrote his code—and which we're now attempting to pick apart, without his knowledge, in the comments section of my blog.

Anonymous said...

Where's the proof? If you want us to believe you on stuff like this we need to see some timings showing that there is in fact a gain and that gain is significant.

Anonymous said...

Not that in principle I don't agree with going with the simplest solution that works, but you're wrong on both of the examples.

1. Adding a branch would not be a performance hit, because any recent CPU has branch prediction and insturuction prefetch, etc. It is only a hit once when the prediction proves wrong and the loop needs to terminate. Still, that's less instructions than executing the loop a few more times.

2. Some people above mention that linear search is fast for small number of elements. Don't forget that this is a search within strings. String comparison in itself is a linear operation. So is string hashing. The correct solution, that is the most performant, would be one that avoids one of these. I'd go with a radix tree. Lookup is OMEGA(keylength), that is equivalent to a worst-case string comparison. And if there's no match, insert is constant time. And it's a struture that can grow dynamically, although limiting the size is not a bad idea. A new insert and a affix removal is again O(k)+O(1) time, not slower than with a static array.
And, it's good for cache locality!

Anonymous said...

Its false calim that it is optimized, since no hard data on speed in shown. If you want to optimize you have to measure the time before and after the optimization.

Since Finder is a Java-class you can not make any assumtion on what processor or JVM you are using. It could be a ARM-processor, CELL-processor and an X86-processor. What JVM? JRockIt or SUN-Jvm or Googles Android?

Anonymous said...

The URICache has synchronized boolean contains!!!
synchronized is the slowest keyword in Java. Involving locking and unlocking tying up alot of system resources. It just bad code to do use synchronized. Use java.util.concurrent if you really need to have thread safety. Which in this case seams to be useless.
Any code with a non-blocking-algorithm is much faster.

Anonymous said...

sernaferna: I get the point you're trying to make with the XML example, and it's a good one.

I think you've also fallen victim to commenting before thinking, though, in your claim that those criticizing it is missing the point.

I think the point that people are responding to is:

"Elliotte thought more deeply about it, and realized that making the code more complex in that manner would not have been an improvement".

People are taking issue with the idea that the linear search implementation is less complex than, say, this:

uri_cache = set()

Then you don't even have to write contains and put methods, you just use "if uri in uri_cache:..." or "uri_cache.add(uri)".

You might make a decent case that the algorithm the computer actually runs is simpler, but the idea that the code is less complex is inaccurate almost to the point of being offensive.

Oliver said...

@adam hupp: But the n is really tiny, at least with python 2.5. A __contains__ call on a list ist only faster, if the element is the first in the list. Otherwise a set is always faster. See http://paste.pocoo.org/show/79984/

sernaferna said...

Hi Vineet. You may be right; I'm quick to respond to the comments. In fact, you had me right up until the end, but the "inaccurate almost to the point of being offensive" made me laugh. ;-)

A number of the commenters on this post are suffering from missing the forest for the trees. Some of them are missing the point. In fact, ironically enough, some of the people who think that they're disagreeing with me are making the same point:

No optimization is universal, and you should know all of your context before you make an optimization. (In a related point, no optimization is a real optimization unless you can actually back it up with some numbers. Some of the commenters mentioned that, although for some reason they've assumed that Elliotte Rusty Harold—who, by this point, probably wishes his name was never mentioned in this post—never did any performance testing, which is a silly assumption to make.)

Anonymous said...

I do have a question about the XML example - why is the array sized at exactly six entries? I get that "most" XML docs have 6 or fewer namespaces, but if "most" of those exceptions have 10 or fewer, then making the array size 10 would be even better, at virtually no cost, since linear search is excellent at sizes of small integers.....

Wesley Walser said...

"1. Adding a branch would not be a performance hit, because any recent CPU has branch prediction and insturuction prefetch, etc. It is only a hit once when the prediction proves wrong and the loop needs to terminate. Still, that's less instructions than executing the loop a few more times."

I think the first example in this article is spot on.

Correct me if I am wrong, but if the branch prediction is wrong, the pipeline has to be cleared of those prefetched instructions creating some bubbles which is going to be a hit.

The author of the above quote went on to belabor us on the fact that we can't make assumptions about what JVM one is running on and yet assumes that the branch prediction algo on an, also unknown, processor is always going to be correct. What a joke.

To the author of this blog: I enjoyed the post, keep the good content coming.

Hazridi said...

Branch prediction should notice that it's in a loop and assume that the loop will continue. This is common practice in most branch prediction CPUs.

This basically removes the cost of the if() check while ensuring that early termination still helps performance, since the price of a stalled pipeline is pretty low since the loop is no longer executing.

I'm not missing the point of the article, I just think the blog author chose some bad examples of false optimization.

Anonymous said...

Optimization is in the eye of the compiler mostly.
Dynamic arrays, dynamic pointers, dynamic almost anything is a pure performance hit.
The use of static arrays is a nice touch because the address of the static array is inlinable whereas the dynamic stuff has to be indexed from the allocated resources list then looked up.
Unnecessary allocation is pure evil no matter what platform is used.
On a main stream machine the xml or net address stuff is of no concern, on a sub gigahertz athlon machine acting as a router it would make a difference and I would then add something like a rollover check post 'add to array' for excessive namespaces that would send some kind of message when ever it happened, Then if needed I could increase the static array size or go find the offender who thinks a pile is the only way to organize.
Just because its called a heap, doesn't mean it needs to be a messy one.

ltbarcly said...

@Adam Hupp: Make things up much?

In [1]: import random

In [2]: import string

In [3]: string.letters
Out[3]: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [4]: searchable = [''.join(random.choice(string.letters) for x in range(10)) for x in range(10)]

In [5]: searchable
Out[5]:
['sJXEfKvSjU',
'ywnVmKpKBG',
'QUbmTBGOCW',
'QkqaWRMpmT',
'VOuLIDfrAn',
'IdrcwZzxHV',
'NznLKoRvML',
'jWiBxDfLYi',
'lisIbRNAGw',
'hLedQvWRQx']

In [6]: %timeit 'VOuLIDfrAn' in searchable
1000000 loops, best of 3: 467 ns per loop

In [7]: searchableset = set(searchable)

In [8]: %timeit 'VOuLIDfrAn' in searchableset
1000000 loops, best of 3: 276 ns per loop

So unless '10' is 'some surprisingly large N', a set search for strings is faster. I think in most cases, once the set is constructed, N is 3. String comparison is SLOW compared to hash lookup.

Simon said...

I don't believe that commenters are missing the point by taking issue with the XML example: it's a blatant case of premature optimisation. The irony of it being presented as quite the opposite is hard to miss.

Perhaps Elliot benchmarked it and found it to be faster...but to the extent that anyone would ever notice in real life? Really? When you're parsing XML and repeatedly calling synchronized methods? I very, very much doubt it.

So yet again, we end up with this overcomplicated, confusing and slightly faulty code, in the name of optimisation.

It troubles me enough that you present this as good code, but it staggers me that you think that making the array dynamic, and not writing put() in the first place would be to add complexity.

Adam Hupp said...

I could have sworn I tested this at some point, I guess not. Thanks for the the correction.