Monday, March 29, 2010

Regular Expressions

I’m sure I’ve mentioned regular expressions (regex) on this blog before. I love ’em. (Note: If you’re not a computer nerd, you don’t need to know what regular expressions are, and can ignore this post. If you are a computer nerd, in any area of computer science, you definitely should know what regular expressions are. But… you can probably still skip this post.) Such a powerful technology, and it’s already built into most programming environments. (Or even the command line, if you use any operating system other than Windows. grep, anyone?)

However, much as I enjoy the power of regex, there is no doubt that the syntax is a little… opaque. For example, suppose you want to validate that an email address is in a “correct” format. You could write some code that does the following:

  • check for the presence of the @ character (there should be one and only one)
  • see if there are any dots (and whether or not those dots occur before or after the @, because there may or may not be some before, but there has to be at least one after—but a dot can’t be the last character)
  • check for special characters like the dash, and make sure it doesn’t come right before the @, or right before a dot. (It can exist, it just can’t exist in those special spots; e.g. you can have serna-ferna@somewhere.com but you can’t have sernaferna-@somewhere.com or sernaferna@somewhere-.com.)
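To make the comparison concrete, here is a rough sketch of what those hand-written checks might look like in Java. The class name ManualEmailChecks and the method looksValid are made up for illustration, and the code only covers the three rules listed above, nothing more.

```java
// Hypothetical helper (not from any real code base): hand-written versions
// of the three rules in the list above.
final class ManualEmailChecks {

    static boolean looksValid(String address) {
        // Rule 1: there must be one and only one @ character.
        int at = address.indexOf('@');
        if (at < 0 || at != address.lastIndexOf('@')) {
            return false;
        }

        String domain = address.substring(at + 1);

        // Rule 2: at least one dot after the @, and a dot can't be the last character.
        if (!domain.contains(".") || address.endsWith(".")) {
            return false;
        }

        // Rule 3: a dash can't come right before the @ or right before a dot.
        if (address.contains("-@") || address.contains("-.")) {
            return false;
        }

        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksValid("serna-ferna@somewhere.com"));  // true
        System.out.println(looksValid("sernaferna-@somewhere.com"));  // false
        System.out.println(looksValid("sernaferna@somewhere-.com"));  // false
    }
}
```

Even this toy version needs a few lines per rule, and it says nothing yet about which characters are allowed in the first place.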
And there are various other rules you might need to check. You’d probably need one or more lines of code to check each of these rules. Or you can just validate the address against a single regular expression, in one fell swoop. One line of code (in most programming environments), and you can do some very complex pattern matching.

For example, in Java, assuming we have a string called emailAddress with the address we want to validate, and a string called EMAIL_REGEX_STRING with our regular expression, we could do the following:
if(!emailAddress.matches(EMAIL_REGEX_STRING)) {
  // handle error
}
From a coding perspective, this is a lot simpler. With one line of code we can validate that email address, and the validation can be as complex as we want it to be. The regular expression can include all of the rules mentioned above, and more, all in one string.

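I bring this up because I was given just such an expression today, to validate an email address. It is meant to cover all of the rules mentioned above (although, as far as I can tell, it still lets a dash sit right before the @, because the dash is one of the characters allowed anywhere in the first part). Unfortunately, it looks like this: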

(?i)^[a-z0-9`!#\$%&\*\+\/=\?\^\'\-_]+((\.)+[a-z0-9`!#\$%&\*\+\/=\?\^\'\-_]+)*@([a-z0-9]+([\-][a-z0-9])*)+([\.]([a-z0-9]+([\-][a-z0-9])*)+)+$

Wow. Not so readable, eh? Just to understand it, I had to try and break it up, piece by piece, and figure out what’s going on. This is the result, with some pseudo comments in there:

(?i)                                // make the regex case-insensitive
^[a-z0-9`!#\$%&\*\+\/=\?\^\'\-_]+ // string must begin with 1 or more of the characters between the [ and ]
( // next section...
(\.)+ // one or more dots...
[a-z0-9`!#\$%&\*\+\/=\?\^\'\-_]+ // must be followed by one or more of the characters between the [ and ]
)* // ... section happens 0 or more times
@ // followed by an @ symbol
( // next section...
[a-z0-9]+ // one or more characters of a-z or 0-9
([\-][a-z0-9])* // optionally followed by dash-plus-one-character pairs (a single dash, then a single a-z or 0-9 character), 0 or more times
)+ // ... section happens 1 or more times
( // next section...
[\.] // a dot
( // followed by...
[a-z0-9]+ // 1 or more a-z or 0-9 characters
([\-][a-z0-9])* // optionally followed by dash-plus-one-character pairs (a single dash, then a single a-z or 0-9 character), 0 or more times
)+ // ... 1 or more times
)+ // ... section happens 1 or more times
$ // must end here


Still pretty bad. It’s no wonder that people take a look at regex syntax and decide they don’t have the time to learn it.
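For what it's worth, Java can accept a commented layout much like the breakdown above if the pattern is compiled with Pattern.COMMENTS, which ignores whitespace and treats # as the start of a comment. The sketch below is one way that might look; the class name CommentedEmailRegex and the LOCAL_CHARS constant are my own inventions, the (?i) prefix is swapped for the CASE_INSENSITIVE flag, and the # inside the character class is escaped so that it can't be mistaken for a comment.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: the same expression, laid out with real comments.
final class CommentedEmailRegex {

    // Characters allowed before the @. The # is escaped because, in comments
    // mode, an unescaped # may be read as the start of a comment.
    static final String LOCAL_CHARS =
        "[a-z0-9`!\\#\\$%&\\*\\+\\/=\\?\\^\\'\\-_]+";

    static final Pattern EMAIL_PATTERN = Pattern.compile(
          "^" + LOCAL_CHARS + "    # begins with 1 or more allowed characters\n"
        + "( (\\.)+              # one or more dots...\n"
        + "  " + LOCAL_CHARS + "  # ...followed by 1 or more allowed characters\n"
        + ")*                    # that section happens 0 or more times\n"
        + "@                     # followed by an at sign\n"
        + "( [a-z0-9]+           # 1 or more letters or digits\n"
        + "  ([\\-][a-z0-9])*    # then 0 or more dash-plus-one-character pairs\n"
        + ")+                    # that section happens 1 or more times\n"
        + "( [\\.]               # a dot\n"
        + "  ( [a-z0-9]+ ([\\-][a-z0-9])* )+   # followed by another name\n"
        + ")+                    # dot-plus-name happens 1 or more times\n"
        + "$                     # must end here\n",
        Pattern.COMMENTS | Pattern.CASE_INSENSITIVE);

    static boolean isValid(String emailAddress) {
        return EMAIL_PATTERN.matcher(emailAddress).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("serna-ferna@somewhere.com"));  // true
        System.out.println(isValid("sernaferna@somewhere-.com"));  // false
    }
}
```

It's longer, and the doubled backslashes don't help, but at least the comments live next to the pieces they describe.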

The worst part is, I think there are some mistakes in this expression, but I can’t even be sure! Can you really have a ` character or a dollar sign or an ampersand in an email address?!? Or am I even reading that right?
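One way to check a reading like this is to run the expression against a handful of addresses. That won't tell you what the spec allows, but it does show what the expression accepts. Here's a minimal sketch in Java; the class name EmailRegexDemo and the isValid helper are assumptions of mine, and every backslash is doubled because the expression sits in a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical test harness for the expression quoted in this post.
final class EmailRegexDemo {

    // The expression from the post, split across lines for readability only.
    static final String EMAIL_REGEX_STRING =
          "(?i)^[a-z0-9`!#\\$%&\\*\\+\\/=\\?\\^\\'\\-_]+"
        + "((\\.)+[a-z0-9`!#\\$%&\\*\\+\\/=\\?\\^\\'\\-_]+)*"
        + "@([a-z0-9]+([\\-][a-z0-9])*)+"
        + "([\\.]([a-z0-9]+([\\-][a-z0-9])*)+)+$";

    // Compile once and reuse, rather than recompiling on every call.
    static final Pattern EMAIL_PATTERN = Pattern.compile(EMAIL_REGEX_STRING);

    static boolean isValid(String emailAddress) {
        return EMAIL_PATTERN.matcher(emailAddress).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("serna-ferna@somewhere.com"));  // true
        System.out.println(isValid("sernaferna@somewhere-.com"));  // false
        System.out.println(isValid("sernaferna-@somewhere.com"));  // true
        System.out.println(isValid("a..b@somewhere.com"));         // true
    }
}
```

The last two results are the interesting ones: the dash belongs to the character class used before the @, so a dash right before the @ gets through, and because the dot group is (\.)+ rather than a single dot, two dots in a row get through as well.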

1 comment:

Paul said...

Yes, all those characters are valid per the RFC -- along with a few others, like { and }.

Email address validation is one of those things that SOUNDS easy, but in reality email addresses are defined by a horribly complex grammar that allows any number of non-intuitive and highly surprising constructs (like, say... comments. No, really).

As far as email validation regexes go this is actually pretty good -- I've seen some horrible ones that fail in dumb ways.

Things this does miss (ignoring the more subtle corners of the spec) include "me@localhost" and domains in punycode for internationalisation (with a leading "xn--" in the host part).

A regex to actually validate per the RFC is enormous -- this is covered in the O'Reilly 'Mastering Regular Expressions' book, and it takes up a page or more. I think this is it here: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html