Five Habits for Successful Regular Expressions
3. Group the Alternation Operator
The alternation operator (|) has a low
precedence. This means that it often alternates over more than the
programmer intended. For example, a regex to extract email addresses
out of a mail file might look like:
^CC:|To:(.*)
The above attempt is incorrect, but the bugs often go unnoticed. The intent of the above regex is to find lines starting with "CC:" or "To:" and then capture any email addresses on the rest of the line.
Unfortunately, the regex doesn't actually capture anything from lines starting with "CC:" and may capture random text if "To:" appears in the middle of a line. In plain English, the regular expression matches lines beginning with "CC:" and captures nothing, or matches any line containing the text "To:" and then captures the rest of the line. Usually, it will capture plenty of addresses and nobody will notice the failings.
If that were the real intent, you should add parentheses to say it explicitly, like this:
(^CC:)|(To:(.*))
However, the real intent of the regex is to match lines starting with "CC:" or "To:" and then capture the rest of the line. The following regex does that:
^(CC:|To:)(.*)
This is a common and hard-to-catch bug. If you develop the habit of
wrapping your alternations in parentheses (or non-capturing
parentheses -- (?:…)) you can avoid this error.
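The difference can be sketched in a few lines of Python. The header lines and addresses below are invented for illustration, and the helper name captures is hypothetical. The grouped pattern uses non-capturing parentheses so that the rest of the line stays in group 1.

```python
import re

# Hypothetical sample lines; the addresses are invented for illustration.
lines = [
    "CC: alice@example.com",
    "To: bob@example.com",
    "Subject: Reply To: the list",
]

ungrouped = re.compile(r"^CC:|To:(.*)")      # alternation spans the whole regex
grouped = re.compile(r"^(?:CC:|To:)(.*)")    # alternation limited to the prefix


def captures(pattern, text):
    """Return group 1 of the first match, or the string 'no match'."""
    match = pattern.search(text)
    return match.group(1) if match else "no match"


ungrouped_results = [captures(ungrouped, line) for line in lines]
grouped_results = [captures(grouped, line) for line in lines]

for line, loose, strict in zip(lines, ungrouped_results, grouped_results):
    print(f"{line!r}: ungrouped={loose!r} grouped={strict!r}")
```

The ungrouped pattern captures nothing (None) from the "CC:" line and picks up " the list" from the "Subject:" line, while the grouped pattern captures the rest of both address lines and rejects the "Subject:" line.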
4. Use Lazy Quantifiers
Most people avoid using the lazy quantifiers *?, +?, and ??, even though they are easy to understand and make many regular expressions easier to write.
Lazy quantifiers match as little text as possible while still
allowing the overall match to succeed. If you write
foo(.*?)bar, the quantifier will stop matching the first
time it sees "bar", not the last time. This matters if you are
trying to capture "###" in the text "foo###bar+++bar". A regular
(greedy) quantifier would have captured "###bar+++".
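As a quick check, here is the same example in Python (a minimal sketch; only the variable names are new):

```python
import re

text = "foo###bar+++bar"

# Lazy: stop at the first "bar" that lets the match succeed.
lazy = re.search(r"foo(.*?)bar", text).group(1)

# Greedy: run to the last "bar" on the line.
greedy = re.search(r"foo(.*)bar", text).group(1)

print(lazy)    # ###
print(greedy)  # ###bar+++
```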
Let's say you want to capture all of the phone numbers from an HTML file. You could use the phone number regular expression example we discussed earlier in this article. However, if you know that the file contains all of the phone numbers in the first column of a table, you can write a much simpler regex using lazy quantifiers:
<tr><td>(.+?)</td>
Many beginning regular expression programmers avoid lazy quantifiers by using negated character classes instead. They write the above regex as:
<tr><td>([^<]+)</td>
That works in this case, but leads to trouble if the text you are
trying to capture contains common characters from your delimiter (in
this case, </td>). If you use lazy quantifiers, you will spend
less time kludging character classes and produce clearer regular expressions.
Lazy quantifiers are most valuable when you know the structure surrounding the text you want to capture.
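To see where the negated character class runs into trouble, consider this Python sketch. The table rows are hypothetical, and the lazy pattern is assumed to end at the closing </td> tag. The second row wraps part of the number in a <b> tag, so the cell itself contains a "<".

```python
import re

# Hypothetical table rows; the second phone cell contains markup of its own.
html = (
    "<tr><td>314-555-4000</td><td>Office</td></tr>\n"
    "<tr><td><b>800</b>-555-4400</td><td>Sales</td></tr>\n"
)

# Lazy quantifier: take everything up to the first closing </td>.
lazy_cells = re.findall(r"<tr><td>(.+?)</td>", html)

# Negated character class: gives up as soon as the cell contains a "<".
negated_cells = re.findall(r"<tr><td>([^<]+)</td>", html)

print(lazy_cells)
print(negated_cells)
```

The lazy version returns both cells (the second with its markup intact), while the negated class silently skips the second row.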
5. Use Available Delimiters
Perl and PHP often use the forward slash to mark the start and end of a regular expression. Python uses a variety of quotes to mark the start and end of a string, which may then be used as a regular expression. If you stick with the slash delimiter in Perl and PHP, you will have to escape any slashes in your regex. If you use regular quotes in Python, you will have to escape all of your backslashes. Choosing different delimiters or quotes allows you to avoid escaping half of your regex. This makes the regex easier to read and reduces the potential for bugs when you forget to escape something.
Perl and PHP allow you to use any non-alphanumeric, non-whitespace
character as a delimiter. If you switch to a new delimiter, you can
avoid having to escape the forward slashes when you are trying to
match URLs or HTML tags such as "http://" or "<br />".
For example:
/http:\/\/(\S)*/
could be rewritten as:
#http://(\S)*#
Common delimiters are #, !, |. If you use square brackets, angle brackets, or curly braces, the opening and closing brackets must match.
Here are some common uses of delimiters:
#…#    !…!    {…}
s|…|…| (Perl only)    s[…][…] (Perl only)    s<…>/…/ (Perl only)
In Python, regular expressions are treated as strings first. If you
use quotes -- the regular string delimiter -- you will have to escape all of
your backslashes. However, you can use raw strings, r'',
to avoid this. A raw triple-quoted string can also span several lines;
compile it with the re.VERBOSE option so that the newlines and
indentation inside the pattern are ignored.
For example:
regex = "(\\w+)(\\d+)"
could be rewritten as:
regex = r'''
(\w+)
(\d+)
'''
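A minimal sketch of how the two forms behave once compiled follows. The sample input "order42" and the variable names are assumptions for illustration. Note that the multi-line pattern only works when re.VERBOSE is passed; without the flag, the leading newline and spaces are treated as literal characters to match.

```python
import re

# Ordinary string: every backslash has to be doubled.
escaped = re.compile("(\\w+)(\\d+)")

# Raw triple-quoted string: no doubling, and room for comments.
pattern_text = r'''
    (\w+)   # one or more word characters
    (\d+)   # one or more digits
    '''
verbose = re.compile(pattern_text, re.VERBOSE)

sample = "order42"  # hypothetical input
escaped_groups = escaped.match(sample).groups()
verbose_groups = verbose.match(sample).groups()

# Without re.VERBOSE the whitespace and comments are taken literally.
plain_match = re.compile(pattern_text).match(sample)

print(escaped_groups, verbose_groups, plain_match)
```

Both compiled patterns return the same groups, because the greedy \w+ gives back just one digit to \d+; the pattern compiled without re.VERBOSE does not match at all.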
Conclusion
The advice in this article focuses on making regular expressions readable. In developing habits to achieve this, you will be forced to think more clearly about the design and structure of your regular expressions. This will reduce bugs and ease the life of the code maintainer. You will be especially happy if that code maintainer is you.
I would like to thank Sarah Burcham for advice on this article. Also, thanks to Jeffrey E.F. Friedl for Mastering Regular Expressions. His book serves as the foundation for everything I do with regular expressions.
Tony Stubblebine is an Internet consultant and author of Regular Expression Pocket Reference.
O'Reilly & Associates will soon release (August 2003) Regular Expression Pocket Reference.
-
Lazy quantifiers efficiency
2004-07-14 08:31:16 Nelson Minar
Lazy quantifiers are great, but don't they cause the regex parser to have to backtrack resulting in all sorts of performance problems if you're not careful?
-
Even better way of validating phone numbers
2004-01-24 10:06:09 developers_coach
Here is a better way to validate phone numbers:
Don't!
As a resident of the UK I have on several occasions run foul of web site registration pages that insist on you entering a valid US phone number, regardless of whether you live in the US or not. I usually end up entering a totally spurious number just so that I can progress to the next screen. Even if someone is a resident of the USA, they may want to give you their mobile phone number, or a switchboard extension, or want to add additional information such as "after 6pm".
Unless you have a real cast-iron reason for ensuring that a phone number is valid, then checking for US number formats is both parochial and short sighted.
-
Whitespace not working on OS X - but works at ISP
2003-09-06 16:17:29 anonymous2
Hi there
I have tried to get it to work but I am getting problems on my Mac OS X installation of PHP Entropy 3.3.1-1, which I think is related to a configuration setting.
If I use the whitespace+commenting techniques described in this article it works perfectly at my ISP's installation as well as in my Zend Studio 3.0.0 BETA for Mac OS X - but NOT on my Mac OS X 10.2.6 machine.
If I remove all the blank space as well as the comments my code works as expected on my Mac, so I am assuming it's because of a misconfiguration - but I can't figure out what to change.
I have written a detailed explanation in this thread at Entropy's excellent forum, so I urge you to take a look at that thread instead of having me "type" it all again here ;-)
http://www.entropy.ch/phpbb2/viewtopic.php?t=446
Otherwise thanx for some great tips - now I only hope I can get them to work on my Mac OS X box!
Thomas von Eyben
-
Whitespace not working on OS X - but works at ISP FIXED
2003-09-06 17:45:36 anonymous2
Just to let everybody know that I just encountered the classic mistake of editing a file that originally had been encoded with Macintosh linebreaks instead of Unix linebreaks.
BBEdit is so nice as to respect those linebreaks even though I have set it up to only use Unix linebreaks (actually a pretty nice behavior).
Sorry for the noise (it might be a helpful hint in other scenarios)
- Thomas von Eyben (@ 2:45 AM in Denmark...)
-
Better US Phone Number Regex
2003-09-02 11:30:10 anonymous2
I got a couple of emails about how to improve the phone number regex above. Here's a better version based on something from the Perl Cookbook, 2ed.
#!/usr/bin/perl
my @tests = ( #These match
"314-555-4000",
"800-555-4400",
"(314)555-4000",
"314.555.4000",
"1 800-555-4000",
"1-800-555-4000",
#These Fail
"1 800 555 5555",
"(800) 555 5555",
"800 555 5555",
"8005554444",
"1888-555-5555",
"1-800.555.4000",
"1-800-555.4000",
"1 800 555-4444",
"1-800 555-4000",
"800.555-4000",
"555-4000",
"aasdklfjklas",
"1234-123-12345",
"(800-555-1212",
"800)555-1212",
"800)-555-1212",
"800555-1212",
);
foreach my $test (@tests) {
    if ( $test =~ m/
        ^
        (?:
            (?:
                1 (?: \s | ([-.]) )       # 1 followed by whitespace or a separator
                \d\d\d                    # followed by area code
                ( (?(1) \1 | [-.] ) )     # followed by separator
            )
            |                             # ... or ...
            (?: \(\d\d\d\) \s? )          # area code with parens
            |                             # ... or ...
            (?: \d\d\d ([-.]) )           # area code with separator
        )
        \d\d\d                            # prefix
        (?(2)                             # match separator from "1 followed by..."
            \2                            # clause
            |
            (?(3) \3 | [-.] )             # or match separator from "area code with
        )                                 # separator" clause
        \d\d\d\d                          # line number
        $
    /x ) {
        print "Matched on $test [ $1 : $2 : $3 ]\n";
    }
    else {
        print "Failed match on $test\n";
    }
}
-
US phone numbers
2003-08-29 12:20:26 anonymous2
Matching phone numbers can be tricky (it's a great exercise for 'fuzzy logic' via regexps in Perl). I've cut that Gordian knot in the past by sidestepping the formatting issue:
# remove anything nonnumeric
$origphone = $phone;
$phone =~ s/\D//g;
# beginning of string, (optional 3 digits) followed by (mandatory 7 digits), end of string
if ($phone =~ /^(\d{3})?(\d{7})$/) {
    ($areacode, $phonenumber) = ($1,$2);
    # XXX additional tests here to make sure areacode, phone don't start
    # with 0 or 1, or stick it into the regexp
} else {
    warn "Number ($origphone) doesn't look like a valid US phone number\n";
}
Krishna Sethuraman
-
more appropriately ...
2003-08-29 12:31:01 anonymous2
Not to take away from the main point of the article; it's an ugly practical problem I've dealt with in the past and I wanted to point out another option. Re-stated in the context of the article, please replace the appropriate line in my last post with:
if ($phone =~ m/^ # beginning of string
(\d{3})? # maybe an area code
(\d{7}) # definitely 7 digits
$ # end of string
/x) {
Krishna Sethuraman
-
Another test tool
2003-08-25 18:45:47 nhabedi
For Perl you might want to try "The Regex Coach", available from http://weitz.de/regex-coach/
-
Regtester
2003-08-25 10:31:20 anonymous2
A regtester specially designed for PHP is available at www.ssilk.de/PROJECTS/REGEX
After my holiday I will put online version 1.5.
-
PHP Code Flaw
2003-08-22 10:05:24 anonymous2
In your PHP example, this will not work:
$tests = ( "314-555-4000",
"800-555-4400",
"(314)555-4000",
"314.555.4000",
"555-4000",
"aasdklfjklas",
"1234-123-12345"
);
You must change it to:
$tests = array( "314-555-4000",
"800-555-4400",
"(314)555-4000",
"314.555.4000",
"555-4000",
"aasdklfjklas",
"1234-123-12345"
);
-
Amen to testing!
2003-08-22 07:39:48 jimothy
For a project I worked on, we needed the ability to parse conditional statements entered by the user (basically, name/value pairs like 'orderNumber = "123456"' or 'quantity >= 12'). Certainly, there are options other than regex to parse that, but regex seemed the simplest solution that fit our needs.
Since this was a Java project, we used the excellent ORO library (now part of the Apache Jakarta project) and JUnit to do our testing. I wrote unit tests both for conditions I expected to pass and for those I expected to fail, as this article suggests. This turned out to be a tremendous time saver (anybody who says unit testing slows development time down needs to take another look at it).
Here's the regex I ended up with:
^[ \t]*([a-zA-Z][\\w]*)[ \t]*(<=|>=|!=|=|>|<|LIKE(?= ))[ \t]*(([\\w]+)|\"(.+)\")
Perhaps some of this article's other suggestions would make that a bit easier to read, but without unit testing, I could never have come up with a successful regex to cover all the possible conditions in a reasonable time.






