You know what I love about open source? The tricks you learn in one activity can effectively transfer over to another one. It's like using the skills learned from mathematics on the dance floor. Take regular expressions, for example. Did you know that becoming an expert in using regular expressions in Perl also makes you a PostgreSQL power user?
Regular expressions, for the uninitiated, are a set of key combinations that allow people to search for character matches, thus enabling a large variety of control in a text environment. Regular expressions are a staple of the Unix-like operating systems, although they are virtually unknown in one notable exception (go ahead and guess). They are a common feature of many applications, including text editors, command-line utilities, and programming languages to search and manipulate bodies of text based on certain patterns.
My own experiences with regular expressions began with vi (actually, Vim).
PostgreSQL supports many forms of pattern matching: the traditional SQL LIKE operator, the more recent SIMILAR TO operator (added in SQL 1999), and the Posix-style regular expressions (also implemented in the pattern-matching function substring).
More than likely a database administrator will have cut his teeth by first learning some form of programming or, better yet, systems administration on his chosen platform before playing with a database engine. In other words, he likely has already seen regular expressions long before writing his first SQL statement.
What's the big advantage using regular expressions in PostgreSQL, besides the fact that you already know about regular expressions? Two words: power and speed.
Regular expressions offer the DBA the ability to compose queries using highly complex criteria in a compact and sophisticated manner. Better, they can make your statements execute faster. Regular expressions are simpler than the longer and, sometimes, more resource-intensive operators LIKE and SIMILAR TO. Many SQL statements also need to couple traditional operators with user-defined functions to accomplish the same query as a single statement using regular expressions.
To use regexes in PostgreSQL, you must have sufficient privileges in your installation to create and populate a table. Check the postgresql.conf configuration file; you need either advanced or extended permissions.
The rest of this article presents examples based upon two single-column tables. For the sake of brevity, I've omitted the results. Therefore review the query results with a replicated session on your machine.
First, create a new database containing a table:
CREATE DATABASE regex;
CREATE TABLE myrecords(record text);
From the psql command line, populate the myrecords table. (Note: you must include a backslash followed by a period on a separate line to end date input for the COPY command.)
COPY myrecords (record) FROM stdin;
a
ab
abc
123abc
132abc
123ABC
abc123
4567
5678
6789
\.
The simplest query uses the tilde operator, ~, followed by literal quoted characters. These examples return all records that contain specific character(s) as defined by the criteria:
SELECT record FROM myrecords WHERE record ~ '1';
SELECT record FROM myrecords WHERE record ~ 'a';
SELECT record FROM myrecords WHERE record ~ 'A';
SELECT record FROM myrecords WHERE record ~ '3a';
The addition of the asterisk, *, after the tilde renders the query search case-insensitive:
SELECT record FROM myrecords WHERE record ~* 'a';
SELECT record FROM myrecords WHERE record ~* '3a';
The exclamation mark (or "bang"), !, modifies the tilde operator and excludes strings containing the character(s):
SELECT record FROM myrecords WHERE record !~ '1';
You can combine case insensitivity and exclusion:
SELECT record FROM myrecords WHERE record !~* 'c';
The caret, ^, known as an anchor, returns records with the specific character(s) located at the beginning of the string:
SELECT record FROM myrecords WHERE record ~ '^1';
SELECT record FROM myrecords WHERE record ~ '^a';
SELECT record FROM myrecords WHERE record ~* '^a';
The dollar sign, $, is also an anchor but at the end of the string:
SELECT record FROM myrecords WHERE record ~ 'c$';
SELECT record FROM myrecords WHERE record ~ 'bc$';
SELECT record FROM myrecords WHERE record ~* 'bc$';
|
Unlike the previous examples, these next regular expressions carry out queries using a pattern defined in a "class," a list of characters enclosed by brackets, [ ]. Think of a class as a simplified way of expressing logical ORs within a SQL statement.
Study the results returned by the queries:
SELECT record FROM myrecords WHERE record ~ '[a]';
SELECT record FROM myrecords WHERE record ~ '[A]';
SELECT record FROM myrecords WHERE record ~* '[a]';
SELECT record FROM myrecords WHERE record ~ '[ac]';
SELECT record FROM myrecords WHERE record ~ '[ac7]';
SELECT record FROM myrecords WHERE record ~ '[a7A]';
SELECT record FROM myrecords WHERE record ~* '[ac7]';
What happens when you search for a character, the letter z, that is not present in any string in the table?
SELECT record FROM myrecords WHERE record ~ '[z]';
SELECT record FROM myrecords WHERE record ~ '[z7]';
This statement excludes all strings with the characters 4 OR a:
SELECT record FROM myrecords WHERE record !~ '[4a]';
The use of the dash, -, in the class implies a search range--that is, to return all strings having any digits between 1 and 4:
SELECT record FROM myrecords WHERE record ~ '[1-4]';
Here are a few more range examples:
SELECT record FROM myrecords WHERE record ~ '[a-c5]';
SELECT record FROM myrecords WHERE record ~* '[a-c5]';
SELECT record FROM myrecords WHERE record ~ '[a-cA-C5-7]';
The previous examples made pattern searches one character or class at a time. However, you can use many classes in a pattern. These three statements return the same string from the example table:
SELECT record FROM myrecords WHERE record ~ '3[a]';
SELECT record FROM myrecords WHERE record ~ '[3][a]';
SELECT record FROM myrecords WHERE record ~ '[1-3]3[a]';
These two statements also return the same records:
SELECT record FROM myrecords WHERE record ~ '[23][a]';
SELECT record FROM myrecords WHERE record ~ '[2-3][a]';
Always keep in mind the potential for confusion. This next returns nothing because there is no such character string, ac, in any of the records in the example table:
SELECT record FROM myrecords WHERE record ~ '[a][c]';
However, introducing a range for each class returns the records:
SELECT record FROM myrecords WHERE record ~ '[a-b][b-c]';
To create a class of characters to exclude, insert the caret, ^, immediately after the left bracket of a class. Inserting at any other point other than as the first character in the class means you're looking for the caret in a string rather than excluding the class' character pattern.
Notice that queries can still return strings containing those excluded characters. Class exclusions prevent explicit searches using those characters.
This next statement excludes all digits from 0 to 9 from the target search. In other words, this expression returns strings that don't include digits.
SELECT record FROM myrecords WHERE record ~ '[^0-9]';
The pipe | denotes alternation. In other words, it's a logical OR for pattern searches. Suppose you want to return all records with strings that begin with a or end with c:
SELECT record FROM myrecords WHERE record ~ '^a|c$';
This statement returns strings beginning either with a or 5, or ending with c:
SELECT record FROM myrecords WHERE record ~ '^a|c$|^5';
This next example performs a targeted search excluding digits and all lowercase letters. Control characters, spaces, and uppercase letters can all appear in the search pattern:
SELECT record FROM myrecords WHERE record ~ '[^0-9|^a-z]';
Sometimes you need to find strings that have repeated sets of the same character(s). You can find them with quantifiers--represented by the characters *, ?, and + as well as by digits enclosed within braces { }.
To find a sequence of 0 or more matches, use *:
SELECT record FROM myrecords WHERE record ~ 'a*'';
To find a sequence of one or more matches, use +:
SELECT record FROM myrecords WHERE record ~ 'b+';
To find a sequences of zero or one occurrence, use ?:
SELECT record FROM myrecords WHERE record ~ 'a?';
To find a sequences of exactly # matches, use {#}:
SELECT record FROM myrecords WHERE record ~ '[0-9]{3}';
To find a sequences of exactly # or more matches, use {#,}:
SELECT record FROM myrecords WHERE record ~ '[0-9]{4,}';
To find a sequences of # through ## (inclusive) matches, where # does not exceed ##, use {#, ##}:
SELECT record FROM myrecords WHERE record ~ '[a-c0-9]{2,3}';
|
As cool as regular expressions are for returning strings, they really shine when used for parsing text strings. PostgreSQL implements an overloaded substring() function that returns strings based on character position using integer values and, more importantly, by the use of regular expressions against strings and their substrings. You now have the awesome ability of returning formatted substrings from large, complex strings with very little coding instruction.
I'm going to demonstrate the power of regular expressions by generating a formatted query. The query will parse firewall syslog records and return the date, the time, the remote IP address, the originating source port, and the port that it's attempting to connect on the attacked server.
Create a single-column table called log:
CREATE TABLE log(record text);
Populate this table by executing this SQL file from your psql session:
\i log.sql
First things first, try to retrieve return one record using a generic regular expression. (Note that there is a space at the end of the class definition.)
SELECT substring(record, '[a-zA-Z0-9:. ]{1,}') FROM log LIMIT 1;
This next query returns a valid date stamp. The date function serves the double function of formatting the date as well as validating it, thus catching possible errors:
SELECT date(substring(record, '[a-zA-Z ]{1,}[0-9]{1,}') || ' 2005')
AS "Date" FROM log LIMIT 1;
Now test this new query to return the time:
SELECT substring('Nov 3 07:37:51 localhost', '[:0-9]{2,}') AS "Time";
But wait--it doesn't work if you were to add 0 to the 3, thus making the date November 30:
SELECT substring('Nov 30 07:37:51 localhost', '[:0-9]{2,}') AS "Time";
Edit the query to correct the problem by insisting that it looks for three sequence character patterns instead of two, which the day of the month will never be (one hopes, anyway):
SELECT substring('Nov 30 07:37:51 localhost', '[:0-9]{3,}') AS "Time";
If that works, test one real record:
SELECT substring(record, '[:0-9]{3,}') AS "Time" FROM log LIMIT 1;
People recognize the remote IP address by scanning the syslog line and locating the character string SRC= and the concatenated IP address. This query emulates the same process by looking for a sequence of at least two characters in length that includes a period and a digit. This sequence immediately follows the characters SRC=.
SELECT substring(record, 'SRC=*([.0-9]{2,})') AS "IP Address" FROM log LIMIT 1;
That's almost too easy! Repeat the process to obtain the IP address for the source port (SPT=) and destination port (DPT=), respectively:
SELECT substring(record, 'SPT=*([.0-9]{2,})') AS "Remote Source Port"
FROM log LIMIT 1;
SELECT substring(record, 'DPT=*([.0-9]{2,})') AS "Destination Port"
FROM log LIMIT 1;
Here's the complete SQL statement:
SELECT
date(substring(record, '[a-zA-Z ]{1,}[0-9]{1,}') || ' 2005') AS "Date",
substring(record, '[:0-9]{3,}') AS "Time",
substring(record, 'SRC=*([.0-9]{2,})') AS "Remote IP Address",
substring(record, 'SPT=*([.0-9]{2,})') AS "Remote Source Port",
substring(record, 'DPT=*([.0-9]{2,})') AS "Destination Port"
FROM log;
It's important to experiment with your expressions before putting them into a production environment; otherwise, you may be surprised by unintended results.
Guess what? There's still more that I've haven't covered! Among other things, PostgreSQL also provides class shortcuts, as per the Posix standard, including digits, alphanumeric characters, and control characters. PostgreSQL provides escape sequences for characters and Unicode too.
You'll find that the biggest hurdle using regular expressions is being comfortable with them, as they are so badly taught. Therefore, people often have to learn them without support. Sometimes they're lucky and they can live with a simple understanding, or perhaps they found a good book.
I hope you can consider this article as one such example of a good book.
Chapter 9.7 of the PostgreSQL manual, Functions and Operators, explains more about matching functions and string functions.
Robert Bernier is the PostgreSQL business intelligence analyst for SRA America, a subsidiary of Software Research America (SRA).
Return to O'Reilly Databases
Copyright © 2007 O'Reilly Media, Inc.