O'Reilly Databases

oreilly.comSafari Books Online.Conferences.

We've expanded our coverage and improved our search! Search for all things Database across O'Reilly!

Search Search Tips

advertisement
AddThis Social Bookmark Button

Listen Print Discuss Subscribe to Databases Subscribe to Newsletters

Top Ten Data Crunching Tips and Tricks

by Greg Wilson, author of Data Crunching: Solve Everyday Problems Using Java, Python, and More
06/09/2005

Every day, all over the world, programmers have to recycle legacy data, translate from one vendor's proprietary format into another's, check configuration files, and yank data out of web server logs. This kind of programming is usually called data crunching, and while it's not glamorous, knowing how to do it with the least amount of effort can make the difference between meeting a deadline and making another pot of coffee. These ten tips will take the headache out of crunching data.

1. Master the Classic Tools

In an age of 3D window managers and web-based everything, it's easy to forget that the Unix command-line model of pipes, filters, and redirection is still the best way to solve most data crunching problems. Almost everything you'll ever want to do to text data has been done before, so it pays to master the more powerful tools like cut, grep, sed, find, and xargs.

Similarly, you should learn how to use a programmable cross-platform editor, such as Emacs or Vim, since many simple crunching tasks can be solved simply by recording macros and playing them back. If you're working in a pure-Microsoft environment, learn enough Visual Basic to script the Visual Studio editor. (The easiest way to do this is to record a few simple operations, then look at the code Visual Studio has written for you.)

One thing I don't think it's worth learning any longer is advanced shell programming. When you have serious crunching to do, you're probably going to want to do at least part of it in Python, Ruby, or some other freely available scripting language, so you might as well start there in the first place.

2. Separate Input, Processing, and Output

You'll occasionally be tempted to solve simple data crunching problems with one-off scripts that transform each record as it's read. Don't give in---experience shows that "one-offs" usually hang around longer than their authors expect, and that simple transformations usually become more complex over time. Instead, always write data crunchers in three parts: one that reads the input into memory, a second that transforms the data, and a third that writes out the data structures. Programs structured this way are a lot simpler to debug, and their parts are easier to modify and reuse.

And while we're on the subject of reuse: every software project's version control repository should have a misc or tools directory, and every little cruncher you write should be saved there. If nothing else, the odds and ends you put there can serve as starting points for the next script you have to write.

3. Store Format Information in the Data Itself

Programs come and go, but data lives forever. Any time you're working with data that isn't in a well-documented standard format, you should therefore try to include format information in the data itself, so that it can't ever be lost. If you're working with binary data, store the format template for each record at the start of the data; for example, if you're archiving a database table whose rows contain an 80-character string, two integers, and a variable-length string, you might store this:

680c2iS
    

before the first record. The 6 tells whoever's reading the data that the format specifier is six characters long; those six characters tell the reader to look for 80 characters, two integers, and a null-terminated block of text.

Related Reading

Data Crunching
Solve Everyday Problems Using Java, Python, and more.
By Greg Wilson

If you're making up a new text-based format, allow for headers (like the meta tags in an HTML page's head section), or put format information in specially structured comments, so that whoever has to read your legacy files ten years from now will know how to parse them.

4. Understand International Character Sets

Do you speak Thai? Russian? Arabic? Thanks to the internet, the odds are good that at least a few of your users do, but if you mangle their data, they may not be your users for long. It's therefore essential for your data crunchers to handle international character sets correctly. In particular:

  • Make sure you know how data has been encoded; you can't read even the simplest text file unless you know whether it's in UTF-8, UCS-4, or some legacy format. And if you get to choose a format, always choose UTF-8: it encodes the original 7-bit ASCII characters using the values 0-127, just like ASCII did, which improves the odds that your output won't confuse tools (and programmers) who aren't yet thinking about these issues.
  • Don't assume that one character equals one byte (or two bytes, or four). Many character set encodings (particularly the most common, UTF-8) use a variable number of bytes for each character; one side effect of this is that strlen() and other functions from the standard C library (and everything that depends upon them) can no longer be trusted.
  • When writing regular expressions, use character blocks, like \p{InCyrillic}, and character categories, like \p{Lu} (meaning "lower-case letters"), since these will work no matter what language your text is in, or how it has been encoded.

5. Use Reluctant Matches and Graphical Tools

While we're on the topic of regular expressions, suppose you're trying to extract currency amounts from an email archive. Your first attempt uses the following regular expression:

^(.*)(\d+\.\d{2})(.*)$

which matches:

  • The start of the line
  • Zero or more characters
  • One or more digits, followed by a literal ., followed by exactly two digits
  • Zero or more characters
  • The end of the line

The problem is, when you apply this pattern to the text:

Charge $229.95, and get a receipt.

the middle group only matches 9.95. The reason is that regular expressions are greedy: each part of the pattern tries to match as much text as it can, leaving as little as possible for subsequent pattern elements.

Most regular expression (RE) packages now support reluctant (also called generous) patterns, which match as little text as possible. Each of the operators ?, *, and + has a reluctant counterpart: ??, *?, and +?. Making the first group in the pattern reluctant:

^(.*?)(\d+\.\d{2})(.*)$

solves our problem.

As well as illustrating reluctant matches, this example also shows why regular expressions can be so hard to debug: adding one character completely changes the way the RE behaves. This is why I'm very fond of graphical tools for building and debugging REs, such as Edi Weitz's Regex Coach, JRegexpTester, and the Rx Toolkit in ActiveState's Komodo IDE.

6. Nest and Subtract to Negate

Suppose you have a database that records which programmers are assigned to which projects, and you want a list of everyone who isn't working on version 2.1 of Frib. You might try this:

SELECT Assigned.EmployeeId
FROM   Assigned, ProjectInfo
WHERE  Assigned.ProjectId = ProjectInfo.ProjectId
AND    ProjectInfo.ProjectName != "Frib 2.1";

However, if Alan Turing is working on both Frib 2.1 and Snab 3.0, his name will appear in this query's results.

The right way to tackle this problem is to use a nested query to select the records you don't want, and then subtract them from the set of all possible records. It feels clumsy the first few times you do it, but---oh, who am I kidding? It always feels clumsy, but it's what you have to do.

Here's the complete query:

SELECT Assigned.EmployeeId
FROM   Assigned
WHERE  Assigned.EmployeeId NOT IN
      (SELECT Assigned.EmployeeId
       FROM   Assigned, ProjectInfo
       WHERE  Assigned.ProjectId = ProjectInfo.ProjectId
       AND    ProjectInfo.ProjectName = "Frib 2.1");

The nested query finds all employees who are working on Frib 2.1; the outer query then removes those IDs from the set of all employee IDs.

Depending on which database manager you're using, it may help to turn the nested query into a view. A view acts like a stored SELECT statement whose result can be treated as a temporary table. In this case, we could use:

CREATE VIEW WorkingOnFrob AS
SELECT Assigned.EmployeeId AS EmployeeId
FROM   Assigned, ProjectInfo
WHERE  Assigned.ProjectId = ProjectInfo.ProjectId
AND    ProjectInfo.ProjectName = "Frib 2.1";

SELECT Assigned.EmployeeId
FROM   Assigned, WorkingOnFrob
WHERE  Assigned.EmployeeId NOT IN WorkingOnFrob.EmployeeId;

7. Check for Holes

In the real world, data is never complete: there are always people who don't have a fax number or whose blood type is unknown. As a result, the data you're crunching may have holes. If you're lucky, these will be marked explicitly---many relational databases, for example, use NULL to indicate missing values. If you're unlucky, you'll have to figure out when an empty string or -1 means what it says, and when it means that whoever created the data didn't know what else to put.

It's particularly important to keep holes in mind when combining data. For example, if you've been asked to calculate the average sales price for a set of books, and some of them were given away as part of a promotional deal, you'll have to decide whether to count them as $0.00, or leave them out entirely.

The SQL standard tries to help programmers handle missing data by specifying that any operation involving NULL returns NULL as a result; i.e., 5 + NULL is NULL, as is (X < 0) OR (Y > 100) if either X or Y is NULL. Unfortunately, not all databases implement this correctly: in some, TRUE OR NULL is TRUE, while FALSE AND NULL is FALSE. Having been bitten by this a couple of times, I now check for records containing NULL before doing any other crunching.

8. Use XSLT (with Caveats)

XSLT is an easy way to specify simple transformations, but only when your input is highly structured (that is, where you don't need any information about context in order to interpret particular elements). However, the less structured your documents are--that is, the easier they are for human beings to type in and read---the more complex your transformation rules have to be, and experience shows that the difficulty of debugging XSLT transformations grows very quickly.

For example, transforming this:

<b1><t>XSLT is a simple way to solve simple problems</t>
  <b2><t>But complex scripts are fiendishly difficult to debug</t></b2>
</b1>
<b1><t>Options:</t>
  <b2><t>Put more structural information in your input.</t>
    <b3><t>But this makes it harder for people to create and read.</t></b3>
  </b2>
  <b2><t>Use a different tool.</t></b2>
</b1>

is almost trivial, since the text for each bullet point is wrapped in <t>...</t>. Take that away, so that whoever's creating the data has less typing to do:

<b1>XSLT is a simple way to solve simple problems
  <b2>But complex scripts are fiendishly difficult to debug</b2>
</b1>
<b1>Options:
  <b2>Put more structural information in your input.
    <b3>But this makes it harder for people to create and read.</b3>
  </b2>
  <b2>Use a different tool.</b2>
</b1>

and the XSLT transformation nearly triples in complexity.

9. Use XPath Without Using XSLT

XPath is a companion standard to XSLT, so you can use it without XSLT. XPath lets you identity parts of an XML document using a notation similar to that used for paths and filenames in the shell. For example, /project/* means, "All children of project elements," while ticket[@status='open']/author means, "Authors of tickets whose status attribute has the value 'open'." Several modern XML libraries, such as Fredrik Lundh's ElementTree package for Python, include XPath; data crunchers that use it are a lot easier to read (and write) than ones that walk trees themselves, and can sometimes be faster, too.

10. Check Your Work

If you don't know how to check your work, it's almost certainly wrong. Like everything you write, data crunchers should be tested. Sometimes, this is as simple as running it on the data you actually want to transform, and eyeballing the results; in more complex situations, you may want to select some subsets of your actual data, and make sure that each is handled correctly.

The most important thing, however, is figuring out how to tell the right answer from a wrong one. If you don't know enough about your input and output to be able to look at a specific case and say, "Wait a second, that's not right," then the odds are good that you don't know enough about your problem to write the transformation correctly. And please--when you do figure it out, write it down somewhere, so that the next time someone has to wrestle with the problem, it'll take them less time to figure it out.

Greg Wilson has been crunching data for more than 20 years. He is an independent programming consultant, an adjunct professor at the University of Toronto, and a contributing editor with Doctor Dobb's Journal. He is the author of Data Crunching and Practical Parallel Programming.


Return to ONLamp.com.


Any data crunching tips to add? Let us know.
You must be logged in to the O'Reilly Network to post a talkback.
Post Comment
Full Threads Newest First

Showing messages 1 through 12 of 12.

  • non-greedy regular expressions is the common term
    2005-06-09 21:51:35  wixton [Reply | View]

    not reluctant, or generous. regular expressions were popularized most by perl, and in that community and others) the terminology is non-greedy regular expressions. a search shows it's mostly java heads who are using this alternate terminology.
    • non-greedy regular expressions is the common term
      2005-06-22 05:52:42  GregWilson [Reply | View]

      Yes, but (a) there are more Java heads out there than Perl heads (even today), and (b) I believe the term "reluctant" pre-dates the first version of Perl by about twenty years. And let's face it --- if I used the term "non-greedy", some Java head would have complained (and some Lisp programmer would have pointed out that "non-greedy" is ambiguous, since there are N different ways to define pattern matching that aren't strictly greedy, all of which are implemented by the following set of mutually-recursive functions...)
  • "read the input into memory" considered harmful
    2005-06-14 08:28:21  johnsaalwaechter [Reply | View]

    I don't agree with point #2, especially the advice to read the input into memory. I work in an environment with large data sets, and I've seen many instances where a script that reads the input into memory is developed on small test data, then unleashed on gigabytes of real data. Either the process exceeds its 4GB address space (most perl executables are still 32-bit), or the server itself runs out of memory.

    Definitely using strategies other than "suck it all into memory" is required in many environments.
    • "read the input into memory" considered harmful
      2005-06-22 05:49:40  GregWilson [Reply | View]

      I agree, out-of-core algorithms that don't pull everything into memory at once are absolutely necessary in a lot of cases. However, the focus of the book was on automating odds-and-ends tasks, like pulling sales ranking data off Amazon and finding peaks and valleys. (Gosh, why would I be doing that...? ;-) If you can process your data record by record, that's great; if you can't, and your data won't fit in core, then what you have is a real programming task, rather than a one-off throwaway script.
  • In reply to "read the input into memory" considered harmful
    2005-06-14 08:55:44  jtaylor247 [Reply | View]

    What Greg suggested does not neccessitate a script loading all records into memory first. If you have three functions read(), transform(), and write() you can simply apply these in a read() transform() write() loop that loops once per record leaving only one record in memory at a time.
  • Data crunching? Crunching? Naw...
    2005-06-16 06:02:14  ScarletKnight [Reply | View]

    ... I'm a former mainframe programmer. You don't crunch anything close to data here. Those tips are definitely less than useful. Simple SQL tips for data crunching? Give me a break. I thought I was going to read about how to optimize your transforms for speed without losing maintainability, or even some references to offload techniques to stay out of the database when it doesn't make sense, i.e. knowing the limits of set-oriented transforms versus proecedural ones.

    Thumbs down.
    • Data crunching? Crunching? Naw...
      2005-06-19 23:33:36  alcabon [Reply | View]

      I m not agree with you at all. This article is a clear synthesis about the common techniques of data crunching. I was working a long time as a mainframe programmer (IBM MVS). Mainframe means large company and often high license costs for tools like SAS for example (150.000 euros/year). With SAS, you can also do data crunching (sql + SAS language) but you cannot write a large audience book saying : "you must use SAS because it is a good tool for data crunching". With unix, large company can also pay the license for Informatica for example (I used it also during months) for EDI especially but the cost is also prohibitive for small companies. I am working now in an simple unix-windows intranet environment (without SAS, Syncsort or Informatica) and Greg Wilson thinks straigth. I had started to use computer with a ZX81 16ko and 3,5 Hz now I ve got a Pentium 4-1024 Mo RAM and 120 go disk. Using memory is completely different in this condition. For offload techniques, it is a DBA problem related to your RDBMS. With Cloudscape (free and 2 Mo size footprint), I succeded import (csv files) and export with modifications of formats (using sql requests and functions) the data of millions of rows under some minutes using my simple PC. It is very easy to do. The most important for this work of data crunching with RDBMS is to write the straight sql code.
      Best regards.
      • Data crunching? Crunching? Naw...
        2005-06-19 23:52:23  alcabon [Reply | View]

        ... 150.000 euros/year is a possible license cost for some tools. It is not the cost license of SAS of course.
    • Data crunching? Crunching? Naw...
      2005-06-22 05:46:02  GregWilson [Reply | View]

      I'm sorry you think I'm not crunching "anything close to data" here. I routinely use these techniques to do one-time reformatting of gigabyte (and larger) scientific data sets. That may no longer be as impressive as it once was (I can still remember the first time I saw a file that was more than a megabyte long ;-), but as I say a couple of times in the book, my focus is on the things you can do in a few minutes to handle everyday tasks, not on what you'd do to run Google or American Express.
  • On the tools to learn
    2005-08-09 00:04:12  Paddy3118 [Reply | View]

    I do most of my datatransformation/crunching work on Unix.
    I do beleive you should learn the tools available but I would change the list:

    I have not found much use for cut over the years

    Because I'm that old, I first learned AWK and
    still use it extensively, even though I also
    know perl.

    If your writting out data then try and write it
    in a way suitable for an AWK program to read it.
    I've always found that AWK-able data formats are
    easy to parse in perl, python, tcl, bash, and C.
    (This assumes that you can set the data output
    format.

    A bit of context: I'm in the Electronic Design Automation field, where a lot of tools come with in-built tcl interpreters, data is often in indented lisp-like lists, AWK-able tables, and machine generated programming code in Verilog or VHDL.

    - Paddy.
  • The XSLT example is...not so good
    2006-02-17 11:07:35  R.Rossney [Reply | View]

    Greg says that the first example "is almost trivial, since the text for each bullet point is wrapped in <t>...</t>. Take that away, so that whoever's creating the data has less typing to do...and the XSLT transformation nearly triples in complexity."

    I don't see how. This:


    <xsl:template match="text()">


    is not three times as complex as this:


    <xsl:template match="t">


    In my experience, XSLT is not "fiendishly difficult to debug" at all, as long as you understand its processing model and follow good practices. It's certainly harder to hack something together in XSLT if you don't know what you're doing than it is in Perl. But if you have scaled its (nontrivial) learning curve, it's a very powerful and expressive language. And its freedom from side-effects makes your transforms remarkably stable and reliable.

    Just don't try to use XSLT to transform data that is only in XML because some clown decided to jump on a bandwagon. I'm thinking here of Microsoft's incredibly unusable Excel XML format. Or formats I've seen where someone takes newline-terminated text data and turns it into XML by sticking <record> at the start of every line and </record> at the end. Oh, that's useful.
  • Working on big data sets
    2009-08-11 01:15:34  Bozdag [Reply | View]

    It usually takes hours or maybe days if you want to crunch a huge volume of data. It is almost impossible to load everything on to memory. Then, what is the best practice to read and transform that data? Using a RDBMS or a special file format like HDF5 ?


Tagged Articles

Post to del.icio.us

This article has been tagged:

programming

Articles that share the tag programming:

Rolling with Ruby on Rails (1374 tags)

Very Dynamic Web Interfaces (279 tags)

Ajax on Rails (231 tags)

Understanding MVC in PHP (202 tags)

A Simpler Ajax Path (186 tags)

View All

data

Articles that share the tag data:

Top Ten Data Crunching Tips and Tricks (39 tags)

Data Munging with Sprog (7 tags)

Query Census Data with RDF (6 tags)

Massive Data Aggregation with Perl (5 tags)

Screenscraping the Senate (2 tags)

View All

tips

Articles that share the tag tips:

Top Ten Digital Photography Tips (165 tags)

Top Ten Mac OS X Tips for Unix Geeks (163 tags)

Top Ten Data Crunching Tips and Tricks (30 tags)

Ten Essential Development Practices (26 tags)

View All

linux

Articles that share the tag linux:

Managing Disk Space with LVM (74 tags)

Use Your Digital Camera with Linux (60 tags)

mdadm: A New Tool For Linux Software RAID Management (59 tags)

Asterisk: A Bare-Bones VoIP Example (43 tags)

View All

development

Articles that share the tag development:

Rolling with Ruby on Rails (579 tags)

What Is Web 2.0 (129 tags)

Ajax on Rails (119 tags)

Very Dynamic Web Interfaces (97 tags)

Understanding MVC in PHP (64 tags)

View All

Sponsored Resources

  • Inside Lightroom

Related to this Article

Understanding Oracle Clinical Understanding Oracle Clinical
by Joan M. Johnson
May 2007
$9.99 USD

Inside SQLite Inside SQLite
by Sibsankar Haldar
April 2007
$9.99 USD

Advertisement
O'Reilly Media

©2009, O'Reilly Media, Inc.
(707) 827-7000 / (800) 998-9938
All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
About O'Reilly
Academic Solutions
Authors
Contacts
Customer Service
Jobs
Newsletters
O'Reilly Labs
Press Room
Privacy Policy
RSS Feeds
Terms of Service
User Groups
Writing for O'Reilly
Content Archive
Business Technology
Computer Technology
Google
Microsoft
Mobile
Network
Operating System
Digital Photography
Programming
Software
Web
Web Design
More O'Reilly Sites
O'Reilly Radar
Ignite
Tools of Change for Publishing
Digital Media
Inside iPhone
O'Reilly FYI
makezine.com
craftzine.com
hackszine.com
perl.com
xml.com

Partner Sites
InsideRIA
java.net
O'Reilly Insights on Forbes.com