O'Reilly Databases

oreilly.comSafari Books Online.Conferences.

We've expanded our coverage and improved our search! Search for all things Database across O'Reilly!

Search Search Tips

advertisement
AddThis Social Bookmark Button

Listen Print Discuss Subscribe to Databases Subscribe to Newsletters

Automating PostgreSQL Tasks

by Manni Wood
12/09/2004

You can and should automate routine database tasks, to remove the possibility of user error and to let yourself focus on other, more interesting work. If you use PostgreSQL on Unix or Linux, you can combine the shell with PostgreSQL's psql client to automate database chores.

Automating Database Creation

If you haven't yet created your database and you believe that you will be iteratively destroying and re-creating it as the project's needs become clearer, the first chore you should automate is the creation of the database itself.

I have the habit of creating a db_creation_scripts directory for my PostgreSQL projects, numbering the scripts inside according to the order in which they should execute. Here's a listing for the sample database I created to write this article:

prospero:db_creation_scripts$ ls
0_create_database_and_user.postgresql  1_create_main_tables.postgresql

A real project might have a few more scripts for pre-populating some of the tables or creating stored procedures, but our toy example is simple and needs just two scripts.

The 0_create_database_and_user.postgresql script is really a set of steps to follow manually. Take a look at it if you're interested in learning how I created the toy database used for the rest of this article.

The fun automation stuff begins with 1_create_main_tables.postgresql:

/* $Revision: 1.24 $ $Date: 2004/04/30 20:17:41 $
 
To create the bookstore database, run this file through psql like so:

prompt> psql -f 5_create_main_tables.postgresql bookstore bookstore_user
 
*/
 
drop sequence sales_row_num_seq;
 
create sequence sales_row_num_seq cycle;
 
drop table sales;
 
create table sales (
  row_num int not null default nextval('sales_row_num_seq'),
  first_name varchar(35) not null,
  middle_name varchar(35) null,
  last_name varchar(35) not null,
  city varchar(35) not null,
  state char(2) not null,
  zip varchar(5) not null,
  email_address varchar(50) not null,
  product_id int not null,
  product_name varchar(50) not null,
  product_sale_date date not null);
 
comment on table sales is 'Contains records of books sold from the bookstore.';
comment on column sales.row_num is 'For de-duping purposes.';

As the comment at the top of the file instructs, cd to the directory holding the file and run

$ psql -f 1_create_main_tables.postgresql bookstore bookstore_user

to create your database table.

Related Reading

Practical PostgreSQL
By John C. Worsley, Joshua D. Drake

If you decide that you should have broken the sales table into customer, product, and sales tables—certainly better relational design—just alter the file above and rerun the psql command to rebuild your database. This sort of automation is a real boon to the iterative design process early in a project. It also documents your database, and it makes it easy to re-create your database on a different host.

If you use CVS, you'll notice revision and date tags at the top of 0_create_database_and_user.postgresql. This is because I like to check my database creation scripts into CVS, or, lately, Subversion. It's a great way of keeping track of my database schema as it evolves, and it makes me more willing to experiment, knowing that I can recall earlier versions of the schema. It doesn't matter which version control system you use, of course. What matters is that you can.

Note that because I suggest in 0_create_database_and_user.postgresql to allow bookstore_user to log on without a password, the psql command runs without prompting for a password. Obviously, I have traded security for ease of use, especially for illustrative purposes in this article. You need to decide which balance you will strike.

Report Generation

When your database project is mature, you will likely be generating reports, often in formats that your clients or bosses can open in their spreadsheet programs. (The populate.postgresql file contains sample data that you might find useful while following along.)

If you want to capture the results to a file, use psql commands (they start with a backslash) in conjunction with SQL commands. You can log on to your database with psql bookstore bookstore_user and then execute the following commands:

\o report.txt
select product_id, product_name, product_sale_date from sales where 
product_sale_date >= '2004-10-01' and product_sale_date <= '2004-10-31';
\o

The resulting report.txt file, though easy on the eyes, is difficult to import into spreadsheets. You know that's what your bosses want.

Again, run psql bookstore bookstore_user. Then execute these additional formatting commands to produce something that will open in Excel or OpenOffice.org:

\o report.csv
\a
\f ,
select product_id, product_name, product_sale_date from sales where 
product_sale_date >= '2004-10-01' and product_sale_date <= '2004-10-31';
\o

If you want to study more of psql's formatting commands, read the psql man page. There's even support for HTML table output!

Who wants to type all these commands every month just to generate a monthly sales report, though? Put all of the above commands into a file called generate_monthly_sales_report.postgresql:

-- generate_monthly_sales_report.postgresql
-- 
-- Outputs one month's-worth of sales to a CSV file that
-- is easily imported into spreadsheet software.

\o report.csv
\a
\f ,
select product_id, product_name, product_sale_date from sales where 
product_sale_date *gt;= '2004-10-01' and product_sale_date <= '2004-10-31';
\o

Now you just have to remember to run generate_monthly_sales_report.postgresql using psql's -f flag. You could even just make a nice wrapper script for that:

#!/bin/sh

# gen_sales_report.sh
#
# Wrapper script for generate_monthly_sales_report.postgresql

psql -f generate_monthly_sales_report.postgresql sales_db sales_user

Of course, there's still the issue of having to edit the dates in generate_monthly_sales_report.postgresql, and wouldn't it be nice to be able to name the output file as well?

As it happens, psql supports the setting of named variables from the command line. These appear as :varname in SQL statements. Modify generate_monthly_sales_report.postgresql like this:

-- generate_monthly_sales_report.postgresql
-- 
-- Outputs one month's-worth of sales to a CSV file that
-- is easily imported into spreadsheet software.

\o report.csv
\a
\f ,
select product_id, product_name, product_sale_date from sales where 
product_sale_date >= :start_date and product_sale_date <= :end_date;
\o

Then modify gen_sales_report.sh:

#!/bin/sh

# gen_sales_report.sh
#
# Wrapper script for generate_monthly_sales_report.postgresql

psql --set start_date=\'2004-10-01\' \
     --set end_date=\'2003-10-31\' \
     -f generate_monthly_sales_report.postgresql sales_db sales_user

Note the escaping of single quotes to preserve them for the SQL query, where the dates will need them.

Of course, really we've just moved the problem to the wrapper shell script: we still have to edit the dates, it's just that now we're doing it in gen_sales_report.sh instead of generate_monthly_sales_report.postgresql. The true solution is to grab arguments from the command-line to reuse both scripts without ever having to edit them again.

#!/bin/sh

# gen_sales_report.sh
#
# Wrapper script for generate_monthly_sales_report.postgresql

psql --set start_date=\'$1\' \
     --set end_date=\'$2\' \
     -f generate_monthly_sales_report.postgresql sales_db sales_user

If you want to make your shell scripts really robust (and I encourage you to do so), you can write some validation code for $1 and $2 instead of trusting that the user (a) entered both command-line arguments and (b) entered correctly formatted dates for use in the SQL command. There's a fantastic book from No Starch Press called Wicked Cool Shell Scripts, by Dave Taylor, that has some great examples of argument validation using the shell. The book has raised my standards for shell script robustness. I strongly recommend it.

Data Feeds

Let us pretend that our toy database has a products table that has the following definition:

drop sequence products_row_num_seq;
 
create sequence products_row_num_seq cycle;
 
drop table products;
 
create table products (
  row_num int not null default nextval('products_row_num_seq'),
  id int primary key,
  name varchar(35) not null,
  description varchar(1024) null,
  price numeric(8, 2) not null);
 
comment on table products is 'Records of products sold at the bookstore.';
comment on column products.row_num is 'For de-duping purposes.';

Let us also pretend that for each new product that we sell, the marketing department provides us with a file in a standard format. We use that file to insert new products into the products table.

There are multiple approaches for inserting new data into a database.

One approach is to write software to parse the input file, and then use a connection library like Perl DB or JDBC to update the database. Many people do it this way, but as a matter of preference I don't much enjoy it. Configuring Perl DB or Java's JDBC always takes more time than I expect, and when the inevitable errors appear in the data feed, it's time-consuming to track down the source of the problem. (Think nested exceptions and tracking the line number of the input file.)

My favorite solution is to turn the data into a series of SQL insert or update statements that I put into a file. In turn, I feed that file to psql using its -f argument. This proves to be a very robust way of doing data input: psql already knows how to connect to PostgreSQL (so you don't have to tinker with Perl's DBI or Java's JDBC), and psql will report the exact line number of your SQL input file whenever an error occurs.

Also, it is trivial to add BEGIN/COMMIT SQL commands at the top and bottom of your input file to make the entire batch of inserts or updates a single transaction. This way, whenever an error occurs on line 3000 of your 5,000-line SQL command file, the whole transaction aborts (with a helpful error message, no less!), allowing you to fix the offending line and simply run the entire file again, confident that the last, successful run will be the only one that actually affects the database.

Let's pretend that the marketing department types up new product descriptions in a spreadsheet and saves to them to comma-separated value (.csv) files, which you must upload to the database. Here's a small sample file, products.csv, consisting of new products to add to the products table:

"id","name","description","price"
1001,"paperclips","box of 200",0.99
1002,"pencils","box of 10",1.99
1003,"birthday card",,2.99

The goal is to turn the .csv file into file called runme.postgresql containing commands like:

\set ON_ERROR_STOP 1
begin;
insert into products (id, name, description, price) values 
     (1001, 'paperclips', 'box of 200', 0.99);
insert into products (id, name, description, price) values 
     (1002, 'pencils', 'box of 10', 1.99);
insert into products (id, name, description, price) values 
     (1003, 'birthday card', null, 2.99);
commit;

I created the above file using the create_input_sql.pl Perl script, which you can use as inspiration for your own data upload scripts. The program is not as important as the idea that we can create a file of commands to feed to psql.

Our command file has two useful commands in addition to the insert statements. The first is where I set ON_ERROR_STOP to 1 (making it true), so that if there is a bad insert statement in the command file, psql will stop executing the rest of the file. The second is the set of begin/commit statements that wrap all the insert statements. This allows the update to execute as a single transaction, as discussed earlier.

Naturally, a shell script seems like a good way to pull this all together:

#!/bin/sh

# load_new_products.sh
#
# Wrapper script that transforms products.csv into runme.postgresql
# and feeds runme.postgresql to psql.

./create_input_sql.pl products.csv  # outputs runme.postgresql

psql -f runme.postgresql sales_db sales_user

One enhancement I recommend to the script, when used in the real world on thousand-plus-line files, is to pipe all of psql's output to a log file so that you have a record of what happened. Also, I've noticed while running my own considerably large input files that psql runs much faster when its output goes to a file rather than to a terminal.

#!/bin/sh

# load_new_products.sh
#
# Wrapper script that transforms products.csv into runme.postgresql
# and feeds runme.postgresql to psql.

./create_input_sql.pl products.csv  # outputs runme.postgresql

psql -f runme.postgresql sales_db sales_user 1> load_new_products.log 2>&1

"Can This Be Automated?"

I hope you found these automation samples useful. If this article has done its job, next time you have to do a repetitive task with PostgreSQL, you will ask yourself, "Can I automate this?" and you will have these samples to help you start.

Manni Wood leads teams at a Boston advertising company in building Java-based, database-backed web sites for clients like General Motors and FedEx.


Return to ONLamp.com.


What automations do you use? Contribute to a community toolkit here.
You must be logged in to the O'Reilly Network to post a talkback.
Post Comment
Main Topics Oldest First

Showing messages 1 through 2 of 2.

  • Some other helpful hints
    2004-12-12 17:37:48  pspinler [Reply | View]

    Good stuff. We do very similar stuff in our shop. Here's a couple of additional, possibly useful hints which we do:

    1) Make a script per entity.

    Assume that you're maintaining an already existing application, and you have to make partial changes to the database. A script per entity will allow you to change one entity at a time.

    2) Use a revision control system. Using CVS, or better, something like SVN or Darcs on the scripts will allow you a lot more flexibility.

    If you have e.g. a seperate development, test, and production database for an application, where you start a change in development, then migrate that change to test, and finally to production, here's a revision control system really shines.

    2.a) use tags. Tag specific script revisions with the environment, e.g. "cvs tag -r 1.6 TEST foo_table.sql"

    Then, to reconstruct the test database, pull everything with a TEST tag. When the next change gets migrated in, move the TEST tag to that revision of the script, DEVELOPMENT tagged scripts are what you'd use to rebuild development, etc.

    3) Don't be afraid to use the database provided load/unload programs to create your scripts and data files. For instance, "pg_dump -a -t <table>" will dump only data for table, and "pg_dump -s -t <table>" will dump only schema for table.

    Why, you ask ? Well, then you can let developers or other users make changes to schema or data, and after you review them, you can dump them to a file, and start migrating them to test or production databases.

    You could even set up a web or script based set of functions to dump out schema, check it into cvs/svn, and tag it; then another set of scripts / web pages to check out a set of specific tags, and migrate it to a new database or rebuild a development database.


    Anyway, hope folks find this interesting. good luck.

    -- Pat
  • postgres 8.0 imports CSVs natively
    2004-12-11 08:49:55  oicu [Reply | View]


    Note that in version 8.0 of postgresql, about to be released, you can import a CSV file directly using the COPY command.


Tagged Articles

Post to del.icio.us

This article has been tagged:

postgresql

Articles that share the tag postgresql:

Managing Many-to-Many Relationships with PL/pgSQL (17 tags)

Writing PostgreSQL Functions with PL/pgSQL (11 tags)

Datamining Apache Logs with PostgreSQL (8 tags)

Introducing Slony (8 tags)

Batch Updates with PL/pgSQL (8 tags)

View All

postgres

Articles that share the tag postgres:

Writing PostgreSQL Functions with PL/pgSQL (8 tags)

Introducing Slony (5 tags)

Automating PostgreSQL Tasks (4 tags)

Modifying Slony Clusters (3 tags)

Managing Many-to-Many Relationships with PL/pgSQL (3 tags)

View All

database

Articles that share the tag database:

MySQL FULLTEXT Searching (54 tags)

Live Backups of MySQL Using Replication (53 tags)

Advanced MySQL Replication Techniques (53 tags)

Dreaming of an Atom Store: A Database for the Web (49 tags)

How to Misuse SQL's FROM Clause (38 tags)

View All

automation

Articles that share the tag automation:

Automating Windows Applications with Win32::OLE (19 tags)

Home Automation with Mac OS X, Part 1 (15 tags)

Hacks for Smart Homes (14 tags)

Macintosh Home Monitoring (13 tags)

Automating Stylesheet Creation (9 tags)

View All

programming

Articles that share the tag programming:

Rolling with Ruby on Rails (1374 tags)

Very Dynamic Web Interfaces (279 tags)

Ajax on Rails (231 tags)

Understanding MVC in PHP (202 tags)

A Simpler Ajax Path (186 tags)

View All

Sponsored Resources

  • Inside Lightroom

Related to this Article

Understanding Oracle Clinical Understanding Oracle Clinical
by Joan M. Johnson
May 2007
$9.99 USD

Inside SQLite Inside SQLite
by Sibsankar Haldar
April 2007
$9.99 USD

Advertisement
O'Reilly Media

©2009, O'Reilly Media, Inc.
(707) 827-7000 / (800) 998-9938
All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
About O'Reilly
Academic Solutions
Authors
Contacts
Customer Service
Jobs
Newsletters
O'Reilly Labs
Press Room
Privacy Policy
RSS Feeds
Terms of Service
User Groups
Writing for O'Reilly
Content Archive
Business Technology
Computer Technology
Google
Microsoft
Mobile
Network
Operating System
Digital Photography
Programming
Software
Web
Web Design
More O'Reilly Sites
O'Reilly Radar
Ignite
Tools of Change for Publishing
Digital Media
Inside iPhone
O'Reilly FYI
makezine.com
craftzine.com
hackszine.com
perl.com
xml.com

Partner Sites
InsideRIA
java.net
O'Reilly Insights on Forbes.com