Generating Database Server-Side Cross Tabulations
Pages: 1, 2
Side-by-side, Multiple Operation Cross-Tabulations
A crosstab can do much more. When I worked for a large international
organization, I was often asked to provide counts, summaries, and averages
in the same table. If you are doing this manually, it just means
cutting and pasting the column calculation lines in the query, replacing
COUNT with SUM or whichever function is suitable
(MIN, MAX, AVG, STD) and
adding an appropriate item to the column name.
Here is an example of such a table.
| country | location | count | sum | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pers | sales | dev | total | pers | sales | dev | total | ||||||||||||||
| f | m | total | f | m | total | f | m | total | f | m | total | f | m | total | f | m | total | ||||
| Germany | Berlin | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 5,500 | 0 | 5,500 | 0 | 0 | 0 | 0 | 6,000 | 6,000 | 11,500 |
| Bonn | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 5,000 | 5,000 | 0 | 0 | 0 | 5,000 | |
| Munich | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 5,000 | 5,000 | 5,500 | 0 | 5,500 | 0 | 0 | 0 | 10,500 | |
| Total | 1 | 1 | 2 | 1 | 1 | 2 | 0 | 1 | 1 | 5 | 5,500 | 5,000 | 10,500 | 5,500 | 5,000 | 10,500 | 0 | 6,000 | 6,000 | 27,000 | |
| Italy | Rome | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 6,000 | 6,000 | 0 | 0 | 0 | 6,000 | 0 | 6,000 | 12,000 |
| Total | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 6,000 | 6,000 | 0 | 0 | 0 | 6,000 | 0 | 6,000 | 12,000 | |
| UK | London | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 5,000 | 5,000 | 0 | 5,500 | 5,500 | 0 | 0 | 0 | 10,500 |
| Total | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 5,000 | 5,000 | 0 | 5,500 | 5,500 | 0 | 0 | 0 | 10,500 | |
| Total | Total | 1 | 3 | 4 | 1 | 2 | 3 | 1 | 1 | 2 | 9 | 5,500 | 16,000 | 21,500 | 5,500 | 10,500 | 16,000 | 6,000 | 6,000 | 12,000 | 49,500 |
Of course, we keep adding more and more lines to our query to achieve our goal. Even for our simple example database, using a three-level column header with all the available functions creates a query with more than 200 columns. That's not something you'd like to do manually.
Column-Value Interdependency
There is more to the crosstab complexity. The method we've seen for
combining multiple-level column values won't work well if two or more levels
are linked together by some dependency. Let's suppose we want to partially transpose one of the examples above, using the department and category as row
headers, and country and location as column headers. A brute-force permutation
of the values would produce impossible combinations, such as
Germany#Rome, Italy#London, or
UK#Munich, which are both a waste of calculation resources and a
nuisance in the result set.
For such cases, a different strategy is necessary. Instead of gathering the columns separately, we need to collect them all at once:
SELECT DISTINCT
country, loc AS location, gender
FROM
person
INNER JOIN locs ON (person.loc_id=locs.loc_id)
INNER JOIN countries ON (locs.country_id=countries.country_id)
ORDER BY
country, location, gender
+---------+----------+--------+
| country | location | gender |
+---------+----------+--------+
| Germany | Munich | f |
| Germany | Munich | m |
| Germany | Bonn | f |
| Germany | Bonn | m |
| Germany | Berlin | f |
| Germany | Berlin | m |
| Italy | Rome | f |
| Italy | Rome | m |
| UK | London | m |
+---------+----------+--------+
Each row in the result represents a combination of values to create a column in our query. If subtotals are required, then the result is first loaded into a tree, which is the main mechanism used by DBIx::SQLCrosstab to produce the necessary permutations with subtotals as needed.
With so many issues to consider, you can still do your queries manually. Perhaps, having good organizational skills and a powerful editor, you may be able to create your queries in a few minutes. However, if your application needs crosstabs on demand, as a data warehouse would, then you should consider using a specialized tool that can create queries at runtime.
Using DBIx::SQLCrosstab
To use DBIx::SQLCrosstab, you need to download it from CPAN and install it in your system. Its only prerequisite is Tree::DAG_Node. If you want some specialized output, then you may also optionally install YAML and Spreadsheet::WriteExcel.
A DBIx::SQLCrosstab object requires an argument containing the parameters to build the query, passed as a hash reference.
my $xtab = DBIx::SQLCrosstab->new($params)
or die "error building object ($DBIx::SQLCrosstab::errstr)\n";
For every method in the module that is returning something, you can check
the variable $DBIx::SQLCrosstab::errstr, which will contain the
reason for failure.
Templating Your Query
A few parameters are required to build our query. Let's create a set of them to build the crosstab seen in the above "Multi-level cross-tabulations" section.
my $params = {
dbh => $dbh, # a database handler
op => 'COUNT', # the operation to perform
op_col => 'id', # which column to count
from => qq{
person
inner join locs ON (locss.loc_id=person.loc_id)
inner join countries ON (countries.country_id=locs.country_id)
},
# ...
};
The database handler identifies the database where we operate.
op is the operation to perform, such as COUNT,
SUM, or AVG. op_col is the column to
count. It could be any non-null column for a count, or an appropriate numeric
column for summaries and averages. The from item is the same thing
you would indicate in the FROM clause in a SQL statement. You
need to indicate any INNER or OUTER join as
necessary, so that all of the values you later indicate for columns or row headers
are properly referenced. Next comes the definition of rows and column headers.
Each one is an array reference, as follows:
my $params = {
# ...
rows => [
{ col => 'country', alias => 'country' },
{ col => 'loc', alias => 'location' }
],
};
For each header, only one item (col) is mandatory, while
alias is just for readability purposes. Let me remind you that such
columns are related to the from item, which must have the
appropriate JOIN parts to include the table containing such
columns.
$params = {
# ...
cols => [
{ id => 'dept_id', value => 'dept', from => 'depts' },
{ id => 'cat_id', value => 'category', from => 'categories' },
{
id => 'gender',
from => 'person'
col_list => [
{ id => 'f', value => 'f' },
{ id => 'm', value => 'm' }
],
}
],
};
For column headers, the mandatory fields are id and
from. Any value is used to create the final column
name. You may notice that while depts and categories are in the
column list, they are not in the main from description. This is an
optimization that is further described in "Inner JOINs" later. The description
of gender has another peculiarity. It provides a list of values to
be used by the engine without querying the database. This is also an
optimization, useful for such cases when you know the values in advance (for
example, when a CONSTRAINT in your query limits the values to a
given set).
Now, to get the exact result as our example below, we need to add a few optional parameters:
$params = {
# ...
col_total => 1, # adds a total column at the right end of the table
col_sub_total => 1, # adds sub-totals columns where appropriate
row_total => 1, # adds a grand total at the bottom end of the table
row_sub_total => 1, # adds sub-totals rows where appropriate
remove_if_zero => 1, # remove any column with all zeroes
add_colors => 1, # display the table with default colors
commify => 1, # adds thousand-separating commas to numbers
};
Column-Header Choice Strategies
Spending a few minutes at the drawing desk before actually writing your query parameters could spare you time and headaches afterwards. If you are at liberty to choose, try to designate those columns with larger numbers of values as row headers, since the DBMS will handle them better. Assigning them as row headers will increase the number of final columns to generate, requiring more calculation effort and making the table less readable. It's also worth it to assign columns as row headers when the column depends on other columns. Although the module can handle these cases just fine, the strain on the DBMS is higher, because in several cases, you need to query the main data set twice: once to find the value and once to get the final results.
Sometimes, though, you need a JOINed query to find the
appropriate values and save resources. Consider the case of a lookup table for
categories, where you have a few hundred values. If you only need to use a
dozen, it would be better to define the column header with a from
clause, such as:
$params = {
# ...
cols => [
{
id => 'cat_id',
value => 'category',
from => qq{
categories
INNER JOIN person
ON (categories.cat_id=person.cat_id')
}
},
# ...
],
};
This way, only the relevant values are chosen, sparing the DBMS engine the useless calculation of a few hundred unused categories.
When the values of a column header come from the main data set, you may
consider running the query once, finding the distinct values, and then
providing them as a col_list for the subsequent runs. This
practice is also useful when you only need to crosstab a specific subset.
Fine-Tuning Your Query
There are many parameters that can be passed to your DBIx::SQLCrosstab object, and you can use them to keep control of what is going to happen.
The most important options are where and having.
They will limit the recordset to the records you want. There is no limitation
to what you can require in these options, provided that they are legal SQL
syntax that apply to the tables being queried. You may want, for instance, to limit your query
for employees to the ones hired in the last month. If there is a field
identifying the start of contract date, and your database engine supports date
calculation, you can add a where clause saying "start_of_contract
>= now() - interval 30 days" or simply "start_of_contract >=
'YYYY-MM-DD' ", depending on the particular SQL dialect of your DBMS. With
having clauses, you can limit the result set by stating which
criteria the aggregated values must have, so you can say something like "having
=> pers > 10000" to define that you want only those rows where a sum of
salaries for pers is bigger than 10000.
$params = {
# ...
where => 'start_of_contract >= now() - interval 30 days',
having => 'pers > 10000',
};
When defining the column headers, we have already seen that we can explicitly define a list of values. Or we can explicitly exclude one or more values, to instruct the engine to retrieve all of them from the database, minus the ones that we list:
$params = {
# ...
cols => [
{ id => 'dept_id', value=> 'dept',
exclude_values => [ 'dev', 'pers' ]
},
#...
]
};
If we modify the previous request with this definition for the
dept header, the engine will only retrieve values for
sales department. Another method of excluding columns is to use
the general option col_exclude, which removes one column from the
query. You need to figure out how the query names would be created, and then
you may say:
$params = {
# ...
col_exclude => [ 'pers#contractor#f', "sales#consultant#m" ]
# ...
};
These rows were present in our example query, but their results were thrown out because all the values were zeroes. If we know that for sure, then we can improve the query execution speed by removing them in the first place.
To exclude a value for a row header, use a where clause. For
example, "where => country != 'Germany'".
Keeping an Eye on Performance Issues
When using a wrapper module to develop database applications, you usually trade ease of use for performance. DBIx::SQLCrosstab is not a true wrapper, because it is not a replacement for the whole database interface. Instead, it's a query builder — a very specific one, but a builder nonetheless. Most builders are peculiar in that they create queries according to some general rules, and the resulting statements often perform poorly. This module is no exception. You can use all of the options offered, but you need to understand the possible drawbacks.
A sensitive issue surrounds subtotals. Regarding column headers, a subtotal is not much of a problem, since it is only a few more calculated fields in the query. If your query grows to several hundred columns, though, it may become a problem. Keep this in mind. Also consider the purpose of your result. If you need to feed the result to a chart generator, you probably don't need subtotals. They could be misleading, anyway, if they are treated as normal columns and therefore generate wrong sums. The same principle applies if you want to export your results to a spreadsheet to perform further calculations.
Row totals and subtotals are a different case. While the same caveats apply
when you want to export to a chart, the performance penalty of calculating
subtotals can be huge, because DBIx::SQLCrosstab generates a
UNION query for each row header (if there are more than one), plus
another UNION query for the grand total. Any decent database
server can handle this task for a few hundred thousand records. When it comes
to millions, however, you may be kept waiting for much longer than you
expected. If you have very large databases, make some measurements (even better, do it when normal users are disconnected), and decide if you want to trade waiting time for the additional hassle of dealing with the subtotals outside of the server. I
have successfully tested subtotals in a four-million-record database, and the
results came in acceptable times. It could be that the design of my database is
better than average, or my DBMS is inherently optimized for these tasks. Either
way, I can't predict the effects of using subtotal queries on a huge database
with ten or twenty million records.
I'd like giving a piece of general advice for dealing with large
crosstabs, and this is to index all of the columns involved in the operation -- both the headers and the one being calculated. The optimization in most database
engines can take advantage of indexes in most of the operations related to
cross-tabulation: DISTINCT queries, GROUP BY clauses,
summaries, and JOINs.
Hidden JOINs
In the example given when explaining the parameters, I mentioned an internal
optimization in DBIx::SQLCrosstab to minimize the number of JOINs
necessary to perform the query. This feature is related to column headers whose
values come from a lookup table. If you indicate a key column that is in both
the main table and the lookup table, then you don't need to join the table in
the final query.
Let me explain better. In our example, we used a column header description for departments including:
$params = {
# ...
cols => [
{id => 'dept_id', value ='dept' },
# ...
],
};
The resulting line in the query is:
COUNT(CASE WHEN dept_id = 1 THEN id ELSE NULL END) AS 'pers',
Since dept_id is in both person and
depts, you don't need to add depts to your
from clause, because the values for depts are already stored in
the crosstab engine. I call this an "hidden join," because it performs the same
link as a JOIN but without its burden.
On the contrary, if your definition was:
$params = {
# ...
cols => [
{id => 'dept' },
# ...
],
};
Then the resulting SQL would have been:
COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END) AS 'pers',
Since dept is not a column of person, the you
should have been forced to include depts in your from
clause, thus querying the table twice. Therefore, your final query, the
expensive one, would be slower. Think about this possibility whenever
circumstances allow it.
Building Multiple Column Headers at Once
When talking about column interdependency, I said that DBIx::SQLCrosstab can
deal with these cases appropriately. You can instruct the module to use this
method by adding a group option to the first column definition.
When this option is selected, the crosstab engine uses the from
clause of the first column definition to build the entire headers tree.
$params = {
# ...
cols => [
{
id => 'country',
from => q{
person INNER JOIN locs
ON (person.loc_id=locs.loc_id)
INNER JOIN countries
ON (countries.country_id = locs.country_id)
},
group => 1,
orderby => 'country, loc, gender',
},
{ id => 'loc', from => '1' },
{ id => 'gender', from => '1' },
],
};
When the crosstab engine finds this definition, instead of trying to
retrieve the column values separately, it will create a single query. Notice
that the from clause in the second and third column definitions is
just 1, as a placeholder, because only the first definition is
used for that purpose.
SELECT DISTINCT
country, loc, gender
FROM
person
INNER JOIN locs ON (person.loc_id=locs.loc_id)
INNER JOIN countries ON (locs.country_id=countries.country_id)
ORDER BY
country, loc, gender
It will use the result as a list of permutations, from which it will create the different combinations.
The drawback of this method is that you might possibly query the main data
set twice. You need to balance the benefits of having just the right column
calculation against calculating all of the combinations and then removing the
unwanted ones using the option remove_if_zero.
You can also use this method when there is no dependency between column headers, but you just found out that the engine is creating more combinations than necessary.
Exporting Results
The only useful results you can get from DBIx::SQLCrosstab are the query and the result set.
my $query = $xtab->get_query()
or die "Error $DBIx::SQLCrosstab::errstr\n";
my $records = $xtab->get_recs()
or die "Error $DBIx::SQLCrosstab::errstr\n";
Sometimes, this is just what you need. If you plan to use your query in a static environment, all you want is to create the query and then paste it into your application code. Or you may want to do some checking in the records and display what you want in a custom routine.
In most cases, though, you may need more than that. Therefore, instead of using DBIx::SQLCrosstab, you can use a derived class, DBIx::SQLCrosstab::Format, which has exactly the same interface, with additional capability to format your results in many different ways.
Thus, instead of creating the object as in the previous example, you'd do it using the child class:
my $xtab = DBIx::SQLCrosstab::Format->new( $params )
or die "...";
Then, you have the choice about the format of your results.
# first, be sure that everything is OK
if ( $xtab->get_query() and $xtab->get_recs() )
{
my $html = $xtab->as_html() or die "error ...";
print $html;
}
The method as_html() returns a well-formatted HTML table,
properly enriched with column and row spans, optionally colored with the colors
of your choice (see the module documentation for details).
HTML is not the only format. This module can also handle XML, YAML, XLS (Excel spreadsheet), and CSV formats, with the relative methods:
as_xml: produces a rich XML output. For a simpler one, you could just do:use XML::Simple; my $simple_xml = XMLout( $xtab->as_perl_struct('hoh'), keeproot => 1, noattr => 1 );as_yaml: requires YAMLas_xls: produces direct and transposed tables (requires Spreadsheet::WriteExcel)as_csv
Being a Perl module, it is also quite easy to export Perl data structure, so
you can use the method as_perl_struct() with one of the following
parameters:
lol: a list of lists (as would be returned from theget_recs()method)losh: a list of simple hashes (one key per column)loh: a list of tree-like hashes, with the tree from the column headers for each rowhoh: a hash of hashes, or a full result tree that can be easily passed to YAML or XML::Simple
The actual list of formats is over, but the possibilities are much wider. The module provides some handles that a skilled programmer can use to create more specialized views. I will describe these issues in another article.
To get an idea of the module potential, you can play with an interactive example, which can show you the parameters for complex crosstabs and some interesting output formats.
Then it's your turn. Do your own testing and put DBIx::SQLCrosstab to good use!
Thanks to Stefano Rodighiero (a.k.a. larsen) for proofreading this article and for his constructive criticism. Thanks also to chromatic for making a true article from a difficult draft.
Giuseppe Maxia is a QA developer in MySQL community team. A system analyst with 20 years of IT experience, he has worked as a database consultant and designer for several years. He is a frequent speaker at open source events and he's the author of many articles. He lives in Sardinia (Italy).
Return to ONLamp.com.
You must be logged in to the O'Reilly Network to post a talkback.
Showing messages 1 through 3 of 3.
-
Crosstabs are the playground for XML
2003-12-08 12:11:53 anonymous2 [Reply | View]
IMHO, Crosstabs can easily be done by XML. XML is very flexible at rows, columns, ...
The only way is how to create the XML.
I personally prefer Cocoon to do this. I'm an Open Source user but Cocoon runs as a servlet on Tomcat, so it's available for Windows.
I give a simple example from the Cocoon User Guide:
<page xmlns:sql="http://apache.org/cocoon/SQL/2.0">
<execute-query xmlns="http://apache.org/cocoon/SQL/2.0">
<query name="department" >
select id,name from department_table
</query>
<execute-query>
<query name="employee">
select id,name from employee_table where department_id =
<ancestor-value
sql:name="id" sql:level="1"/>
</query>
</execute-query>
</execute-query>
</page>
This would result in something like this:
<page xmlns:sql="http://apache.org/cocoon/SQL/2.0">
<rowset nrofrows="2" name="department"
xmlns="http://apache.org/cocoon/SQL/2.0">
<row>
<id>1</id>
<name>Programmers</name>
<rowset nrofrows="2" name="employee">
<row>
<id>1</id>
<name>Donald Ball</name>
</row>
<row>
<id>2</id>
<name>Stefano Mazzocchi</name>
</row>
</rowset>
</row>
<row>
<id>2</id>
<name>Loungers</name>
<rowset nrofrows="1" name="employee">
<row>
<id>3</id>
<name>Pierpaolo Fumagalli</name>
</row>
</rowset>
</row>
</rowset>
</page>
Glue this together with some XSLT, and you have your report.
More info: http://cocoon.apache.org
yves.vindevogel@implements.be
-
Crosstabs are the playground for XML
2003-12-08 14:23:57 gmax [Reply | View]
"IMHO, Crosstabs can easily be done by XML"
Are you talking about producing or representing them? Client-side or server-side? (And since we are talking about database servers, don't forget that web servers are database clients.)
Crosstabs are statistical reports. They manipulate data from a database.
The purpose of my article is about doing it server-side rather than client-side.
I don't understand where exactly XML enters the equation.
XML is a markup language. It can represent data, but AFAIK it can't produce it. I hope this is not a case of confusing databases with their media (http://www.dbazine.com/pascal2.html).
Anyway, if Cocoon can send a query to a database, it can also send a crosstab query. The relevant point is how to create the SQL query and then how to transform the result from the database into a hierarchical structure. Notice that, unlike a normal database recordset, a crosstab query has a tree structure at the top and one at its left. Each row is related to both structures at once.
I am not saying that this data can't be represented in XML (because you can see several examples of XML representation at http://gmax.oltrelinux.com/cgi_bin/xtab.cgi) but it is less trivial that it seems.
The bottom line is: XML can represent data, but to extract data from a database you need some help from a major language.
Cheers
gmax






Scott