O'Reilly Databases

oreilly.comSafari Books Online.Conferences.

We've expanded our coverage and improved our search! Search for all things Database across O'Reilly!

Search Search Tips

advertisement
AddThis Social Bookmark Button

Listen Print Discuss Subscribe to Databases Subscribe to Newsletters

How to Misuse SQL's FROM Clause

by Stephane Faroult
09/30/2004

FROM what?

It may seem surprising to state it so, but the FROM clause of SQL statements seems to be one of the most often misused parts of SQL queries. Misused? How is that possible? We put into the FROM clause all the tables to join together in a query, don't we?

Well, well, well. Not quite. At the risk of sounding pedantic, perhaps a bit of (applied) theory would be welcome.

How to Misuse the FROM Clause

Relational theory calls relations what most practitioners know as tables--a list of unordered statements (the rows) about the various qualities (the column values) associated with some entities uniquely identified by the primary key. When you join tables, you derive other relations from the known relations, just like you often derive a theorem from other theorems. This is basically what joins are about, returning related data from different sources and creating new relations.

However, when you look closely, you very often realize that the returned data comes from only some of the tables listed in the FROM clause, typically something in the guise of:

select distinct a.CUSTOMER_ID, a.CUSTOMER_NAME
from CUSTOMERS a,
     ORDERS    b
where a.ZIP_CODE in ...
  and b.ORDERED_DATE >= ...
  and b.CUSTOMER_ID = a.CUSTOMER_ID
order by a.CUSTOMER_NAME

This is a very simple example of a pattern you meet over and over. Isn't this a very fine query, which does just what it asks?

In my view, this is a very bad query; it's logically flawed, which isn't very surprising, and its execution plan, execution time, and various statistics prove that it also performs badly. Poor performance is more often than not the direct consequence of poor logic. The good news is that it's possible to improve the speed of such a slow query significantly. (As anybody who has ever successfully tuned a SQL query can testify, this is an area where significant means an order of magnitude or two, not 20 percent).

Related Reading

MySQL Reference Manual
Documentation from the source
By Michael "Monty" Widenius, David Axmark, MySQL AB

Where is the logical flaw? As stated above, this is a typical query in which all the data comes from a single table: CUSTOMERS. Interestingly, it returns a CUSTOMER_ID. There's no prize for guessing that this is likely to be the primary key, uniquely identifying each row in CUSTOMERS. However, we need a DISTINCT; one customer may have ordered several times in the period of interest. In other words, we are fixing with DISTINCT a problem of our own design.

This is a "Penelope" type of query. Penelope, as told in Homer's Odyssey, was the wife of Ulysses, king of Ithaca. Hoping for the return of her husband from Troy, she kept suitors at bay by pretending that she would consider them once she completed her weaving--which she kept going by undoing at night what she had done during the day.

Like Penelope, we are doing a lot of work by executing a regular join and bringing back many rows, only to undo a large part of it in order to retrieve the resulting set. (The "best" examples of Penelope queries are usually met with queries involving complex views, however.)

An Improved Query

In this case, the ORDERS table doesn't belong to the FROM clause. It belongs to the WHERE clause, because it appears only for the purpose of filtering on the ORDERED_DATE column. Since it belongs to the WHERE clause, it's better as a subquery--either correlated, which means that it refers to the current row of the main query and therefore fires each time the check must be performed, or uncorrelated, which means that it can execute independently of the main query.

In our simple example, a correlated subquery would be:

select a.CUSTOMER_ID, a.CUSTOMER_NAME
from CUSTOMERS a
where a.ZIP_CODE in ...
  and exists (select NULL
              from ORDERS b
              where b.CUSTOMER_ID = a.CUSTOMER_ID
                and b.ORDERED_DATE >= ...)
order by a.CUSTOMER_NAME

while an uncorrelated subquery would be:

select a.CUSTOMER_ID, a.CUSTOMER_NAME
from CUSTOMERS a
where a.ZIP_CODE in ...
  and a.CUSTOMER_ID in (select b.CUSTOMER_ID
                        from ORDERS b
                        where b.ORDERED_DATE >= ...)
order by a.CUSTOMER_NAME

It may seem surprising, but on a very simple example such as this one, the RDBMS optimizer--that part of the code in charge of finding the most efficient way to execute the query--may sometimes perfectly decide to transform a subquery back into a join, for sound reasons that are beyond the scope of this article.

However, when it does this, the optimizer does it on purpose. It should be obvious that the really complicated case involves many tables. Then the complexity of singling out each possible rewriting increases exponentially. As the optimizing phase is part of the overall response time, what the optimizer tries as a fraction of what is possible falls dramatically when possibilities widen.

The result is that the more complicated the query, the higher the risk that the optimizer will take the wrong decision. That is when writing queries in a sensible way becomes all-important.

Correlated or Uncorrelated?

Because we have several ways to write the filtering subquery, we of course have to consider whether one is significantly better than the other.

Over the past 20 years, many SQL tuning books and articles have advocated the use of NOT EXISTS over NOT IN. (Interestingly, many such elements of popular technical culture have roots in what used to be true facts long ago but which are not necessarily true anymore.) True or not, this assertion has become in the mind of many a SQL developer "Use EXISTS, not IN." In some cases, this is justifiable, but it's not a good rule or even good practice.

As its name implies, a correlated subquery (the inner query) depends on an outer query that fires it. How often it fires depends on the circumstances.

The defining characteristic of a correlated subquery is that it refers to at least one value from the current row of the outer query--CUSTOMER_ID in our example above. If we have no other filtering condition, we will have to evaluate the correlated subquery for each row we inspect during the execution of the outer query. In that case, it will probably cost much more than the JOIN and DISTINCT, as a sophisticated optimizer will probably notice. If there are other, simpler filtering conditions to apply prior to this correlated subquery, we can fire it only for the candidate rows that have successfully passed all the other conditions. If those other conditions are highly selective, we may need to fire the subquery for only a handful of rows more than in our final resulting set, applying an efficient search criterion (once again, the CUSTOMER_ID) each time.

The uncorrelated subquery, on the other hand, doesn't refer to the outer query. As a result, it executes only once. The important question is how many rows this subquery will return. If it's a (relatively) small number, it is likely that the engine will use an index (on CUSTOMERS(CUSTOMER_ID) in our example). It is interesting to point out that in the correlated subquery case, it would have been ORDERS(CUSTOMER_ID). If, however, the subquery returns a really huge number of rows, the engine will use some kind of hash or merge join. This may prove much more efficient than a correlated subquery if the final, global result set is large, since a correlated subquery executes at least as many times as the number of rows returned, and possibly much more if the other criteria are not very selective.

To summarize, both correlated and uncorrelated subqueries may each in turn prove to be more efficient. The former wins when the final result set is relatively small and the other criteria are efficient. The latter works best when dealing with large tables from which you expect many rows. When in doubt, test with both.

Conclusion

In short, when trying to correct a query that returns duplicate rows, a quick fix of adding a DISTINCT is definitely the solution to avoid. Unfortunately, it is widely practiced. Searching the code for DISTINCT with joins is a way to improve code performance easily.

Remember that the FROM clause of a query should refer to only two types of tables:

  • those from which you want values returned
  • those allowing to join two or more tables in the above category

Any table that does not fall in either of those two categories should appear in a subquery. For instance, given a join such as:

...
FROM T1, T2, T3, T4, T5, T6
WHERE ...
...

the various links between the various tables may allow us to draw a diagram such as:

T1 - T2 - T3 -T6
      |
     T4 - T5

Assume that the data returned comes from T1 and T4. Those two tables have their place in the FROM clause, obviously. So does T2, since it allows the joining T1 to T4. The "end of link" tables, however--T3, T6, and T5--belong to subqueries.

These simple rules produce effective results when applied intelligently.

Stephane Faroult first discovered the relational model and the SQL language in 1983.


Return to the MySQL DevCenter


Have a question about rewriting a query? Ask Stephane here.
You must be logged in to the O'Reilly Network to post a talkback.
Post Comment
Main Topics Oldest First

Showing messages 1 through 5 of 5.

  • Doesn't work for Oracle, Db2 UDB, DB2 or SQLserver
    2006-03-21 15:55:13  mm0 [Reply | View]

    I assembled sample data from my ssytem to test this out because I didn't believe it was generalizable. It's the opposite of the suggestion for Oracle 8/9 and sqlserver in my test databases, and DB2 and DB2 UDB rewrite the subquery as a join.

    Tables are indexed on constraint columns, indexed on join columns, cust table has ~25 million rows and order table has ~60 million rows.

    I don't have a mysql instance up on these servers to test with. I think you're making a generalization of a mysql-specific suggestion.
  • Uncorrelated subqueries
    2004-11-05 10:36:35  runrig [Reply | View]

    I've seen examples where it sped up the query when you used 'distinct' (or a 'group by' clause) in the uncorrelated subquery.
  • Another option?
    2004-10-22 08:35:21  solutions@netgis.com [Reply | View]

    Should a sub-query in the FROM clause be considered?

    select a.CUSTOMER_ID, a.CUSTOMER_NAME
    from CUSTOMERS a
    join
    (select b.CUSTOMER_ID
    from ORDERS b
    where b.ORDERED_DATE >= ...
    group by b.CUSTOMER_ID) b
    on
    a.CUSTOMER_ID = b.CUSTOMER_ID
    where a.ZIP_CODE in ...
    order by a.CUSTOMER_NAME
  • This only works w/ 4.1+
    2004-10-18 19:32:22  Purdy [Reply | View]

    Here's a quick reference for those of us stuck w/ <4.1.

    I'm on a Debian stable server, which typically would have 3.23.49, but I got <a href="http://www.dotdeb.org">an apt-source that brings in 4.0.21... one of these days!
  • Absolutely incorrect for Oracle
    2004-10-14 07:35:20  ARia [Reply | View]

    This tip is absolutely broken for example for Oracle.


Tagged Articles

Post to del.icio.us

This article has been tagged:

sql

Articles that share the tag sql:

How to Misuse SQL's FROM Clause (74 tags)

Configuring Database Access in Eclipse 3.0 with SQLExplorer (28 tags)

Managing Many-to-Many Relationships with PL/pgSQL (18 tags)

Hierarchical SQL (16 tags)

The Outer Limits of SQL JOINs (14 tags)

View All

database

Articles that share the tag database:

MySQL FULLTEXT Searching (54 tags)

Live Backups of MySQL Using Replication (53 tags)

Advanced MySQL Replication Techniques (53 tags)

Dreaming of an Atom Store: A Database for the Web (49 tags)

How to Misuse SQL's FROM Clause (38 tags)

View All

mysql

Articles that share the tag mysql:

MySQL FULLTEXT Searching (155 tags)

Live Backups of MySQL Using Replication (152 tags)

Advanced MySQL Replication Techniques (125 tags)

Ten MySQL Best Practices (59 tags)

Rolling with Ruby on Rails (56 tags)

View All

programming

Articles that share the tag programming:

Rolling with Ruby on Rails (1374 tags)

Very Dynamic Web Interfaces (279 tags)

Ajax on Rails (231 tags)

Understanding MVC in PHP (202 tags)

A Simpler Ajax Path (186 tags)

View All

tuning

Articles that share the tag tuning:

How to Misuse SQL's FROM Clause (16 tags)

Introducing LAMP Tuning Techniques (7 tags)

Speeding up Linux Using hdparm (5 tags)

Top Ten Web Performance Tuning Tips (3 tags)

Preventing Denial of Service Attacks (2 tags)

View All

Sponsored Resources

  • Inside Lightroom

Related to this Article

Understanding Oracle Clinical Understanding Oracle Clinical
by Joan M. Johnson
May 2007
$9.99 USD

Inside SQLite Inside SQLite
by Sibsankar Haldar
April 2007
$9.99 USD

Advertisement
O'Reilly Media

©2009, O'Reilly Media, Inc.
(707) 827-7000 / (800) 998-9938
All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
About O'Reilly
Academic Solutions
Authors
Contacts
Customer Service
Jobs
Newsletters
O'Reilly Labs
Press Room
Privacy Policy
RSS Feeds
Terms of Service
User Groups
Writing for O'Reilly
Content Archive
Business Technology
Computer Technology
Google
Microsoft
Mobile
Network
Operating System
Digital Photography
Programming
Software
Web
Web Design
More O'Reilly Sites
O'Reilly Radar
Ignite
Tools of Change for Publishing
Digital Media
Inside iPhone
O'Reilly FYI
makezine.com
craftzine.com
hackszine.com
perl.com
xml.com

Partner Sites
InsideRIA
java.net
O'Reilly Insights on Forbes.com