Emulating Analytic (AKA Ranking) Functions with MySQL
Pages: 1, 2, 3, 4
MySQL now:
mysql>select *
-> from (select a.DEPTNO, a.EMPNO, a.LASTNAME, a.FIRSTNAME, a.SAL,
-> (select 1 + count(*)
-> from EMPLOYEES b
-> where b.DEPTNO = a.DEPTNO
-> and b.SAL > a.SAL) RANK
-> from EMPLOYEES as a) as x
-> where x.RANK <= 5
-> order by x.DEPTNO, x.RANK;
+--------+-------+------------+-----------+---------+------+
| DEPTNO | EMPNO | LASTNAME | FIRSTNAME | SAL | RANK |
+--------+-------+------------+-----------+---------+------+
| 10 | 4751 | WOLFORD | KATHY | 7997.34 | 1 |
| 10 | 10517 | HARRIS | GARLAND | 7993.74 | 2 |
[snip]
| 60 | 4013 | BURMEISTER | HATTIE | 7973.04 | 4 |
| 60 | 1472 | LADD | JEFFREY | 7952.5 | 5 |
+--------+-------+------------+-----------+---------+------+
30 rows in set (5 min 29.12 sec)
Exactly the same result (<blush>except for the slightly embarrassing line under the table</blush>.) Let's take a closer look at what we are doing. Our subquery that computes the rank is correlated: that means that it refers to the current row of EMPLOYEES, through both the DEPTNO and the SAL columns. We need them to compute how many employees in the same department earn a bigger salary than the current employee. In other words, we must fire the query for each of the 10,000 rows we inspect. Now, when I created my table, I defined EMPNO as the primary key, and created an additional index on LASTNAME, since I expected it to be a frequent search criterion. Neither DEPTNO nor SAL are indexed, in either database. Indexing DEPTNO, given the distribution, is pointless, because it isn't selective enough. Ten thousand times a full scan over a ten thousand row table: one hundred million rows inspected in all. Somewhat more visible than fourteen scans of a fourteen row table...
Let me be upfront about this, I dislike adding indexes. People often underestimate their impact on inserts, deletes, and (sometimes) updates, which can be very high. I don't want to fix a query and break the nightly batch program. There are also cases when I couldn't index. Suppose that instead of having departments, we have stockmarkets. Instead of having employees, we have securities, and instead of ranking salaries, we want to rank the difference between the current price and yesterday's closing price, as a percentage. There is a high update rate, and even if we wanted to, MySQL doesn't allow you to index a computed column. What would we do? Never mind, this isn't a real life assignment, just an ONLamp article. Let's just try to save face and index the two columns that we need to correlate the main query and the subquery.
mysql> create index tempidx on EMPLOYEES(DEPTNO, SAL);
Query OK, 10000 rows affected (0.14 sec)
Records: 10000 Duplicates: 0 Warnings: 0
When we run the query again, here is the bottom line:
30 rows in set (26.48 sec)
We have avoided having to commit hara-kiri (just) but compared to Oracle, it's still far from glorious. Shall we have to kowtow to the mighty Oracle? Let's not throw in the towel yet.
Getting the Top Values Efficiently, Take One
When we run into a performance problem, there is no such thing as RTFMing. We are trying to get with MySQL the same results as SQL extensions that Oracle has, and MySQL hasn't. Instead of using an otherwise squeaky clean SQL construct such as the correlated subquery in the select list, couldn't we find extensions that MySQL has and Oracle hasn't to help us? Perhaps it would be wise to restate the problem. We want to compare the employee in the current row to the employees in the same department. How would we do it manually? I don't know about you, but I don't think that I would compute for each employee in turn how many people make more in the same department. I'd rather establish a full list of employees in the same department, and sort them by salary (which is probably what Oracle does behind the scene); and I'd reuse the same list for each employee in the department instead of recomputing the list each time. What we need is a sorted list.
There is precisely a function that builds a list, GROUP_CONCAT, and interestingly it is a function that Oracle natively lacks (even if it is possible to create such a function as a user-written function). The MySQL documentation provides the full syntax as:
GROUP_CONCAT([DISTINCT] expr [,expr ...]
[ORDER BY {unsigned_integer | col_name | expr}
[ASC | DESC] [,col_name ...]]
[SEPARATOR str_val])
Have you noticed the order by passed as a parameter? Doesn't it ring a bell? There is of course a catch: the resulting string has a limited length (although adjustable through the group_concat_max_len variable), 1024 characters by default. I shall return to this topic in a moment, but since for the time being I am only interested by the first five items in the list (since I want the people who have the top five salaries) the default maximum length is much more than I need.
I now have found how to construct my sorted list. But how shall I determine the position of the employee described by the current row inside this list? We have another function (which once again Oracle lacks) that fits our requirements, FIND_IN_SET(str,strlist). You should take note that in this function, the separator used in strlist is required to be a comma, which means that we must use a comma (the default) as the separator in our call to GROUP_CONCAT(). So, if whatever we need to store in our list may contain a comma, we must make use of REPLACE() to substitute this comma for an innocuous character whenever we add an item to the list, or search an item in the list.
Properly equipped with these two functions, I can now write and run my query (after having dropped the index on DEPTNO and SAL to be in the same context as Oracle):
mysql> select *
-> from (select e.DEPTNO,
-> e.EMPNO,
-> e.LASTNAME,
-> e.FIRSTNAME,
-> e.SAL,
-> find_in_set(e.SAL, x.SALLIST) RANK
-> from EMPLOYEES as e,
-> (select DEPTNO, group_concat(SAL
-> order by SAL desc) SALLIST
-> from EMPLOYEES
-> group by DEPTNO) as x
-> where e.DEPTNO = x.DEPTNO) as z
-> where RANK <= 5
-> and RANK > 0
-> order by DEPTNO, RANK;
A word about the list we build: we haved simply concatenated the salaries by decreasing order, letting MySQL implicitly convert float values to character strings. Since we have about 1,600 employees per department, the default length of 1024 characters is grossly insufficient to build a complete list. However, as I have said previously, we don't need the complete list, we just need the first five values. Nevertheless, we must anticipate that salaries who are not in the top 120 or whereabout will fail to be stored, and therefore will not be found by the FIND_IN_SET() function (which will return 0.) Hence the necessity for an additional condition that RANK must be strictly positive. We get the same result as usual:
+--------+-------+------------+-----------+---------+------+
| DEPTNO | EMPNO | LASTNAME | FIRSTNAME | SAL | RANK |
+--------+-------+------------+-----------+---------+------+
| 10 | 4751 | WOLFORD | KATHY | 7997.34 | 1 |
| 10 | 10517 | HARRIS | GARLAND | 7993.74 | 2 |
[snip]
| 60 | 4013 | BURMEISTER | HATTIE | 7973.04 | 4 |
| 60 | 1472 | LADD | JEFFREY | 7952.5 | 5 |
+--------+-------+------------+-----------+---------+------+
30 rows in set, 1 warning (0.30 sec)
Same result, except that this time the performance, without being quite as excellent as Oracle's, is very decent. We get a warning, in punishment of our sin of trying to build lists that are more than 1K in length, but otherwise we get precisely what we want, with an SQL usage that is both relatively elegant and efficient. We have in the GROUP_CONCAT()/FIND_IN_SET() combination a kind of MySQL crypto-analytic function that enables us to answer almost all the "what are the top (or bottom) n" questions, as long as n isn't too large, and can be used to emulate several Oracle analytic functions. For instance, Oracle can handle both rank() and dense_rank(); the former focuses on what is ranked, and the latter on how it is ranked. Let's suppose that Tom, Dick, and Harry have obtained the following scores at a game:
| Tom |
825 |
|
Dick |
825 |
|
Harry |
750 |
Harry's rank() is 3, because rank() focuses on what is ranked, the people, and there are two people who have a better score than Harry. Both Tom and Dick have rank 1. However, dense_rank() focuses on how people are ranked, which is to say scores. And here, Harry's dense_rank() is 2, since there is only one score value higher than 750; "dense" refers to the fact that there is no gap in rank numbers. If you want to emulate dense_rank(), all you need to do is to add a distinct to the expression in group_concat(). And if you want to emulate row_number(), that assigns a unique number within the group defined by the partition clause, all you need to do is build a list using unique identifiers. In our example, keeping in mind that we accept that we don't need more values than what the longest string returned by group_concat() is able to hold, we can emulate:
row_number() over (partition by deptno
order by hiredate)
by using
FIND_IN_SET(EMPNO, EMPNOLIST)
with EMPNOLIST built as
GROUP_CONCAT(EMPNO ORDER BY HIREDATE) as EMPNOLIST
associated with the usual group by DEPTNO. A word of caution though, as with in our example with SUM(), any additional restrictive condition that Oracle would need only once will have to be added to the group by query as well with MySQL.
In my next article, we'll look at another way to do this same analytical query, as well as how to implement the LAG function and how to foresee values in queries.
Stephane Faroult first discovered the relational model and the SQL language in 1983.
Return to ONLamp.com.
You must be logged in to the O'Reilly Network to post a talkback.
Showing messages 1 through 8 of 8.
-
Flow sum()? have you any ideas?
2007-04-02 03:54:44 LionSoftware [Reply | View]
-
Flow sum()? have you any ideas?
2007-04-02 04:17:01 Stephane Faroult |
[Reply | View]
There is a part 2 coming :-). It should give you some ideas.
-
Very good article
2007-04-01 06:27:03 Giuseppe Maxia |
[Reply | View]
Hi.
Thanks for this very insightful article.
You said that you were using MySQL 5.0. Could you be more specific about the exact version?
If there is a public place where we can download the sample data used for your test, it would be greatly appreciated.
Best regards
Giuseppe Maxia
-
MySQL version
2007-04-02 00:24:19 Stephane Faroult |
[Reply | View]
If you want all the gory details the tests were performed with MySQL 5.0.27-standard and Oracle 10.2.0.1.0. Both were running on a double Xeon 2.4 GHz test machine under what initially was a Mandriva 2007.0 that is in a perpetual state of flux and that Oracle anyway believes to be a redhat-4.
Sample data available on request :-). -
MySQL version
2007-04-04 20:10:40 jaypipes [Reply | View]
Hi there! Great article, Stephane! Reminds me of the examples I put into Chapter 8 of "Pro MySQL"...
Couple things I wanted to note, however.
First, the number one reason you see such a performance difference with adding an index *isn't* actually the (correct) point you make about the subquery being fired once for each subset row. The reason is because you are using the InnoDB storage engine, and InnoDB has a known issue with doing a SELECT COUNT(*) when there is no WHERE condition on an indexed field. InnoDB is forced to do a full table scan, as you point out, but for a different reason, each time a SELECT COUNT(*) is used in this manner. See http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html for more info on that.
Second, is that you can get *much* better performance from your MySQL SELECTs by avoiding correlated subqueries and rewriting as a derived table... It's a bug in the MySQL optimizer unfortunately. Even better performance would be from creating a temporary table in place of the derived table and having an appropriate index on the temporary table. Unfortunately, a derived table is created internally with no indexes.
Here's your correlated subquery rewritten as a derived table. Let me know if you get much better performance from it, though I think the biggest performance benefit comes from adding an index because of the InnoDB limitation mentioned above:
original SQL:
select *
from (select a.DEPTNO, a.EMPNO, a.LASTNAME, a.FIRSTNAME, a.SAL,
(select 1 + count(*)
from EMPLOYEES b
where b.DEPTNO = a.DEPTNO
and b.SAL > a.SAL) RANK
from EMPLOYEES as a) as x
where x.RANK <= 5
order by x.DEPTNO, x.RANK;
Rewritten SQL:
select a.DEPTNO, a.EMPNO, a.LASTNAME, a.FIRSTNAME, a.SAL, x.rank as RANK
from EMPLOYEES as a
inner join (
# Get the ranks of each employee within the dept
select e1.EMPNO, count(DISTINCT e2.SAL) as rank
from EMPLOYEES e1
inner join EMPLOYEES e2
on e1.DEPTNO = e2.DEPTNO
and e1.SAL <= e2.SAL
group by e1.EMPNO
) as x
on a.EMPNO = x.EMPNO
order by a.DEPTNO, RANK;
Do the comparison after adding that index, of course. :)
Cheers,
Jay Pipes
jay at mysql dot com -
MySQL version
2007-04-04 19:51:13 jaypipes [Reply | View]
Hi there! Great article, Stephane! Reminds me of the examples I put into Chapter 8 of "Pro MySQL"...
Couple things I wanted to note, however.
First, the number one reason you see such a performance difference with adding an index *isn't* actually the (correct) point you make about the subquery being fired once for each subset row. The reason is because you are using the InnoDB storage engine, and InnoDB has a known issue with doing a SELECT COUNT(*) when there is no WHERE condition on an indexed field. InnoDB is forced to do a full table scan, as you point out, but for a different reason, each time a SELECT COUNT(*) is used in this manner. See http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html for more info on that.
Second, is that you can get *much* better performance from your MySQL SELECTs by avoiding correlated subqueries and rewriting as a derived table... It's a bug in the MySQL optimizer unfortunately. Even better performance would be from creating a temporary table in place of the derived table and having an appropriate index on the temporary table. Unfortunately, a derived table is created internally with no indexes.
Here's your correlated subquery rewritten as a derived table. Let me know if you get much better performance from it, though I think the biggest performance benefit comes from adding an index because of the InnoDB limitation mentioned above:
original SQL:
select *
from (select a.DEPTNO, a.EMPNO, a.LASTNAME, a.FIRSTNAME, a.SAL,
(select 1 + count(*)
from EMPLOYEES b
where b.DEPTNO = a.DEPTNO
and b.SAL > a.SAL) RANK
from EMPLOYEES as a) as x
where x.RANK <= 5
order by x.DEPTNO, x.RANK;
Rewritten SQL:
select a.DEPTNO, a.EMPNO, a.LASTNAME, a.FIRSTNAME, a.SAL, x.emp_count + 1 as RANK
from EMPLOYEES as a
inner join (
select count(*) as emp_count
from EMPLOYEES
) as x
on x.DEPTNO = a.DEPTNO
and x.SAL > a.SAL
where RANK <= 5
order by a.DEPTNO, RANK;
Do the comparison after adding that index, of course. :)
Cheers,
Jay Pipes
jay at mysql dot com -
MySQL version
2007-04-06 10:16:23 Stephane Faroult |
[Reply | View]
Jay,
Where did I say I was using InnoDB? I didn't want to compare Oracle to Oracle ;-). Actually, as my point was a purely SQL point, I haven't specified any storage engine - which means that my table is of the default MyISAM type. That said, the rewriting you suggest is indeed interesting. I have used the code of your second rewriting, added the missing condition on RANK and ran it twice, one without the index, and one with the index. There is a noted improvement WITHOUT the index on the query with the subquery in the SELECT list, as could indeed be expected:
30 rows in set (1 min 48.16 sec)
With the index, though, it's quite comparable:
30 rows in set (33.57 sec)
The problem with the index is the way I have generated my data ( ... randomly). As a result, employees from a same department are spread all over the table. We might expect greater efficiency if employees were physically clustered by DEPTNO (by "clustered" I mean of course no more than "if the rows were close from each other").
-
Great article!
2007-04-01 01:05:59 fauigerzigerk [Reply | View]
That's very useful. I just wonder why the rank query took 26 seconds after you added the index. Even if every single person had a different salary, which is unlikely for a real payroll, there would be no more than 10000 entries in that index. That means we need 13 steps to find an entry (in a btree) and then at most 4 traversal steps in the tree to find the rank. So we're talking about a mere 170,000 steps which should definately result in a a sub-second runtime. I suspect the optimizer doesn't use the where criteria to cut short the tree traversal.





The idea to repeat analytic SUM() OVER (PARTITION) is GREAT! But I have no ideas how to repeat SUM() OVER (ORDER BY) on MySQL. Have you any ideas?
Because I trying to repeat this:
Table transactions:
id amount
1 100
2 200
3 -50
4 -10
select id, amount, sum(amount) over (order by id) flowsum from mytable;
id amount flowsum
1 100 100
2 200 300
3 -50 250
4 -10 240