2011-08-16

How to Quickly Create a Histogram in MySQL

This is a post about a super quick-and-dirty way to create a histogram in MySQL for numeric values.
There are multiple other ways to create histograms that are better and more flexible, using CASE statements and other types of complex logic. This method wins me over time and time again since it's just so easy to modify for each use case, and so short and concise.
This is how you do it:
SELECT ROUND(numeric_value, -2)    AS bucket,
       COUNT(*)                    AS COUNT,
       RPAD('', LN(COUNT(*)), '*') AS bar
FROM   my_table
GROUP  BY bucket;
Just change numeric_value to whatever your column is, change the rounding increment, and that's it. I've made the bars to be in logarithmic scale, so that they don't grow too much when you have large values.
This is an example of such query on some random data that looks pretty sweet. Good enough for a quick evaluation of the data.
+--------+----------+-----------------+
| bucket | count    | bar             |
+--------+----------+-----------------+
|   -500 |        1 |                 |
|   -400 |        2 | *               |
|   -300 |        2 | *               |
|   -200 |        9 | **              |
|   -100 |       52 | ****            |
|      0 |  5310766 | *************** |
|    100 |    20779 | **********      |
|    200 |     1865 | ********        |
|    300 |      527 | ******          |
|    400 |      170 | *****           |
|    500 |       79 | ****            |
|    600 |       63 | ****            |
|    700 |       35 | ****            |
|    800 |       14 | ***             |
|    900 |       15 | ***             |
|   1000 |        6 | **              |
|   1100 |        7 | **              |
|   1200 |        8 | **              |
|   1300 |        5 | **              |
|   1400 |        2 | *               |
|   1500 |        4 | *               |
+--------+----------+-----------------+
Some notes:
Ranges that have no match will not appear in the count - you will not have a zero in the count column. Also, I'm using the ROUND function here. You can just as easily replace it with TRUNCATE if you feel it makes more sense to you.

5 comments:

  1. Awesomeness, thanks for sharing!

    Sometimes just knowing that something is possible makes your life easier - I might have assumed this wasn't possible without a lot of effort, so the thought to Google for it wouldn't have entered my mind. Now, I won't remember how to do it, but I'll remember it's possible, and will be able to find your post.

    ReplyDelete
  2. You're welcome!

    A funny thing I've thought about after publishing - the stars and numbers look very symmetrical. Thing is, I can make them completely symmetrical by using LOG10 instead of LN :)

    ReplyDelete
  3. Super smart. I didn't know you could round to a negative digit, makes perfect sense though!

    ReplyDelete
  4. No need to include the `bar` column—the `count` column already forms a logarithmic histogram! Look carefully at the example in the post: the shape of the `count` column is a mirror image of the `bar` column. This happens because the number of digits in a positive integer N is CEIL(LOG10(n)).

    If you want to make it a linear histogram, just do EXP(COUNT(*)) instead.

    ReplyDelete
  5. Wow. I'm really impressed by this, and it came in really handy just now. Thanks for posting it!

    ReplyDelete