2011-08-16

How to Quickly Create a Histogram in MySQL

This is a post about a super quick-and-dirty way to create a histogram in MySQL for numeric values.
There are multiple other ways to create histograms that are better and more flexible, using CASE statements and other types of complex logic. This method wins me over time and time again since it's just so easy to modify for each use case, and so short and concise.
This is how you do it:
SELECT ROUND(numeric_value, -2)    AS bucket,
       COUNT(*)                    AS COUNT,
       RPAD('', LN(COUNT(*)), '*') AS bar
FROM   my_table
GROUP  BY bucket;
Just change numeric_value to whatever your column is, change the rounding increment, and that's it. I've made the bars to be in logarithmic scale, so that they don't grow too much when you have large values.
This is an example of such query on some random data that looks pretty sweet. Good enough for a quick evaluation of the data.
+--------+----------+-----------------+
| bucket | count    | bar             |
+--------+----------+-----------------+
|   -500 |        1 |                 |
|   -400 |        2 | *               |
|   -300 |        2 | *               |
|   -200 |        9 | **              |
|   -100 |       52 | ****            |
|      0 |  5310766 | *************** |
|    100 |    20779 | **********      |
|    200 |     1865 | ********        |
|    300 |      527 | ******          |
|    400 |      170 | *****           |
|    500 |       79 | ****            |
|    600 |       63 | ****            |
|    700 |       35 | ****            |
|    800 |       14 | ***             |
|    900 |       15 | ***             |
|   1000 |        6 | **              |
|   1100 |        7 | **              |
|   1200 |        8 | **              |
|   1300 |        5 | **              |
|   1400 |        2 | *               |
|   1500 |        4 | *               |
+--------+----------+-----------------+
Some notes:
Ranges that have no match will not appear in the count - you will not have a zero in the count column. Also, I'm using the ROUND function here. You can just as easily replace it with TRUNCATE if you feel it makes more sense to you.

2011-08-10

How to Kill All Connections to MongoDB

This post is similar to the previous one I have posted before for MySQL. Only this time, it's for MongoDB.
What's so nice about MongoDB is that the command line is actually a fully fledged javascript console. So you can do this:
(the embedded script is not visible from the RSS feed though, so click through to the post itself)
The function accepts two variables, both optional.
The first one is the namespace for which you'd like the connections to be working with. It's optional, but you'd usually want to specify it. If you don't, you'll kill all the connections, including the ones that perform replication duties.
The second one is a "wet" run parameter, that if set to true actually does the killing. The default is false, thus not doing anything but print the connections to be killed to your console.

The whole thing is a work in progress, plus my javascript skills are far from perfect, so comments here or on github are welcome.
A potential issue I've already noticed is that when some operations are in a yeild state, the namespace's first letter is prefixed with a question mark "?". In that case, the simple namespace filter will probably miss it. But, I didn't want to make the function any more complex than it already is.

2011-08-08

How to Easily See Who's Connected to Your MySQL Server

I'm posting this here since it has been useful for me, and the blog is a nice place to keep public notes.
If you have servers which have multiple application servers connected to them, you often need to see things like who's connected, how many connections they have, and which users. Using SHOW PROCESSLIST doesn't work that well, since it gives you a row for each server.

What we want is an output similar to this:
+-----------------+-----------------+----------+
| host_short      | users           | count(*) |
+-----------------+-----------------+----------+
| slave1          | repl            |        1 |
| slave2          | repl            |        1 |
| localhost       | event_scheduler |        1 |
| 111.111.222.111 | root, foo       |        2 |
| 111.111.222.222 | appuser, bar    |        3 |
| 111.111.222.333 | appuser, moshe  |        9 |
+-----------------+-----------------+----------+
And it is achieved using a simple query such as this one:

SELECT SUBSTRING_INDEX(host, ':', 1) AS host_short,
       GROUP_CONCAT(DISTINCT USER)   AS users,
       COUNT(*)
FROM   information_schema.processlist
GROUP  BY host_short
ORDER  BY COUNT(*),
          host_short;

A final note: I'm not sure what version of MySQL this query needs to function, but it works great on MySQL 5.5, and should work just as well on MySQL 5.1.

2011-08-06

How to Move a MongoDB Collection Between Databases

If you need to move a collection between two MongoDB databases, there is no need to dump and restore your data using mongodump / mongorestore.

In a way, it's pretty similar to what we can do with  MySQL's "RENAME TABLE" command:
db.runCommand({renameCollection:"sourcedb.mycol",to:"targetdb.mycol"})
In the background, MongoDB will dump and restore it automatically. There is no metadata "magic" for such a rename since databases reside in different files on disk. It just saves a bit of work, and it's worth knowing about.

How to Fix Erroneously Named MongoDB Collections

In MongoDB, collections are not supposed to include certain characters like, for example, a hyphen "-" or a slash "/". The manual has this to say on the matter:
Collection names should begin with letters or an underscore and may include numbers; $ is reserved. Collections can be organized in namespaces; these are named groups of collections defined using a dot notation.
The keyword here is "should".
It is possible to create collations with names that do not adhere to these rules. This is not enforced at all. When you try to access such a collection from the shell, things don't quite work as expected:
>db.test-col.find()
Sat Aug  6 14:54:06 ReferenceError: col is not defined (shell):1
This is because the javascript console tries to subtract "col" from "test". I've tried different escaping methods for the rogue hyphen, but couldn't find one that works.
Eventually, I've reverted to using the more "internal" renameCollection command, to rename the collection into something manageable:
>db.runCommand({renameCollection:"test.test-col",to:"test.test_col"})
Which solves the problem completely.

Now you might be wondering: how was this created in the first place? Because if you can't read from this collection, you can't "save" into it from the shell as well.
We have a nice list of commands here which I recommend to anyone who's working with MongoDB to be familiar with. One of these commands is the aptly named "create", which will obediently create a collection with such a name:
>db.runCommand({"create":"test-col"});
{ "ok" : 1 }
I can only assume drivers use this same command to create the collections, which was how it was created in the first place.

2011-01-23

More Optimizer Bugs in MySQL

I'm proud to report of two bugs I've submitted to the MySQL bug tracker.
Both are seemingly simple and vaguely related, but have grave consequences when combined with large amounts of data.

The first one, Bug #50212 is a regression performance problem in 5.1, dealing with MySQL not using a primary key when it can, in order to avoid a file sort. The problem is made worse by the fact that even using a FORCE INDEX hint does not help things. The bug requires a join condition to be present to reproduce.
So if you have a large table, and you're expecting MySQL to select data from it ordered by the primary key and avoid a file sort, you're out of luck.
Of course, you can remove the ORDER BY and hope it's going to return in order, but it's not guaranteed.
There is no workaround, and this bug is in a verified state for a year now.

A more recent bug, Bug #59557 deals with another optimization issue, that takes place when MySQL is not willing to perform an index scan to complete a GROUP BY statement, when you reverse the order of the fields compared the their order in the index.

Since a GROUP BY statement does not care about the order of fields, it should not matter: the fields in the GROUP BY clause are essentially a set, and not a ordered list, as they are in ORDER BY.

An example to demonstrate the problem (taken from the bug report itself):
For this table:

CREATE TABLE `t` (
`a` INT(11) NOT NULL,
`b` INT(11) NOT NULL,
PRIMARY KEY (`a`,`b`)
) ENGINE=INNODB;
Run this:
mysql> EXPLAIN SELECT COUNT(*) FROM t GROUP BY b,a\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t
type: index
possible_keys: NULL
key: PRIMARY
key_len: 8
ref: NULL
rows: 1
Extra: Using index; Using temporary; Using filesort

What you get is: Using index; Using temporary; Using filesort
If you're a MySQL veteran, you know that any GROUP BY is implicitly a ORDER BY as well - and you'd guess that causes the issue.
Unfortunately, even if we add a ORDER BY NULL clause, the optimizer would still choose a wrong plan, like so:
mysql> EXPLAIN SELECT COUNT(*) FROM t GROUP BY b,a ORDER BY NULL\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t
type: index
possible_keys: NULL
key: PRIMARY
key_len: 8
ref: NULL
rows: 1
Extra: Using index; Using temporary
What you get now is: Using index; Using temporary
A bit less wrong, but still... wrong. If someone knows what the database tried to accomplish by using a temporary table here, I'd be happy to hear an explanation... :)

As I've said, I am proud to submit bugs and help the community, but I would be a bit more proud if bugs I've submitted a year ago would have been fixed by now. Just saying.