2008-05-31

Taking Cache Warm-up Time Into Consideration

Last week there there was a short scheduled downtime for the system, and I've used the chance to upgrade the MySQL servers to a newer version. We've tested it for more than a month before deploying to production, so I wasn't worried about any potential problems.

The upgrade itself went smoothly and was complete in about less than 10 minutes. Speaking of which, this is one of the things I like in MySQL upgrades as opposed to SQL Server.

  1. Copy new version
  2. Shut down the server
  3. Rename directories
  4. Start server
  5. Run mysql_upgrade
  6. Restart server

All of the above usually takes about 2 minutes, and can be reverted quite easily, if you have backup for the data files. This is compared to a SQL Server upgrade, or even a Service Pack install which includes running an installer for what can be 20 minutes or more. Uninstalling is equality lengthy, though both are accompanied by nice and useless progress bars.

After the upgrade was complete, we started the application and it wouldn't work - it was reporting timeouts. Why? Was there a problem with the upgrade? Some bug we haven't found out about? The circumstantial evidence pointed directly to the upgrade as the problem. After some digging by Pasha, the problematic query was found - it was a group by query updating some counter table. It ran in under 5 seconds before, but now it just wouldn't finish in the 30sec timeframe before timing out.

This immediately pointer to the real cause of the problem: the server restart. The buffer cache was flushed, and the index scanned by this query was not in memory. Fixing was easy, I ran the problematic query manually (it took almost 2 minutes to run). After this, the application worked again. Problem solved? Not really.

Queries of this sort should almost never exist in a system. This specific query was long since been noticed due to proactive slow-log monitoring. This problem was (ironically) fixed on the day of this incident, but the newer version was not yet deployed.

Conclusion: Think of how your application works with a cold database cache. If you need to warm it up, do so. A better path is to design the database more efficiently and avoid certain types of queries. In this case, the counter table was now updated on batches of inserts and deletes - instead of periodically.