As part of a larger database project Bitlancer was tasked with load testing a new MySQL instance. To follow are some notes and tips related to our experience.
The process and tools you decide to use for a load test can vary widely. We found that tools fell into one of two categories, though some tools fell into both categories: benchmarking software and replay software. The former includes tools that generate their own schema and queries that stress the system in specific ways. These tools typically monitor a series of metrics and generate performance scores for different parts of the system. In contrast, the latter set of tools ingest and replay your real query traffic and measure query performance. The advantage to using replay software is that you get feedback about real world performance, specific to your schema, queries, and data set.
For this particular project, executing a benchmark had little value. We were more interested in real world performance differences with our data set because we were migrating our client from Percona 5.5 to MariaDB 10.1, so some performance improvements were already expected.
The tool we chose to use for this load test was largely governed by our apprehension to modify the existing instance software or configurations. For example, there are numerous tools that can parse the slow query log or digest the performance schema; however, these tools require configuration changes or software installations. Instead, we decided to use the go-mysql-replay tool which could replay MySQL traffic from a packet capture.
After testing the go-mysql-replay tool we prepared the following procedure:
- Setup the new MySQL instance as a replication slave to ensure it had the same data as the production instance for the load test. We’d plan to halt replication just before the data capture by issuing a
FLUSH TABLES WITH READ LOCKon the master and executing
RESET SLAVE ALLon the new instance to ensure consistency.
- On the production instance, run tcpdump and generate a packet capture during the busiest 30 - 60 minute window. During this same window we’d use a stats tool to capture some key system and mysql performance metrics, like CPU and memory utilization, disk I/O, MySQL connections, and MySQL queries per second.
- On the new MySQL instance, replay the production traffic using the go-mysql-replay tool. This was a two step process. First, we would generate a second data file containing read-only queries and replay this data file to ensure the instance was warmed up (fill the InnoDB buffer pool). Second, replay the original data file containing read and write queries. For both parts, we recorded the same key metrics we observed during the production data capture. It’s worth noting that we considered running the replay from another server to prevent the tool’s execution from skewing our metrics but this turned out to be a non-issue. We were aware of this as a potential complication and were prepared to address it.
- Assess the performance of the new system by reviewing the metrics we observed, comparing the overall runtime of the replay, and examining the go-mysql-replay logs which contained timing information pertinent to performance of MySQL.
Our first load test did not go as planned. About two minutes into the load test the tool appeared to lock up and I/O fell to nearly zero. Thankfully, the go-mysql-replay tool was open-source and we were able to dig into the code. It turned out there was an issue that could cause it to block while reading/parsing the data file. We added a fix and re-compiled the tool.
Unfortunately, our next three load tests were unsuccessful. The tool would replay traffic, but we discovered various issues ranging from multi-second delays in replay traffic to exceptionally low throughput. Additionally, MySQL was reporting intermitent
Aborted connection errors. We performed extensive debugging and tracked the issue down to the go-mysql-replay tool. We couldn’t quickly isolate the issue to a specific problem, but the
Aborted connection errors led us to believe it was an issue with how go-mysql-replay handles connection pooling. This turned out to be correct and, after a minor rewrite, it began performing as expected.
Our fifth and final load test was a success! The tool replayed 30 minutes of traffic in just under 30 minutes. Additionally, the metrics looked good and indicated we might see the performance improvements we were expecting.
My takeway from this project? A continued appreciation for open-source software and the ability to fix your own problems :)