Jens Axboe's blog - SSD optimizations
Dec. 9th, 2008
01:26 pm - SSD optimizations
A few months ago I got my hands on a sample of the Intel X25-E SSD drives for testing purposes. There are lots of interesting things to be said about SSD vs rotational devices, but my main interest in these devices is largely that they offer a good test base for high IOPS rates testing. So yes, the device read/write bandwidth definitely kicks a lot of ass, in fact so much that it's the first SSD out there that I would recommend to other people. I've played with various other models in the past, even expensive ones. And while they do have nice read performance, they fall flat on their face when presented with random write workloads. Additionally, apart from the flash itself, the drive internals are really outdated - most are using SATA-I 1.5gbps link speeds, and none of them offer any kind of command queuing. What are they thinking?
Anyway, back to the topic at hand. As my initial primary goal with this device was increasing the IOPS rate of the Linux IO stack, testing was mainly done with O_DIRECT libaio. As with most of my testing, I tend to use fio for quickly setting up a prototype workload. I wanted to use small block sizes for various reasons. This brings the IOPS rate up, thus better high lighting problem areas in Linux that would tend to be hidden more with larger block sizes. My concern with O_DIRECT and small block sizes is that get_user_pages() overhead would dominate the per-command overhead, however those concerns turned out to be completely unfounded after Nick Piggins fast path get_user_pages() patches have gone in. That is nice, since I didn't really want to spend my time optimizing the VM for this!
The first bottleneck that reared its ugly head was hpet read in the kernel. It was so large that switching to a tsc based clock source for the kernel increased the IOPS rate by over 10%. After a bit of head scratching, I started looking at how many gettimeofday() calls that fio generates for this workload - a LOT. Fio wants to tell you pretty much everything about IO latencies, both going in to the kernel, at the device end, and reaping them. This in turn generates a lot of gettimeofday() traffic. A few hacks and options were put into fio to diminish the number of gtod calls and things looked much better with the default clock source.
To step back a bit before going further, one usually has a preconceived notion of a few changes that'll speed things up before doing this type of testing. One such thing was the IO plug management in Linux. Plugging is meant to optimize the command size for the device, at the expense of a bit of latency on submission. This makes a lot of sense for rotational storage, not so much for SSD devices. Turns out that disabling plugging gains us about 1% on just this device, when doing 30k-40k IOPS.
After disabling plugging, I went further with the profiling. The top entry is (not surprising) the ahci interrupt handler. There are ways to speed it up a bit without going too far, but my goal was IO stack reduction in general, so I left it alone for now. Next on the list was slab allocations. When issuing a piece of IO in Linux, we do quite a few allocations along the way. First we allocate a bio, then we allocate a bio_vec list to fill pages into. The bio_vec allocations come from a number of slab pools, sized in powers of two (1 entry, 4, 16, 64, 128, and 256). Then a request is allocated to attach this bio to. Going into the driver layers, SCSI allocates a SCSI command and sense buffer before sending it to the device. So that's 5 allocations just for the IO request, and there can easily be more from files ystems etc. We can't get away with not allocating a bio, but we can eliminate at least some of bio_vec allocations fairly easily. Most of the IO issued in a system is fairly small, in the range of 4 - 16kb. If we embed the bio_vec into the bio for a small number of pages, we can get rid of this allocation. To take that step a bit further, I wanted to incorporate a related change that Chris Mason had previously voiced an interest in. When a file system allocates a bio, it typically allocates a private structure for information related to that piece of IO. So the patch set in full allows for a piece of memory to be reserved both at the front of the bio (for file systems) and at the end (for the bio_vec). This then combines these three allocations into one. A similar trick was done for the SCSI command, so that the sense buffer is allocated at the very end as well. Finally, I added a struct request allocation cache to avoid going into the memory allocator there as well. The end result is that we gained 3-4% for this workload.
Another entry that was high in the profile list was lookup_ioctx(). This is what finds the kernel aio context for a given process when that process does an aio io_submit() to submit a piece of IO. When I read the kernel implementation, several red lights whent off in my head. The lookup was a doubly linked list, guarded by a read/write spinlock. While the O(n) choice of a linked list may seem problematic, there's usually only a single entry on this list since processes generally do not set up a lot of IO contexts to submit IO against. But experience tells me that reader/write locks are usually trouble, as they are much slower than a normal spinlock. Personally I think the use of them is usually a design mistake and that it would be better if we removed them from the kernel! The list management in aio was opencoded, so I converted that to a hlist structure and protected it with RCU. Using RCU here makes a lot of sense, since the list is basically only manipulated when the IO context is setup or torn down - for the long duration of actually submitting IO, it's purely a reader side thing. This got us about 2% gain on even a puny 2-way system.
At this years kernel summit, I had a talk with Matthew Wilcox about a recent change to the IO accounting that Intel had found troublesome. In the 2.4 days, we had per-partition statistics on the IO request, while for 2.6 kernels we had dumped that feature. About a year ago that feature was reinstated. The partition lookup scans the kernel partition table to find the right partition to account, and if that partition number is in the higher range, we end up spending a bit of time doing this for every IO. You don't notice if you are primarily doing IO to sda1 or sda2, but if sda16 is hit a lot it starts to show. I added a one-hit cache in front of this lookup and got rid of most of that lookup time. This trick is similar to what we do for IO merging, and it works exceptionally well. The reason being that while you typically have more than one partition active for IO at any given point in time, submissions tend to come in batches. If these batches are large enough, we get a lot of hits in the one-hit cache before having to invalidate it. So this again got us a few percent of improvement for this type of scenario.
I'll stop detailing the optimizations here and save some for future blog entries, there's still lots of improvements to be made and I have lots of stuff in-progress that I hope will be very interesting. And I haven't even gotten to buffered IO yet! If you are curious about the above changes, you are encouraged to inspect the for-2.6.29 branch of my block git repo. Find that here. In the ssd branch there are more experimental changes that will need a bit longer to mature.