Jens Axboe's blog
Dec. 16th, 2008
10:23 am - Sorting algorithms
To start from the bottom, it's no real surprise that a list insert is the fastest option for data that arrives pretty much completely sorted. The insertion is then only a check of the back of the list followed by an insert, an O(1) operation. Extraction is always O(1) from lists, so on that front lists are impossible to beat. AVL trees are pretty close to rbtrees in performance, but lose on all accounts, so we can rule those out. Ditto skiplists, they always perform worse. The Fibonacci heap does exceptionally well on inserts, but extraction is slooow. It's also pretty complex implementation-wise, so we'd need a lot of justification to add something like that to the block layer.
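The tail-check fast path for sorted arrivals can be sketched in userspace C. All names here (struct io_req, sorted_insert, the sector field) are illustrative stand-ins, not the real block-layer structures:

```c
#include <stddef.h>
#include <assert.h>

/* Hypothetical request node; the real request structure is far bigger. */
struct io_req {
	unsigned long long sector;
	struct io_req *next, *prev;
};

struct req_list {
	struct io_req *head, *tail;
};

/*
 * Sorted insert that checks the tail first.  Input that arrives already
 * sorted hits the tail check and inserts in O(1); only out-of-order
 * requests pay for the backwards walk.
 */
static void sorted_insert(struct req_list *l, struct io_req *req)
{
	struct io_req *pos = l->tail;

	if (!pos) {				/* empty list */
		req->prev = req->next = NULL;
		l->head = l->tail = req;
		return;
	}
	if (req->sector >= pos->sector) {	/* common, sorted case: O(1) */
		req->prev = pos;
		req->next = NULL;
		pos->next = req;
		l->tail = req;
		return;
	}
	/* Rare out-of-order case: walk backwards to find the slot. */
	while (pos->prev && pos->prev->sector > req->sector)
		pos = pos->prev;
	req->next = pos;
	req->prev = pos->prev;
	if (pos->prev)
		pos->prev->next = req;
	else
		l->head = req;
	pos->prev = req;
}
```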
It appears that the previous choice of rbtrees wasn't totally off. They perform acceptably for sequential IO and well for random IO. If we enlarge the request pool, rbtrees do much better than list insert for random IO. Memory consumption is also about the same - the rbtree node is sizeof(unsigned long) bigger than the list node, close enough that it doesn't matter in real life.
So I don't plan to change from rbtrees anytime soon. I may consider a hybrid structure of some sort that just avoids doing a lot of work for sequential IO, since we can detect that. Then we only do real sorting on more random input, and then the rbtree will always be a win. For non-rotating devices, a non-sorting list will be utilized. This should be a big win for high IOPS rate SSD like devices.
Dec. 10th, 2008
03:13 pm - Oracle and Linux
It's no secret that I work for Oracle and have been doing so for over 2 years now. I post emails with my Oracle email address every day and every commit that I put in the Linux kernel is tagged and signed-off with that email as well. So why am I writing this? Well, something has been annoying me for quite some time now - why does Oracle not get the recognition it deserves in the Linux development community? Most of the time when I see Oracle Linux projects mentioned in some article, there's always some snide comment on that particular association.
Oracle pays for all the work I do for the kernel. Some of that is directly related to what Oracle wants to have improved in the kernel, which is actually why I went to work for them in the first place. I mainly work on IO related projects, which is something that Oracle has a great interest in (I'm sure that's not a surprise to anyone!). But quite a lot of work is also very geared towards just making Linux better in a variety of ways and areas. I spend time on general block layer maintenance, integrating and testing patches from other contributors, which isn't necessarily directly beneficial to my employer. And I'm not the only mainline kernel contributor. Chris Mason is spending a LOT of time working on btrfs, which promises to be a great next generation file system for Linux in general. Zach Brown is working on CRFS, which looks like it's going to be a very interesting cache coherent remote file system. Randy Dunlap does a lot of janitorial, testing, and documentation work. There are others as well, I'm only mentioning the most community-known people here.
Even the kernel commit/changes statistics that are posted every now and then show Oracle being safely in the top-10 as a contributor, yet they are routinely forgotten when people think of the big contributors. It's just a mystery to me, I wish our Linux branding was better!
Dec. 9th, 2008
01:26 pm - SSD optimizations
A few months ago I got my hands on a sample of the Intel X25-E SSD drives for testing purposes. There are lots of interesting things to be said about SSD vs rotational devices, but my main interest in these devices is largely that they offer a good test base for high IOPS rate testing. So yes, the device read/write bandwidth definitely kicks a lot of ass, in fact so much that it's the first SSD out there that I would recommend to other people. I've played with various other models in the past, even expensive ones. And while they do have nice read performance, they fall flat on their face when presented with random write workloads. Additionally, apart from the flash itself, the drive internals are really outdated - most are using SATA-I 1.5Gbps link speeds, and none of them offer any kind of command queuing. What are they thinking?
Anyway, back to the topic at hand. As my initial primary goal with this device was increasing the IOPS rate of the Linux IO stack, testing was mainly done with O_DIRECT libaio. As with most of my testing, I tend to use fio for quickly setting up a prototype workload. I wanted to use small block sizes for various reasons. This brings the IOPS rate up, thus better highlighting problem areas in Linux that would tend to be hidden more with larger block sizes. My concern with O_DIRECT and small block sizes was that get_user_pages() overhead would dominate the per-command overhead, however those concerns turned out to be completely unfounded after Nick Piggin's fast-path get_user_pages() patches went in. That is nice, since I didn't really want to spend my time optimizing the VM for this!
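A fio job along these lines reproduces the shape of such a workload - the device path and exact parameter values are illustrative placeholders, not the settings used in this testing:

```ini
# Small-block O_DIRECT random reads via libaio.
# /dev/sdb is a placeholder - point it at your own test device.
[ssd-iops]
filename=/dev/sdb
rw=randread
bs=4k
direct=1
ioengine=libaio
iodepth=32
runtime=60
time_based
```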
The first bottleneck that reared its ugly head was the HPET read in the kernel. It was so costly that switching to a TSC-based clock source for the kernel increased the IOPS rate by over 10%. After a bit of head scratching, I started looking at how many gettimeofday() calls fio generates for this workload - a LOT. Fio wants to tell you pretty much everything about IO latencies: going into the kernel, at the device end, and reaping them. This in turn generates a lot of gettimeofday() traffic. A few hacks and options were put into fio to diminish the number of gtod calls, and things looked much better with the default clock source.
To step back a bit before going further, one usually has a preconceived notion of a few changes that'll speed things up before doing this type of testing. One such thing was the IO plug management in Linux. Plugging is meant to optimize the command size for the device, at the expense of a bit of latency on submission. This makes a lot of sense for rotational storage, not so much for SSD devices. Turns out that disabling plugging gains us about 1% on just this device, when doing 30k-40k IOPS.
After disabling plugging, I went further with the profiling. The top entry is (not surprisingly) the ahci interrupt handler. There are ways to speed it up a bit without going too far, but my goal was IO stack reduction in general, so I left it alone for now. Next on the list was slab allocations. When issuing a piece of IO in Linux, we do quite a few allocations along the way. First we allocate a bio, then we allocate a bio_vec list to fill pages into. The bio_vec allocations come from a number of slab pools, sized in powers of two (1 entry, 4, 16, 64, 128, and 256). Then a request is allocated to attach this bio to. Going into the driver layers, SCSI allocates a SCSI command and sense buffer before sending it to the device. So that's 5 allocations just for the IO request, and there can easily be more from file systems etc. We can't get away with not allocating a bio, but we can eliminate at least some of the bio_vec allocations fairly easily. Most of the IO issued in a system is fairly small, in the range of 4 - 16kb. If we embed the bio_vec into the bio for a small number of pages, we can get rid of this allocation. To take that step a bit further, I wanted to incorporate a related change that Chris Mason had previously voiced an interest in. When a file system allocates a bio, it typically allocates a private structure for information related to that piece of IO. So the patch set in full allows for a piece of memory to be reserved both at the front of the bio (for file systems) and at the end (for the bio_vec). This then combines these three allocations into one. A similar trick was done for the SCSI command, so that the sense buffer is allocated at the very end as well. Finally, I added a struct request allocation cache to avoid going into the memory allocator there as well. The end result is that we gained 3-4% for this workload.
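The single-allocation idea can be sketched in plain C: reserve front padding for the file system's private data and tail space for a small inline vector, so one allocation replaces three. The structure and function names below are cut-down illustrations, not the actual kernel bio API:

```c
#include <assert.h>
#include <stdlib.h>

/* Cut-down stand-ins for the real structures. */
struct bio_vec { void *page; unsigned int len, offset; };

struct bio {
	unsigned short vcnt, max_vecs;
	struct bio_vec *io_vec;		/* points at the inline tail area */
	/* ... the rest of the real bio elided ... */
};

/*
 * One allocation serving three purposes:
 *   [ front_pad (fs private data) | struct bio | nr_vecs * bio_vec ]
 * The caller gets back the bio; the private area sits front_pad bytes
 * in front of it, and the vecs live directly behind it.  Freeing must
 * go through the start of the block, i.e. bio minus front_pad.
 */
static struct bio *bio_alloc_inline(size_t front_pad, unsigned int nr_vecs)
{
	void *mem = calloc(1, front_pad + sizeof(struct bio) +
			      nr_vecs * sizeof(struct bio_vec));
	struct bio *bio;

	if (!mem)
		return NULL;
	bio = (struct bio *)((char *)mem + front_pad);
	bio->max_vecs = nr_vecs;
	bio->io_vec = (struct bio_vec *)(bio + 1);
	return bio;
}

/* Recover the file-system private area in front of the bio. */
static void *bio_fs_private(struct bio *bio, size_t front_pad)
{
	return (char *)bio - front_pad;
}
```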
Another entry that was high in the profile list was lookup_ioctx(). This is what finds the kernel aio context for a given process when that process does an aio io_submit() to submit a piece of IO. When I read the kernel implementation, several red lights went off in my head. The lookup was a doubly linked list, guarded by a read/write spinlock. While the O(n) choice of a linked list may seem problematic, there's usually only a single entry on this list since processes generally do not set up a lot of IO contexts to submit IO against. But experience tells me that reader/writer locks are usually trouble, as they are much slower than a normal spinlock. Personally I think the use of them is usually a design mistake and that it would be better if we removed them from the kernel! The list management in aio was open-coded, so I converted that to a hlist structure and protected it with RCU. Using RCU here makes a lot of sense, since the list is basically only manipulated when the IO context is set up or torn down - for the long duration of actually submitting IO, it's purely a reader side thing. This got us about 2% gain on even a puny 2-way system.
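A userspace sketch of the hlist shape involved - the kernel version wraps the traversal in rcu_read_lock()/rcu_read_unlock() and uses the _rcu list iterators, which this single-threaded illustration leaves out; the struct kioctx here is a minimal stand-in:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal hlist, modeled on the kernel's: the head is one pointer wide. */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

struct kioctx {
	unsigned long user_id;		/* what io_submit() looks up by */
	struct hlist_node list;
};

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/*
 * Reader-side lookup.  In the kernel this runs under rcu_read_lock(),
 * so submitters never contend on a lock; only context setup/teardown
 * serialize against each other on the writer side.
 */
static struct kioctx *lookup_ioctx(struct hlist_head *h, unsigned long id)
{
	struct hlist_node *n;

	for (n = h->first; n; n = n->next) {
		struct kioctx *ctx = (struct kioctx *)
			((char *)n - offsetof(struct kioctx, list));
		if (ctx->user_id == id)
			return ctx;
	}
	return NULL;
}
```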
At this year's Kernel Summit, I had a talk with Matthew Wilcox about a recent change to the IO accounting that Intel had found troublesome. In the 2.4 days, we had per-partition statistics on the IO request, while for 2.6 kernels we had dumped that feature. About a year ago that feature was reinstated. The partition lookup scans the kernel partition table to find the right partition to account, and if that partition number is in the higher range, we end up spending a bit of time doing this for every IO. You don't notice if you are primarily doing IO to sda1 or sda2, but if sda16 is hit a lot it starts to show. I added a one-hit cache in front of this lookup and got rid of most of that lookup time. This trick is similar to what we do for IO merging, and it works exceptionally well. The reason is that while you typically have more than one partition active for IO at any given point in time, submissions tend to come in batches. If these batches are large enough, we get a lot of hits in the one-hit cache before having to invalidate it. So this again got us a few percent of improvement for this type of scenario.
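The one-hit cache can be sketched like so - the partition table walk and structure names are simplified placeholders, not the real genhd code:

```c
#include <assert.h>
#include <stddef.h>

struct partition { unsigned long long start, len; };

struct disk {
	struct partition parts[16];
	int nparts;
	struct partition *last_hit;	/* one-hit cache */
};

/* Slow path: linear scan of the partition table. */
static struct partition *part_scan(struct disk *d, unsigned long long sector)
{
	int i;

	for (i = 0; i < d->nparts; i++) {
		struct partition *p = &d->parts[i];

		if (sector >= p->start && sector < p->start + p->len)
			return p;
	}
	return NULL;
}

/*
 * Fast path: check the last partition we resolved.  Batched IO to the
 * same partition hits here and skips the table walk entirely.  The
 * cache must be invalidated whenever the partition table changes.
 */
static struct partition *part_lookup(struct disk *d, unsigned long long sector)
{
	struct partition *p = d->last_hit;

	if (p && sector >= p->start && sector < p->start + p->len)
		return p;
	p = part_scan(d, sector);
	d->last_hit = p;
	return p;
}
```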
I'll stop detailing the optimizations here and save some for future blog entries - there are still lots of improvements to be made, and I have lots of stuff in progress that I hope will be very interesting. And I haven't even gotten to buffered IO yet! If you are curious about the above changes, you are encouraged to inspect the for-2.6.29 branch of my block git repo. In the ssd branch there are more experimental changes that will need a bit longer to mature.