Log in

Jens Axboe's blog

Dec. 16th, 2008

10:23 am - Sorting algorithms

When you do IO to rotating disks, you typically want to sort randomized IO and send them off to the drive in a (somewhat) ordered fashion. Even though we don't really know what the geometric layout of the drive is, the assumption is that there's a correlation between proximity of LBAs and on-disk locality. To that extent, the IO schedulers need some way of sorting the input. All the Linux IO schedulers that utilize sorting currently use one or more red-black trees to do that. This has been the default since I originally moved away from regular list sorting, a switch that happened many years ago. The choice of rb trees was mostly driven by laziness - the kernel already had an rbtree library, so plugging into that was the easy choice.
So recently I spent some time evaluating this choice. But before diving into the candidates, lets first evaluate the benchmark criteria. To do that, we need to understand how the sorting is used in the kernel. Insertion is done directly in the process context then it wants to read or write a piece of data. Extraction and deletion is usually done in interrupt (or, more precisely, soft irq context) when a request has finished a request. Drivers that do queueing will do the extraction when queueing is invoked, however that also often happens from the same soft irq context since when a request finishes, the driver will often attempt to issue a new one right away. For performance reasons, it then follows that we wish to have the bulk of the total insert/extraction/deletion cost moved to the insertion phase of the operation.

The total number of elements operated on is restricted by the number of IO requests we allow any given queue to allocate. The default here is 128 per data direction, or 256 in total. If we assume a queue depth of 32 on the hardware side, the leaves a maximum of anywhere in between 224 and 256 in the queue at any given time. A queue is rarely maxed out in both directions, so the number is usually lower than that. Lets again make the assumption that we use half of the available resources to keep the disk busy, which leaves us with 128 in total and 96 active in the queue at all times.

Now our test parameters and criteria are mostly defined. The last parameter is the input data. Sorting algorithms will behave differently with various types of input, so for this test we'll use two data sets. One will be mainly sequential IO, the other will be randomized IO. The test will run each data set for X number of total ops, with Y in the queue always, and Z elements being extracted when Y reaches Y_max (which is 128). The baseline (1.0 score) will be the default rbtree implementation. A score of 2.0 is twice as slow as the baseline, 0.5 is twice as fast.

For selecting candidates, a did a bit of book reading and google searching. I wanted to implement sorting algorithms that didn't require any type of node allocation on insert, as that quickly becomes problematic for IO operations. It also makes things slower than embedding the sort node inside the request structure that we already have. Some algorithms were selected just for kicks, others I thought had more potential to actually be useful.

Candidate #1 - AVL tree. Like rbtree, this is self-balancing tree structure. Performance
                          should be close to rbtree.

Candidate #2 - Fibonacci heap. This is a more advanced data structure, utilizing a
                          forest of trees. It has a more complex runtime analysis, with better
                          amortized running time than typical binary trees.

Candidate #3 - Skip lists. This is a different beast, basically a linked list on drugs using
                          multiple lists with links of different levels determined in a randomized

Candidate #4 - a normal linked list. Needs no further introduction.

The numbers below give performance numbers for the two input data sets, and lists
three numbers for each data set. It is the insert time, Et is the extraction time, and Tt is the total time.

Randomized input

Sequential input

To start from the bottom, it's no real surprise that a list insert is the fastest one for data that arrives pretty much completely sorted. The insertion will only be a check of the back of the list followed by an insert, a O(1) operation. Extraction is always O(1) from lists, so extraction will always be impossible to beat. AVL is pretty close to rbtrees in performance, but loses on all accounts. So we can rule that one out. Ditto skiplists, they always perform worse. The fibonacci heap does exceptionally well on insert always, but extraction is slooow. It's also pretty complex implementation wise, so we'd need a lot of justification to add something like that to the block layer.

It appears that the previous choice of rbtrees wasn't totally off. They perform acceptably for sequential IO and well for random IO. If we enlarge the request pool, rbtrees would do much better than list insert for random IO. Memory consumption is also about the same. The node size for rbtrees is sizeof(unsigned long) bigger than for the list, close enough that it doesn't matter in real life.

So I don't plan to change from rbtrees anytime soon. I may consider a hybrid structure of some sort that just avoids doing a lot of work for sequential IO, since we can detect that. Then we only do real sorting on more random input, and then the rbtree will always be a win. For non-rotating devices, a non-sorting list will be utilized. This should be a big win for high IOPS rate SSD like devices.

Dec. 10th, 2008

03:13 pm - Oracle and Linux

It's no secret that I work for Oracle and have been doing so for over 2 years now. I post emails with my Oracle email address every day and every commit that I put in the Linux kernel is tagged and signed-off with that email as well. So why am I writing this? Well, something has been annoying me for quite some time now - why does Oracle not get the recognition it deserves in the Linux development community? Most of the time when I see Oracle Linux projects mentioned in some article, there's always some snide comment on that particular association.

Oracle pays for all the work I do for the kernel. Some of that is directly related to what Oracle wants to have improved in the kernel, which is actually why I went to work for them in the first place. I mainly work on IO related projects, which is something that Oracle has a great interest in (I'm sure that's not a surprise to anyone!). But quite a lot of work is also very geared towards just making Linux better in a variety of ways and areas. I spend time on general block layer maintenance, integrating and testing patches from other contributors which isn't necessarily directly beneficial to my employer. And I'm not the only mainline kernel contributor. Chris Mason is spending a LOT of time working on btrfs, which promises to be a great next generation file system for Linux in general. Zach Brown is working on CRFS which looks like it's going to be a very interesting cache coherent remote file system. Randy Dunlap does a lot of janitorial, testing, and documentation work. There are others as well, I'm only mentioning the most community known people here.

Even the kernel commit/changes statistics that are posted every now and then show Oracle being safely in the top-10 as a contributor, yet they are routinely forgotten when people think of the big contributors. It's just a mystery to me, I wish our Linux branding was better!

Tags: ,

Dec. 9th, 2008

01:26 pm - SSD optimizations

A few months ago I got my hands on a sample of the Intel X25-E SSD drives for testing purposes. There are lots of interesting things to be said about SSD vs rotational devices, but my main interest in these devices is largely that they offer a good test base for high IOPS rates testing. So yes, the device read/write bandwidth definitely kicks a lot of ass, in fact so much that it's the first SSD out there that I would recommend to other people. I've played with various other models in the past, even expensive ones. And while they do have nice read performance, they fall flat on their face when presented with random write workloads. Additionally, apart from the flash itself, the drive internals are really outdated - most are using SATA-I 1.5gbps link speeds, and none of them offer any kind of command queuing. What are they thinking?

Anyway, back to the topic at hand. As my initial primary goal with this device was increasing the IOPS rate of the Linux IO stack, testing was mainly done with O_DIRECT libaio. As with most of my testing, I tend to use fio for quickly setting up a prototype workload. I wanted to use small block sizes for various reasons. This brings the IOPS rate up, thus better high lighting problem areas in Linux that would tend to be hidden more with larger block sizes. My concern with O_DIRECT and small block sizes is that get_user_pages() overhead would dominate the per-command overhead, however those concerns turned out to be completely unfounded after Nick Piggins fast path get_user_pages() patches have gone in. That is nice, since I didn't really want to spend my time optimizing the VM for this!

The first bottleneck that reared its ugly head was hpet read in the kernel. It was so large that switching to a tsc based clock source for the kernel increased the IOPS rate by over 10%. After a bit of head scratching, I started looking at how many gettimeofday() calls that fio generates for this workload - a LOT. Fio wants to tell you pretty much everything about IO latencies, both going in to the kernel, at the device end, and reaping them. This in turn generates a lot of gettimeofday() traffic. A few hacks and options were put into fio to diminish the number of gtod calls and things looked much better with the default clock source.

To step back a bit before going further, one usually has a preconceived notion of a few changes that'll speed things up before doing this type of testing. One such thing was the IO plug management in Linux. Plugging is meant to optimize the command size for the device, at the expense of a bit of latency on submission. This makes a lot of sense for rotational storage, not so much for SSD devices. Turns out that disabling plugging gains us about 1% on just this device, when doing 30k-40k IOPS.

After disabling plugging, I went further with the profiling. The top entry is (not surprising) the ahci interrupt handler. There are ways to speed it up a bit without going too far, but my goal was IO stack reduction in general, so I left it alone for now. Next on the list was slab allocations. When issuing a piece of IO in Linux, we do quite a few allocations along the way. First we allocate a bio, then we allocate a bio_vec list to fill pages into. The bio_vec allocations come from a number of slab pools, sized in powers of two (1 entry, 4, 16, 64, 128, and 256). Then a request is allocated to attach this bio to. Going into the driver layers, SCSI allocates a SCSI command and sense buffer before sending it to the device. So that's 5 allocations just for the IO request, and there can easily be more from files ystems etc. We can't get away with not allocating a bio, but we can eliminate at least some of bio_vec allocations fairly easily. Most of the IO issued in a system is fairly small, in the range of 4 - 16kb. If we embed the bio_vec into the bio for a small number of pages, we can get rid of this allocation. To take that step a bit further, I wanted to incorporate a related change that Chris Mason had previously voiced an interest in. When a file system allocates a bio, it typically allocates a private structure for information related to that piece of IO. So the patch set in full allows for a piece of memory to be reserved both at the front of the bio (for file systems) and at the end (for the bio_vec). This then combines these three allocations into one. A similar trick was done for the SCSI command, so that the sense buffer is allocated at the very end as well. Finally, I added a struct request allocation cache to avoid going into the memory allocator there as well. The end result is that we gained 3-4% for this workload.

Another entry that was high in the profile list was lookup_ioctx(). This is what finds the kernel aio context for a given process when that process does an aio io_submit() to submit a piece of IO. When I read the kernel implementation, several red lights whent off in my head. The lookup was a doubly linked list, guarded by a read/write spinlock. While the O(n) choice of a linked list may seem problematic, there's usually only a single entry on this list since processes generally do not set up a lot of IO contexts to submit IO against. But experience tells me that reader/write locks are usually trouble, as they are much slower than a normal spinlock. Personally I think the use of them is usually a design mistake and that it would be better if we removed them from the kernel! The list management in aio was opencoded, so I converted that to a hlist structure and protected it with RCU. Using RCU here makes a lot of sense, since the list is basically only manipulated when the IO context is setup or torn down - for the long duration of actually submitting IO, it's purely a reader side thing. This got us about 2% gain on even a puny 2-way system.

At this years kernel summit, I had a talk with Matthew Wilcox about a recent change to the IO accounting that Intel had found troublesome. In the 2.4 days, we had per-partition statistics on the IO request, while for 2.6 kernels we had dumped that feature. About a year ago that feature was reinstated. The partition lookup scans the kernel partition table to find the right partition to account, and if that partition number is in the higher range, we end up spending a bit of time doing this for every IO. You don't notice if you are primarily doing IO to sda1 or sda2, but if sda16 is hit a lot it starts to show. I added a one-hit cache in front of this lookup and got rid of most of that lookup time. This trick is similar to what we do for IO merging, and it works exceptionally well. The reason being that while you typically have more than one partition active for IO at any given point in time, submissions tend to come in batches. If these batches are large enough, we get a lot of hits in the one-hit cache before having to invalidate it. So this again got us a few percent of improvement for this type of scenario.

I'll stop detailing the optimizations here and save some for future blog entries, there's still lots of improvements to be made and I have lots of stuff in-progress that I hope will be very interesting. And I haven't even gotten to buffered IO yet! If you are curious about the above changes, you are encouraged to inspect the for-2.6.29 branch of my block git repo. Find that here. In the ssd branch there are more experimental changes that will need a bit longer to mature.

Navigate: (Next 10 Entries)