
Jens Axboe's blog - Sorting algorithms

Dec. 16th, 2008

10:23 am - Sorting algorithms

When you do IO to rotating disks, you typically want to sort randomized IO requests and send them off to the drive in a (somewhat) ordered fashion. Even though we don't really know the geometric layout of the drive, the assumption is that there's a correlation between proximity of LBAs and on-disk locality. To that end, the IO schedulers need some way of sorting the input. All the Linux IO schedulers that utilize sorting currently use one or more red-black trees to do that. This has been the default since I originally moved away from regular list sorting, a switch that happened many years ago. The choice of rb trees was mostly driven by laziness - the kernel already had an rbtree library, so plugging into that was the easy choice.
 
So recently I spent some time evaluating this choice. But before diving into the candidates, let's first define the benchmark criteria. To do that, we need to understand how the sorting is used in the kernel. Insertion is done directly in process context, when a process wants to read or write a piece of data. Extraction and deletion are usually done in interrupt (or, more precisely, soft irq) context, when a request completes. Drivers that do queueing will do the extraction when queueing is invoked, but that also often happens from the same soft irq context, since when a request finishes, the driver will often attempt to issue a new one right away. For performance reasons, it then follows that we wish to have the bulk of the total insert/extract/delete cost moved to the insertion phase of the operation.

The total number of elements operated on is restricted by the number of IO requests we allow any given queue to allocate. The default here is 128 per data direction, or 256 in total. If we assume a queue depth of 32 on the hardware side, that leaves a maximum of anywhere between 224 and 256 in the queue at any given time. A queue is rarely maxed out in both directions, so the number is usually lower than that. Let's again assume that we use half of the available resources to keep the disk busy, which leaves us with 128 in total and 96 active in the queue at all times.

Now our test parameters and criteria are mostly defined. The last parameter is the input data. Sorting algorithms will behave differently with various types of input, so for this test we'll use two data sets. One will be mainly sequential IO, the other will be randomized IO. The test will run each data set for X number of total ops, with Y in the queue always, and Z elements being extracted when Y reaches Y_max (which is 128). The baseline (1.0 score) will be the default rbtree implementation. A score of 2.0 is twice as slow as the baseline, 0.5 is twice as fast.

For selecting candidates, I did a bit of book reading and Google searching. I wanted to implement sorting algorithms that didn't require any type of node allocation on insert, as that quickly becomes problematic for IO operations. It also makes things slower than embedding the sort node inside the request structure that we already have. Some algorithms were selected just for kicks, others I thought had more potential to actually be useful.

Candidate #1 - AVL tree. Like rbtree, this is a self-balancing tree structure. Performance should be close to rbtree.

Candidate #2 - Fibonacci heap. This is a more advanced data structure, utilizing a forest of trees. It has a more complex runtime analysis, with better amortized running time than typical binary trees.

Candidate #3 - Skip lists. This is a different beast, basically a linked list on drugs, using multiple lists with links of different levels determined in a randomized fashion.

Candidate #4 - A normal linked list. Needs no further introduction.

The tables below give performance numbers for the two input data sets, listing three numbers for each: It is the insert time, Et is the extraction time, and Tt is the total time.

Randomized input

Algorithm    It     Et     Tt
AVL          1.16   2.13   1.50
Fibheap      0.57   4.05   1.82
Skiplist     2.15   2.67   2.34
List         1.56   0.18   1.07
Rbtree       1.00   1.00   1.00

Sequential input

Algorithm    It     Et     Tt
AVL          0.92   2.70   1.28
Fibheap      0.52   7.19   1.87
Skiplist     1.82   3.29   2.11
List         0.26   0.31   0.27
Rbtree       1.00   1.00   1.00


To start from the bottom, it's no real surprise that a list insert is the fastest for data that arrives pretty much completely sorted. The insertion will only be a check of the back of the list followed by an insert, an O(1) operation. Extraction is always O(1) from lists, so extraction will always be impossible to beat. AVL is pretty close to rbtrees in performance, but loses on all counts. So we can rule that one out. Ditto skiplists, they always perform worse. The fibonacci heap does exceptionally well on insert always, but extraction is slooow. It's also pretty complex implementation-wise, so we'd need a lot of justification to add something like that to the block layer.

It appears that the previous choice of rbtrees wasn't totally off. They perform acceptably for sequential IO and well for random IO. If we enlarge the request pool, rbtrees would do much better than list insert for random IO. Memory consumption is also about the same. The node size for rbtrees is sizeof(unsigned long) bigger than for the list, close enough that it doesn't matter in real life.

So I don't plan to change from rbtrees anytime soon. I may consider a hybrid structure of some sort that avoids doing a lot of work for sequential IO, since we can detect that. Then we only do real sorting on more random input, where the rbtree will always be a win. For non-rotating devices, a non-sorting list will be utilized. This should be a big win for high IOPS rate, SSD-like devices.