When you do IO to rotating disks, you typically want to sort randomized IO and send them off to the drive in a (somewhat) ordered fashion. Even though we don't really know what the geometric layout of the drive is, the assumption is that there's a correlation between proximity of LBAs and on-disk locality. To that extent, the IO schedulers need some way of sorting the input. All the Linux IO schedulers that utilize sorting currently use one or more red-black trees to do that. This has been the default since I originally moved away from regular list sorting, a switch that happened many years ago. The choice of rb trees was mostly driven by laziness - the kernel already had an rbtree library, so plugging into that was the easy choice.
So recently I spent some time evaluating this choice. But before diving into the candidates, lets first evaluate the benchmark criteria. To do that, we need to understand how the sorting is used in the kernel. Insertion is done directly in the process context then it wants to read or write a piece of data. Extraction and deletion is usually done in interrupt (or, more precisely, soft irq context) when a request has finished a request. Drivers that do queueing will do the extraction when queueing is invoked, however that also often happens from the same soft irq context since when a request finishes, the driver will often attempt to issue a new one right away. For performance reasons, it then follows that we wish to have the bulk of the total insert/extraction/deletion cost moved to the insertion phase of the operation.
The total number of elements operated on is restricted by the number of IO requests we allow any given queue to allocate. The default here is 128 per data direction, or 256 in total. If we assume a queue depth of 32 on the hardware side, the leaves a maximum of anywhere in between 224 and 256 in the queue at any given time. A queue is rarely maxed out in both directions, so the number is usually lower than that. Lets again make the assumption that we use half of the available resources to keep the disk busy, which leaves us with 128 in total and 96 active in the queue at all times.
Now our test parameters and criteria are mostly defined. The last parameter is the input data. Sorting algorithms will behave differently with various types of input, so for this test we'll use two data sets. One will be mainly sequential IO, the other will be randomized IO. The test will run each data set for X number of total ops, with Y in the queue always, and Z elements being extracted when Y reaches Y_max (which is 128). The baseline (1.0 score) will be the default rbtree implementation. A score of 2.0 is twice as slow as the baseline, 0.5 is twice as fast.
For selecting candidates, a did a bit of book reading and google searching. I wanted to implement sorting algorithms that didn't require any type of node allocation on insert, as that quickly becomes problematic for IO operations. It also makes things slower than embedding the sort node inside the request structure that we already have. Some algorithms were selected just for kicks, others I thought had more potential to actually be useful.
Candidate #1 -
AVL tree. Like rbtree, this is self-balancing tree structure. Performance
should be close to rbtree.
Candidate #2 -
Fibonacci heap. This is a more advanced data structure, utilizing a
forest of trees. It has a more complex runtime analysis, with better
amortized running time than typical binary trees.
Candidate #3 -
Skip lists. This is a different beast, basically a linked list on drugs using
multiple lists with links of different levels determined in a randomized
fashion.
Candidate #4 - a normal linked list. Needs no further introduction.
The numbers below give performance numbers for the two input data sets, and lists
three numbers for each data set.
It is the insert time,
Et is the extraction time, and
Tt is the total time.
Randomized input