Jens Axboe's blog
Sep. 25th, 2009
06:32 am - LPC 2009 talk
I gave a talk on the new per-bdi writeback threads at the Linux Plumbers Conference today, which I think went pretty well. Since I've had various people ask me for the slides, here they are.
Jun. 22nd, 2009
09:22 pm - Cheaper SSD reliability, continued
So after twice promising me to get info from 'an engineer' in a 10 day time span, I pushed OCZ again today. The answer is that it's likely "bad blocks" on the drive and they offered to exchange it. Now, I don't know the internal secret sauce to their flash chip striping, but a bad flash block may explain the issue. And personally I care a lot less about this specific drive than the larger issue at hand, which is: Can we trust these SSD drives? From this experience, the answer so far is unequivocally no. If the flash block/page/erase block is indeed part (or partially bad), I want to know! Don't just send me all ones in the data, that's not acceptable. If their drive/fw doesn't error handling, I'm quite sure the customers would like to know this fact.
Apparently Indilinx does the firmware for these drives for at least two manufacturers. Given that the SSD consumer market is steaming ahead at this point, I'm guessing there's a huge rush to reduce the time to market. A safe bet would be that the firmware is perhaps a little too rushed in this case. Coincidentally, the bad drive is running fw 1.10. Version 1.30 lists this little juicy fix among the others: "Read fail handling". I'll try 1.30 on this drive and re-read the data, just to see what happens.
For the time being, I can't recommend using Indilinx based drives anywhere except for throw away data. If they can't even tell you when the data has gone bad, then they really can't be used for much else. At least use btrfs with data checksums enabled, then you could catch a problem like this. Yes I did run btrfs on my Vertex, and yes I did disabled datacsum to avoid the extra CPU use on my laptop... My personal recommendation would be to stick with Intel or Samsung SSD drives where data matters.
Jun. 18th, 2009
12:10 pm - Cheaper SSD reliability?
In earlier blog entries, I praised the Intel X25-E for its performance. I also have high hopes for the reliability of the drive. By reliability, I refer to data integrity as well as endurance. It's not that I have much information to back this up, but I know that Intel have put a lot of testing into them.
So while the X25-E is extremely nice, it's also very pricey. Recently I needed a few more SSD's for testing purposes, and despite public begging on this blog, Intel hasn't sent me any more drives. As I'm sure most are well aware, the cheaper SSD drives were mostly utter crap. Even most expensive SSD drives have been crap, mostly due to using that infamous JMicron flash controller that would have done more good as nice sand on the beach instead of being manufactured into silicon. Now there are other alternatives though, and the Indilinx controller looked like a good option. OCZ recently introduced a Vertex series that uses this controller. Not only does it perform better, it's also not 80s ATA tech. It has NCQ and TRIM support, which is very nice.
I went out and bought a few of these for testing. One I put in the laptop and the other in a test box. Performance is good, even random 4kb writes actually work. This is where the crap SSD's fall apart. However, as opposed to the Intel drive, I didn't have a lot of faith in the reliability of these drives. Early firmwares were plagued with errors, and even the just released v1.30 firmware fixes issues that seem like rather basic functionality. An example of that would be mishandling ATA commands with zero sector count. But I decided to give them the benefit of the doubt.
A few days ago, I was working on the laptop at night as usual. Pushing out a few changes from my block git repository, git complained of a corrupt pack file. The pack file in question was from when I lasted repacked the repo back in February. It's read-only, about 380MB in size, and thus hasn't been written to since it was created some 4 months ago. I usually don't keep backups of my laptop data, since it's just a development environment and all my source is safe with git on a public server. As it just so happened, I had tested the new btrfs format a week earlier. In doing so, Chris asked me to keep an image of the drive so we could debug any potential problems with the new format. So I went and fetched the pack file from the backup and compared the two. The backup file was, as expected, fine. Looking into the nature of the corruption (basically finding out who to blame for the corruption), I found out that the corruption started 64519680 bytes into the file. So that's nicely 512b aligned, but not 4kb aligned. The corruption spanned 16KB in total. So far, so good. What I found out next was even more interesting: every other byte in file was correct, every other byte was 0xff!
That type of corruption just reeks of drive problems. I reported this issue to OCZ about a week ago. First level support quickly replied and passed the issue on to the engineers, but I have yet to hear anything from that side. I've kept the drive as-is if they want to inspect it. I'm not keeping my hopes up though, and I'm glad I'm not the OCZ engineer tasked with fixing it.
Meanwhile I put the other Vertex in the laptop and recreated my git tree. No issues seen so far, but suffice to say that my confidence level in these drives aren't that high. I'll be keeping backups if I put anything interesting on the laptop!
May. 20th, 2009
01:43 pm - pdflush epitaph
It seems it has been about 2 months since I last posted here. That's not due to lack of kernel activity, though real life has interfered a bit with the addition of one more son to the family.
The patch set has been undergoing some changes since I last posted. One is the ability to have more than one thread per backing device. This is supposed to be useful for extreme cases where a single CPU cannot keep up with a very fast device. I have yet to actually test this part, I'm hoping some of the interested parties will join the fun and add the file system related code that enables placement and flushing of dirty inodes on several writeback threads per bdi. Another change is lazy create/exit of flusher threads. pdflush has 2-8 threads running depending on what mood it is in. With the per-bdi flusher threads, they will not get created unless they are going to be working. If they have been idle for some time, they will exit again. So this should more smoothly respond to actual system demands, not much point in having 100 idle threads for 100 disks, if only a fraction of those disks are actually writeback busy in a period of time.
I've also done a bit of testing this week, results look pretty good. Most show the new approach reaching similar performance but at a lower system utilization percentage, or higher performance. So that's all good. Yanmin Zhang (Intel) ran into a bug (that may or may not already be fixed, I'll know tomorrow when tests are run with new patches) and posted a fio job file that he reproduced it with. I decided to run the test with and without the writeback patches to compare results. The disk used is an 32G Intel X25-E SSD and the file system is ext4.
| Kernel | Throughput | usr CPU | sys CPU | disk util |
| writeback | 175MB/sec | 17.55% | 43.04% | 97.80% |
| vanilla | 147MB/sec | 13.44% | 47.33% | 85.98% |
Pretty decent result, I'd say. Apart from the lower system utilization, the interesting bit is how the writeback patches actually enable us to keep the disk busy. ~86% utilization for the vanilla kernel is pretty depressing. The fio job file used was:
[global]
direct=0
ioengine=mmap
iodepth=256
iodepth_batch=32
size=1500M
bs=4k
pre_read=1
overwrite=1
numjobs=1
loops=5
runtime=600
group_reporting
directory=/data
[job_group0_sub0]
exec_prerun="echo 3 > /proc/sys/vm/drop_caches"
startdelay=0
rw=randwrite
filename=f1:f2
The next few days will be spend with polishing the patch set and posting version 5. That one should hopefully be ready for inclusion in the -next tree, and then be headed upstream for 2.6.31.
Oh, and that Intel disk kicks some serious ass. For sequential writes, it maintains 210MB/sec easily. I have a few OCZ Vertex disks as well which do pretty well for sequential writes too, for random writes the Intel drive is in a different league though. For my birthday, I want 4 more Intel disks for testing!
Mar. 12th, 2009
03:43 pm - Burying pdflush in the back yard
So this week I began playing with implementing an alternative approach to buffered writeback. Right now we have the pdflush threads taking care of this, but that has a number of annoying points that made me want to try something else. One is that writeout tends to be very lumpy, which is easily visible in vmstat. Another is that it doesn't play well with the request allocation scheme, since pdflush backs off when a queue becomes congested. And fairness between congested users and blocking users is... well not there.
Enter bdi threads. The first step was moving the dirty inodes to some place where bdi threads could easily get at them. So instead of putting them on the super_block lists, we put them on the bdi lists of similar names. One upside of this change is also that now we don't have to
do a linear search for the bdi, we have it upfront. The next step is forking a thread per bdi that does IO. My initial approach simply created a kernel thread when the bdi was registered, but I'm sure that lots of people will find that wasteful. So instead it now registers a forker thread on behalf of the default backing device (default_backing_dev_info), which takes care of creating the appropriate threads when someone calls bdi_start_writeback() on a bdi. It'll handle memory pressure conditions as well, find the details in the patch set.
Initial tests look pretty good, though I haven't done a whole lot of testing on this yet. It's still very fresh code. I posted it on lkml today, you can find the individual patches and complete description there. As always, the patches are also in my git repo. Find them in the writeback branch here.
Jan. 19th, 2009
03:19 pm - Buffered async IO
We have this infrastructure in the kernel for doing asynchronous IO, which sits in fs/aio.c and fs/direct-io.c mainly. It works fine and it's pretty fast. From userspace you would link with -laio and use the functions there for hooking into the syscalls. However, it's only really async for direct uncached IO (files opened with O_DIRECT). This last little detail essentially means that it's largely useless to most people. Oracle uses it, and other databases may as well. But nobody uses aio on the desktop or elsewhere simply because it isn't a good fit when O_DIRECT is required - you then need aligned IO buffers and transfer sizes and you lose readahead. The alignment and size restrictions also make it difficult to convert existing apps to use libaio with O_DIRECT, since it requires more than just a straight forward conversion. It adds complexity.
Over the years, several alternative approaches to async IO have been proposed/developed. We've had several variants of uatom/syslet/fibril/threadlet type schemes, which all boil down to (just about) the same type of implementation - when a process is about to block inside the kernel, a cloned process/thread returns to userspace on behalf of the original process and informs it of the postponed work. The completed events can then later be retrieved through some sort of get_event/wait_event type interface, similar to how you reap completions with other types of async IO. Or the implementation provided a callback type scheme, similar to how eg a signal would be received in the process. This sort of setup performed acceptably, but suffered from the schizophrenic disorder of split personalities due to the change in personalities on return from the kernel. Various approaches to "fix" or alleviate this problem weren't particularly pretty.
Recently, Zach Brown started playing with something he calls acall. The user interface is pretty neat, and (like the above syslet etc like implementations), it allows for performing all system calls in an async nature. Zach took a more simplified approach for making it async in the kernel, by punting every operation to a thread in the kernel. This obviously works and means that the submission interface is very fast, which is of course a good thing. It also means that some operations are going to be performed by someone else than the process that requested the operation, which has implications for IO scheduling in the system. Advanced IO schedulers like CFQ tracks IOs on a per-process basis, and they then need the specific process context for both performance and accounting reasons. Last year I added a CLONE_IO flag for sharing the IO context across processes for situations like this, so this part is actually easily fixable by just using that clone flag for creation of the kernel worker thread. Obviously, doing a fork() like operation for every async system call isn't going to scale, so some sort of thread pooling must be put in place for speeding it up. Not sure what Zach has done there yet (I don't think a version with that feature has been released yet), but even with that in place there's still going to be identity fiddling when a thread is taking over work from a user space process. Apart from the overhead of juggling these threads, there's also going to be a substantial increase in context switch rates with this type of setup. And this last bit is mostly why I don't think the acall approach will end up performing acceptably for doing lots of IO, while it does seem to be quite nice for the more casual async system call requirements.
A third and final possibility also exists, and this is what I have been trying to beat into submission lately. Back in the very early 2.6 kernel days, Suparna Bhattacharya led an effort to add support for buffered IO to the normal fs/aio.c submission path. The patches sat in Andrews -mm tree for some time, before they eventually got dropped due to Andrew spending too much time massaging them into every -mm release. They work by replacing the lock_page() call done for waiting IO completion to a page with another helper function - I call it lock_page_async() in the current patches. If the page is still locked when this function is called, we return -EIOCBRETRY to the caller, informing him that he should retry this call later on. When a process in the kernel wants to wait for some type of event, a wait queue is supplied. When that event occurs, the other end does a wake up call on that wait queue. This last operation invokes a callback in the wait queue, which normally just does a wake_up_process() like call to wake the process blocked for this event. With the async page waiting, we simply provide our own wait queue in the process structure and the callback given from fs/aio.c then merely adds the completed event to the aio ringbuffer associated with the IO context for that IO and process.
This last approach means that we have to replace the lock_page() calls in the IO path with lock_page_async() and be able to handle an "error" return from that function. But apart from that, things generally just work. The original process is still the one that does the IO submission, thus we don't have to play tricks with identity thefts or pay the large context switch fee for each IO. We also get readahead. In other words, it works just like you expect buffered IO to work. Additionally, the existing libaio interface works for this as well. Currently my diffstat looks like this:
19 files changed, 445 insertions(+), 153 deletions(-)
for adding support for the infrastructure, buffered async reads, and buffered async writes for ext2 and ext3. That isn't bad, imho.
Initial performance testing looks encouraging. I'll be back with more numbers and details, as progress and time permits!
Dec. 19th, 2008
10:01 am - NAPI like approach for block devices
In continuing the quest for higher performance on fast IO devices, I did a quick'n dirty NAPI like implementation for block devices. Like the networking equivalent, it punts work to a softirq and does some polling from that to reap further events from the device. The benefit here being reaping more completion events per hardware interrupt, while also making the interrupt path faster. Initial results were encouraging, I got a few percent higher IOPS rate but at a higher sys time cost. So lets look into why that is.
Now, Linux IO completions already use a softirq for handling the bottom half part of the processing. In crappy ascii callgraph layout, it looks something like the below. The example uses AHCI to give you a full path from start to finish, other drivers vary slightly at the start of course.
IRQ triggers
ahci_interrupt()
ahci_port_intr()
ata_qc_complete_multiple()
ata_qc_complete()
qc->complete_fn() (ata_scsi_qc_complete() for normal IO)
command->scsidone() (scsi_done() for normal IO)
blk_complete_request()
blk_complete_request() triggers a call of BLOCK_SOFTIRQ and this completes the top half of the IO processing. Now the softirq triggers:
BLOCK_SOFTIRQ triggers
scsi_softirq_done()
scsi_finish_command()
scsi_io_completion()
scsi_end_request()
blk_end_request()
And here blk_end_request() does the bio unrolling and completes all parts of the bio's in the request, calls the registered bio completion notifier, and finally completes the request and frees it.
The NAPI prototype registered a private softirq and hooked in at the very top of the callgraph, altering it to look something like this:
IRQ triggers
ahci_interrupt()
blk_napi_sched_port() (arms BLOCK_NAPI_SOFTIRQ)
BLOCK_NAPI_SOFTIRQ triggers
blk_napi_softirq()
ahci_port_intr()
ata_qc_complete_multiple()
ata_qc_complete()
...
and so on, follows the same calltrace as the original example. Then it hits blk_complete_request() and another softirq is then raised to handle the original bottom half. blk_napi_softirq() will call into ahci_port_intr() until it does no work, or its budget of work has been used up.
This is indeed suboptimal, since we now have to raise two softirqs for completions. The reason it's still a win is because we handle a lot more than 1 request per hard interrupt, but there's still some ground to be gained here. Next I'll work on making the top half really small and using just the single softirq, essentially making the top half the same as the NAPI one listed above, but eliminating the softirq from the NAPI bottom half and just completing everything from the BLOCK_NAPI_SOFTIRQ path.
Dec. 16th, 2008
10:23 am - Sorting algorithms
| Algorithm | It | Et | Tt |
| AVL | 1.16 | 2.13 | 1.50 |
| Fibheap | 0.57 | 4.05 | 1.82 |
| Skiplist | 2.15 | 2.67 | 2.34 |
| List | 1.56 | 0.18 | 1.07 |
| Rbtree | 1.00 | 1.00 | 1.00 |
| Algorithm | It | Et | Tt |
| AVL | 0.92 | 2.70 | 1.28 |
| Fibheap | 0.52 | 7.19 | 1.87 |
| Skiplist | 1.82 | 3.29 | 2.11 |
| List | 0.26 | 0.31 | 0.27 |
| Rbtree | 1.00 | 1.00 | 1.00 |
To start from the bottom, it's no real surprise that a list insert is the fastest one for data that arrives pretty much completely sorted. The insertion will only be a check of the back of the list followed by an insert, a O(1) operation. Extraction is always O(1) from lists, so extraction will always be impossible to beat. AVL is pretty close to rbtrees in performance, but loses on all accounts. So we can rule that one out. Ditto skiplists, they always perform worse. The fibonacci heap does exceptionally well on insert always, but extraction is slooow. It's also pretty complex implementation wise, so we'd need a lot of justification to add something like that to the block layer.
It appears that the previous choice of rbtrees wasn't totally off. They perform acceptably for sequential IO and well for random IO. If we enlarge the request pool, rbtrees would do much better than list insert for random IO. Memory consumption is also about the same. The node size for rbtrees is sizeof(unsigned long) bigger than for the list, close enough that it doesn't matter in real life.
So I don't plan to change from rbtrees anytime soon. I may consider a hybrid structure of some sort that just avoids doing a lot of work for sequential IO, since we can detect that. Then we only do real sorting on more random input, and then the rbtree will always be a win. For non-rotating devices, a non-sorting list will be utilized. This should be a big win for high IOPS rate SSD like devices.
Dec. 10th, 2008
03:13 pm - Oracle and Linux
It's no secret that I work for Oracle and have been doing so for over 2 years now. I post emails with my Oracle email address every day and every commit that I put in the Linux kernel is tagged and signed-off with that email as well. So why am I writing this? Well, something has been annoying me for quite some time now - why does Oracle not get the recognition it deserves in the Linux development community? Most of the time when I see Oracle Linux projects mentioned in some article, there's always some snide comment on that particular association.
Oracle pays for all the work I do for the kernel. Some of that is directly related to what Oracle wants to have improved in the kernel, which is actually why I went to work for them in the first place. I mainly work on IO related projects, which is something that Oracle has a great interest in (I'm sure that's not a surprise to anyone!). But quite a lot of work is also very geared towards just making Linux better in a variety of ways and areas. I spend time on general block layer maintenance, integrating and testing patches from other contributors which isn't necessarily directly beneficial to my employer. And I'm not the only mainline kernel contributor. Chris Mason is spending a LOT of time working on btrfs, which promises to be a great next generation file system for Linux in general. Zach Brown is working on CRFS which looks like it's going to be a very interesting cache coherent remote file system. Randy Dunlap does a lot of janitorial, testing, and documentation work. There are others as well, I'm only mentioning the most community known people here.
Even the kernel commit/changes statistics that are posted every now and then show Oracle being safely in the top-10 as a contributor, yet they are routinely forgotten when people think of the big contributors. It's just a mystery to me, I wish our Linux branding was better!
Dec. 9th, 2008
01:26 pm - SSD optimizations
A few months ago I got my hands on a sample of the Intel X25-E SSD drives for testing purposes. There are lots of interesting things to be said about SSD vs rotational devices, but my main interest in these devices is largely that they offer a good test base for high IOPS rates testing. So yes, the device read/write bandwidth definitely kicks a lot of ass, in fact so much that it's the first SSD out there that I would recommend to other people. I've played with various other models in the past, even expensive ones. And while they do have nice read performance, they fall flat on their face when presented with random write workloads. Additionally, apart from the flash itself, the drive internals are really outdated - most are using SATA-I 1.5gbps link speeds, and none of them offer any kind of command queuing. What are they thinking?
Anyway, back to the topic at hand. As my initial primary goal with this device was increasing the IOPS rate of the Linux IO stack, testing was mainly done with O_DIRECT libaio. As with most of my testing, I tend to use fio for quickly setting up a prototype workload. I wanted to use small block sizes for various reasons. This brings the IOPS rate up, thus better high lighting problem areas in Linux that would tend to be hidden more with larger block sizes. My concern with O_DIRECT and small block sizes is that get_user_pages() overhead would dominate the per-command overhead, however those concerns turned out to be completely unfounded after Nick Piggins fast path get_user_pages() patches have gone in. That is nice, since I didn't really want to spend my time optimizing the VM for this!
The first bottleneck that reared its ugly head was hpet read in the kernel. It was so large that switching to a tsc based clock source for the kernel increased the IOPS rate by over 10%. After a bit of head scratching, I started looking at how many gettimeofday() calls that fio generates for this workload - a LOT. Fio wants to tell you pretty much everything about IO latencies, both going in to the kernel, at the device end, and reaping them. This in turn generates a lot of gettimeofday() traffic. A few hacks and options were put into fio to diminish the number of gtod calls and things looked much better with the default clock source.
To step back a bit before going further, one usually has a preconceived notion of a few changes that'll speed things up before doing this type of testing. One such thing was the IO plug management in Linux. Plugging is meant to optimize the command size for the device, at the expense of a bit of latency on submission. This makes a lot of sense for rotational storage, not so much for SSD devices. Turns out that disabling plugging gains us about 1% on just this device, when doing 30k-40k IOPS.
After disabling plugging, I went further with the profiling. The top entry is (not surprising) the ahci interrupt handler. There are ways to speed it up a bit without going too far, but my goal was IO stack reduction in general, so I left it alone for now. Next on the list was slab allocations. When issuing a piece of IO in Linux, we do quite a few allocations along the way. First we allocate a bio, then we allocate a bio_vec list to fill pages into. The bio_vec allocations come from a number of slab pools, sized in powers of two (1 entry, 4, 16, 64, 128, and 256). Then a request is allocated to attach this bio to. Going into the driver layers, SCSI allocates a SCSI command and sense buffer before sending it to the device. So that's 5 allocations just for the IO request, and there can easily be more from files ystems etc. We can't get away with not allocating a bio, but we can eliminate at least some of bio_vec allocations fairly easily. Most of the IO issued in a system is fairly small, in the range of 4 - 16kb. If we embed the bio_vec into the bio for a small number of pages, we can get rid of this allocation. To take that step a bit further, I wanted to incorporate a related change that Chris Mason had previously voiced an interest in. When a file system allocates a bio, it typically allocates a private structure for information related to that piece of IO. So the patch set in full allows for a piece of memory to be reserved both at the front of the bio (for file systems) and at the end (for the bio_vec). This then combines these three allocations into one. A similar trick was done for the SCSI command, so that the sense buffer is allocated at the very end as well. Finally, I added a struct request allocation cache to avoid going into the memory allocator there as well. The end result is that we gained 3-4% for this workload.
Another entry that was high in the profile list was lookup_ioctx(). This is what finds the kernel aio context for a given process when that process does an aio io_submit() to submit a piece of IO. When I read the kernel implementation, several red lights whent off in my head. The lookup was a doubly linked list, guarded by a read/write spinlock. While the O(n) choice of a linked list may seem problematic, there's usually only a single entry on this list since processes generally do not set up a lot of IO contexts to submit IO against. But experience tells me that reader/write locks are usually trouble, as they are much slower than a normal spinlock. Personally I think the use of them is usually a design mistake and that it would be better if we removed them from the kernel! The list management in aio was opencoded, so I converted that to a hlist structure and protected it with RCU. Using RCU here makes a lot of sense, since the list is basically only manipulated when the IO context is setup or torn down - for the long duration of actually submitting IO, it's purely a reader side thing. This got us about 2% gain on even a puny 2-way system.
At this years kernel summit, I had a talk with Matthew Wilcox about a recent change to the IO accounting that Intel had found troublesome. In the 2.4 days, we had per-partition statistics on the IO request, while for 2.6 kernels we had dumped that feature. About a year ago that feature was reinstated. The partition lookup scans the kernel partition table to find the right partition to account, and if that partition number is in the higher range, we end up spending a bit of time doing this for every IO. You don't notice if you are primarily doing IO to sda1 or sda2, but if sda16 is hit a lot it starts to show. I added a one-hit cache in front of this lookup and got rid of most of that lookup time. This trick is similar to what we do for IO merging, and it works exceptionally well. The reason being that while you typically have more than one partition active for IO at any given point in time, submissions tend to come in batches. If these batches are large enough, we get a lot of hits in the one-hit cache before having to invalidate it. So this again got us a few percent of improvement for this type of scenario.
I'll stop detailing the optimizations here and save some for future blog entries, there's still lots of improvements to be made and I have lots of stuff in-progress that I hope will be very interesting. And I haven't even gotten to buffered IO yet! If you are curious about the above changes, you are encouraged to inspect the for-2.6.29 branch of my block git repo. Find that here. In the ssd branch there are more experimental changes that will need a bit longer to mature.
