Last Day at Oracle

Today is a strange day. It's officially my last day at Oracle, but
since I handed in everything on Friday, I already feel disconnected
from the team. Test equipment is gone and without the VPN box, network
connectivity has also been shut down. When you work from at home
every day, internal irc becomes the virtual office. So while I'm
sitting in the same chair that I do every day, it still feels different.

I joined Oracle in October 2006 and I've enjoyed working there
very much. It's a great place to be and I can highly recommend it
to others. The Linux kernel group is compromised of some really
good guys, I will miss them a lot. Thankfully this is the world
of open software, so I know that I'll be seeing them at
conferences in the future.

Tomorrow I will join Fusion-IO and enter the world of fast flash
storage. I'll be keeping my Linux kernel hat on, so from a community
stand point not much will change. I'm excited about also working
on a real physical product that you can hold in your hand. I will
now finally be able to show my sons what it is that dad is working on,
which is a great feeling. And family and friends, for that matter.
Or perhaps not, but at least a pci-express card is more tangible
than esoteric bits of software that you cannot even see when
you're in front of the computer.

I'm sure it'll be an exciting time!

F5100 and IOPS

I finally got everything setup and wired up. The box now has 80 modules installed and I'm using 4 SAS controllers which each connects to two different expanders. The last part is the most problematic, as the 3801 LSI controllers max out at roughly 110K IOPS. That's pretty lame, and I hear the newer controllers do > 300K IOPS which is more reasonable. Not really a huge surprise, I've been calling HBA firmware piggy and slow for years, seems the SSD craze finally got them kicked into gear. But it's a shame, since that leaves 75% of the available IOPS on the floor. I have 2 more controllers coming in (thanks Eric!), so that'll up it a bit, at least.

To test the IOPS rate, I wrote a simple fio job file that does 4K random O_DIRECT reads at depth 32. Initial results were pretty shabby. I got ~200K with a number of devices, adding more made it drop down to ~150K. I was expecting 400K at least, so this was a worry. Some quick profiling didn't show much of interest: some locking overhead, but it is hard to quantify just how much. The test box has 32 cores / 64 threads, which is really nice for testing, but sometimes makes profiling a bit more difficult since the high CPU count has a tendency to mask some issues.

Booting with only 4 CPUs enabled was much better; I got 430K IOPS easily. Interestingly, the rq completion affinity knob (which I've blogged about before, merged in 2.6.29 and enabled by default in 2.6.32) makes a big difference. Disabling that and the IOPS rate drops to ~250K.

Now to find out why we suck at 64 CPUs...

New storage toy

Since I do a lot of work on the IO side of the Linux kernel, I have an assortment of various types of disk drives and storage. I've been lucky enough to receive various types of SSD drives, which enabled me to research and test new ideas for these types of devices. I love doing performance work and squeezing the most out of what you have, and the performance characteristics of SSD devices rekindled that interest.

Earlier this week I received a new storage toy that'll take this to a whole new level - a Sun F5100 flash array. This beast packs 80 SLC flash modules (or FMods, in Sun speak) into a 1U box. Each module is essentially identical to the ones on the F20 card, which I already have used a lot for testing. It's 24GB in size, has a sequential throughput of approximately 200MB/sec reads and 150MB/sec writes, or 25K/15K IOPS. That's pretty fast by itself, multiply that by 80 and you are now burning holes in the fabric of time - 1.6 million IOPS, or well over 10GB/sec throughput.



Internally the modules are connected through SAS expanders, of which there are 4 on the back, each with 4 ports. Ideally you'd connect 16 controllers to reach the full potential of the box, setting up zoning to drive 5 modules per port. The box essentially has its own UPS with 4 energy modules at the front, so you never need to use barriers on this device. And it's running surprisingly quiet, which is always a big plus for things you have a few meters away from the desk :-)

Currently the box has 40 modules installed, 40 more are on the way. I have just a single SAS controller, more are on their way as well. Once the additional controllers arrive, I'll be putting it through the paces and posting some numbers. I have some soft multi-queue block IO support in progress that has been waiting for a potent device to be unleashed. This will be interesting...

LPC 2009 talk

I gave a talk on the new per-bdi writeback threads at the Linux Plumbers Conference today, which I think went pretty well. Since I've had various people ask me for the slides, here they are.

Cheaper SSD reliability, continued

So after twice promising me to get info from 'an engineer' in a 10 day time span, I pushed OCZ again today. The answer is that it's likely "bad blocks" on the drive and they offered to exchange it. Now, I don't know the internal secret sauce to their flash chip striping, but a bad flash block may explain the issue. And personally I care a lot less about this specific drive than the larger issue at hand, which is: Can we trust these SSD drives? From this experience, the answer so far is unequivocally no. If the flash block/page/erase block is indeed part (or partially bad), I want to know! Don't just send me all ones in the data, that's not acceptable. If their drive/fw doesn't error handling, I'm quite sure the customers would like to know this fact.

Apparently Indilinx does the firmware for these drives for at least two manufacturers. Given that the SSD consumer market is steaming ahead at this point, I'm guessing there's a huge rush to reduce the time to market. A safe bet would  be that the firmware is perhaps a little too rushed in this case. Coincidentally, the bad drive is running fw 1.10. Version 1.30 lists this little juicy fix among the others: "Read fail handling".  I'll try 1.30 on this drive and re-read the data, just to see what happens.

For the time being, I can't recommend using Indilinx based drives anywhere except for throw away data. If they can't even tell you when the data has gone bad, then they really can't be used for much else. At least use btrfs with data checksums enabled, then you could catch a problem like this. Yes I did run btrfs on my Vertex, and yes I did disabled datacsum to avoid the extra CPU use on my laptop... My personal recommendation would be to stick with Intel or Samsung SSD drives where data matters.

Cheaper SSD reliability?

In earlier blog entries, I praised the Intel X25-E for its performance. I also have high hopes for the reliability of the drive. By reliability, I refer to data integrity as well as endurance. It's not that I have much information to back this up, but I know that Intel have put a lot of testing into them.

So while the X25-E is extremely nice, it's also very pricey. Recently I needed a few more SSD's for testing purposes, and despite public begging on this blog, Intel hasn't sent me any more drives. As I'm sure most are well aware, the cheaper SSD drives were mostly utter crap. Even most expensive SSD drives have been crap, mostly due to using that infamous JMicron flash controller that would have done more good as nice sand on the beach instead of being manufactured into silicon. Now there are other alternatives though, and the Indilinx controller looked like a good option. OCZ recently introduced a Vertex series that uses this controller. Not only does it perform better, it's also not 80s ATA tech. It has NCQ and TRIM support, which is very nice.

I went out and bought a few of these for testing. One I put in the laptop and the other in a test box. Performance is good, even random 4kb writes actually work. This is where the crap SSD's fall apart. However, as opposed to the Intel drive, I didn't have a lot of faith in the reliability of these drives. Early firmwares were plagued with errors, and even the just released v1.30 firmware fixes issues that seem like rather basic functionality. An example of that would be mishandling ATA commands with zero sector count. But I decided to give them the benefit of the doubt.

A few days ago, I was working on the laptop at night as usual. Pushing out a few changes from my block git repository, git complained of a corrupt pack file. The pack file in question was from when I lasted repacked the repo back in February. It's read-only, about 380MB in size, and thus hasn't been written to since it was created some 4 months ago. I usually don't keep backups of my laptop data, since it's just a development environment and all my source is safe with git on a public server. As it just so happened, I had tested the new btrfs format a week earlier. In doing so, Chris asked me to keep an image of the drive so we could debug any potential problems with the new format. So I went and fetched the pack file from the backup and compared the two. The backup file was, as expected, fine. Looking into the nature of the corruption (basically finding out who to blame for the corruption), I found out that the corruption started 64519680 bytes into the file. So that's nicely 512b aligned, but not 4kb aligned. The corruption spanned 16KB in total. So far, so good. What I found out next was even more interesting: every other byte in file was correct, every other byte was 0xff!

That type of corruption just reeks of drive problems. I reported this issue to OCZ about a week ago. First level support quickly replied and passed the issue on to the engineers, but I have yet to hear anything from that side. I've kept the drive as-is if they want to inspect it. I'm not keeping my hopes up though, and I'm glad I'm not the OCZ engineer tasked with fixing it.

Meanwhile I put the other Vertex in the laptop and recreated my git tree. No issues seen so far, but suffice to say that my confidence level in these drives aren't that high. I'll be keeping backups if I put anything interesting on the laptop!

pdflush epitaph

It seems it has been about 2 months since I last posted here. That's not due to lack of kernel activity, though real life has interfered a bit with the addition of one more son to the family.

The patch set has been undergoing some changes since I last posted. One is the ability to have more than one thread per backing device. This is supposed to be useful for extreme cases where a single CPU cannot keep up with a very fast device. I have yet to actually test this part, I'm hoping some of the interested parties will join the fun and add the file system related code that enables placement and flushing of dirty inodes on several writeback threads per bdi. Another change is lazy create/exit of flusher threads. pdflush has 2-8 threads running depending on what mood it is in. With the per-bdi flusher threads, they will not get created unless they are going to be working. If they have been idle for some time, they will exit again. So this should more smoothly respond to actual system demands, not much point in having 100 idle threads for 100 disks, if only a fraction of those disks are actually writeback busy in a period of time.

I've also done a bit of testing this week, results look pretty good. Most show the new approach reaching similar performance but at a lower system utilization percentage, or higher performance. So that's all good. Yanmin Zhang (Intel) ran into a bug (that may or may not already be fixed, I'll know tomorrow when tests are run with new patches) and posted a fio job file that he reproduced it with. I decided to run the test with and without the writeback patches to compare results. The disk used is an 32G Intel X25-E SSD and the file system is ext4.

KernelThroughput
usr CPUsys CPUdisk util
writeback175MB/sec17.55%43.04%97.80%
vanilla147MB/sec13.44%47.33%85.98%

Pretty decent result, I'd say. Apart from the lower system utilization, the interesting bit is how the writeback patches actually enable us to keep the disk busy. ~86% utilization for the vanilla kernel is pretty depressing. The fio job file used was:

[global]
direct=0
ioengine=mmap
iodepth=256
iodepth_batch=32
size=1500M
bs=4k
pre_read=1
overwrite=1
numjobs=1
loops=5
runtime=600
group_reporting
directory=/data

[job_group0_sub0]
exec_prerun="echo 3 > /proc/sys/vm/drop_caches"
startdelay=0
rw=randwrite
filename=f1:f2


The next few days will be spend with polishing the patch set and posting version 5. That one should hopefully be ready for inclusion in the -next tree, and then be headed upstream for 2.6.31.

Oh, and that Intel disk kicks some serious ass. For sequential writes, it maintains 210MB/sec easily. I have a few OCZ Vertex disks as well which do pretty well for sequential writes too, for random writes the Intel drive is in a different league though. For my birthday, I want 4 more Intel disks for testing!

Burying pdflush in the back yard

So this week I began playing with implementing an alternative approach to buffered writeback. Right now we have the pdflush threads taking care of this, but that has a number of annoying points that made me want to try something else. One is that writeout tends to be very lumpy, which is easily visible in vmstat. Another is that it doesn't play well with the request allocation scheme, since pdflush backs off when a queue becomes congested. And fairness between congested users and blocking users is... well not there.

Enter bdi threads. The first step was moving the dirty inodes to some place where bdi threads could easily get at them. So instead of putting them on the super_block lists, we put them on the bdi lists of similar names. One upside of this change is also that now we don't have to
do a linear search for the bdi, we have it upfront. The next step is forking a thread per bdi that does IO. My initial approach simply created a kernel thread when the bdi was registered, but I'm sure that lots of people will find that wasteful. So instead it now registers a forker thread on behalf of the default backing device (default_backing_dev_info), which takes care of creating the appropriate threads when someone calls bdi_start_writeback() on a bdi. It'll handle memory pressure conditions as well, find the details in the patch set.

Initial tests look pretty good, though I haven't done a whole lot of testing on this yet. It's still very fresh code. I posted it on lkml today, you can find the individual patches and complete description there. As always, the patches are also in my git repo. Find them in the writeback branch here.

Buffered async IO

We have this infrastructure in the kernel for doing asynchronous IO, which sits in fs/aio.c and fs/direct-io.c mainly. It works fine and it's pretty fast. From userspace you would link with -laio and use the functions there for hooking into the syscalls. However, it's only really async for direct uncached IO (files opened with O_DIRECT). This last little detail essentially means that it's largely useless to most people. Oracle uses it, and other databases may as well. But nobody uses aio on the desktop or elsewhere simply because it isn't a good fit when O_DIRECT is required - you then need aligned IO buffers and transfer sizes and you lose readahead. The alignment and size restrictions also make it difficult to convert existing apps to use libaio with O_DIRECT, since it requires more than just a straight forward conversion. It adds complexity.

Over the years, several alternative approaches to async IO have been proposed/developed. We've had several variants of uatom/syslet/fibril/threadlet type schemes, which all boil down to (just about) the same type of implementation - when a process is about to block inside the kernel, a cloned process/thread returns to userspace on behalf of the original process and informs it of the postponed work.  The completed events can then later be retrieved through some sort of get_event/wait_event type interface, similar to how you reap completions with other types of async IO. Or the implementation provided a callback type scheme, similar to how eg a signal would be received in the process.  This sort of setup performed acceptably, but suffered from the schizophrenic disorder of split personalities due to the change in personalities on return from the kernel. Various approaches to "fix" or alleviate this problem weren't particularly pretty.

Recently, Zach Brown started playing with something he calls acall. The user interface is pretty neat, and (like the above syslet etc like implementations), it allows for performing all system calls in an async nature. Zach took a more simplified approach for making it async in the kernel, by punting every operation to a thread in the kernel. This obviously works and means that the submission interface is very fast, which is of course a good thing. It also means that some operations are going to be performed by someone else than the process that requested the operation, which has implications for IO scheduling in the system. Advanced IO schedulers like CFQ tracks IOs on a per-process basis, and they then need the specific process context for both performance and accounting reasons. Last year I added a CLONE_IO flag for sharing the IO context across processes for situations like this, so this part is actually easily fixable by just using that clone flag for creation of the kernel worker thread. Obviously, doing a fork() like operation for every async system call isn't going to scale, so some sort of thread pooling must be put in place for speeding it up. Not sure what Zach has done there yet (I don't think a version with that feature has been released yet), but even with that in place there's still going to be identity fiddling when a thread is taking over work from a user space process. Apart from the overhead of juggling these threads, there's also going to be a substantial increase in context switch rates with this type of setup. And this last bit is mostly why I don't think the acall approach will end up performing acceptably for doing lots of IO, while it does seem to be quite nice for the more casual async system call requirements.

A third and final possibility also exists, and this is what I have been trying to beat into submission lately. Back in the very early 2.6 kernel days, Suparna Bhattacharya led an effort to add support for buffered IO to the normal fs/aio.c submission path. The patches sat in Andrews -mm tree for some time, before they eventually got dropped due to Andrew spending too much time massaging them into every -mm release. They work by replacing the lock_page() call done for waiting IO completion to a page with another helper function - I call it lock_page_async() in the current patches. If the page is still locked when this function is called, we return -EIOCBRETRY to the caller, informing him that he should retry this call later on. When a process in the kernel wants to wait for some type of event, a wait queue is supplied. When that event occurs, the other end does a wake up call on that wait queue. This last operation invokes a callback in the wait queue, which normally just does a wake_up_process() like call to wake the process blocked for this event. With the async page waiting, we simply provide our own wait queue in the process structure and the callback given from fs/aio.c then merely adds the completed event to the aio ringbuffer associated with the IO context for that IO and process.

This last approach means that we have to replace the lock_page() calls in the IO path with lock_page_async() and be able to handle an "error" return from that function. But apart from that, things generally just work.  The original process is still the one that does the IO submission, thus we don't have to play tricks with identity thefts or pay the large context switch fee for each IO. We also get readahead. In other words, it works just like you expect buffered IO to work. Additionally, the existing libaio interface works for this as well. Currently my diffstat looks like this:

    19 files changed, 445 insertions(+), 153 deletions(-)

for adding support for the infrastructure, buffered async reads, and buffered async writes for ext2 and ext3. That isn't bad, imho.
Initial performance testing looks encouraging. I'll be back with more numbers and details, as progress and time permits!

NAPI like approach for block devices

In continuing the quest for higher performance on fast IO devices, I did a quick'n dirty NAPI like implementation for block devices. Like the networking equivalent, it punts work to a softirq and does some polling from that to reap further events from the device. The benefit here being reaping more completion events per hardware interrupt, while also making the interrupt path faster. Initial results were encouraging, I got a few percent higher IOPS rate but at a higher sys time cost. So lets look into why that is.

Now, Linux IO completions already use a softirq for handling the bottom half part of the processing. In crappy ascii callgraph layout, it looks something like the below. The example uses AHCI to give you a full path from start to finish, other drivers vary slightly at the start of course.

IRQ triggers
    ahci_interrupt()
        ahci_port_intr()
            ata_qc_complete_multiple()
                ata_qc_complete()
                    qc->complete_fn() (ata_scsi_qc_complete() for normal IO)
                        command->scsidone() (scsi_done() for normal IO)
                            blk_complete_request()

blk_complete_request() triggers a call of BLOCK_SOFTIRQ and this completes the top half of the IO processing. Now the softirq triggers:

BLOCK_SOFTIRQ triggers
    scsi_softirq_done()
        scsi_finish_command()
            scsi_io_completion()
                scsi_end_request()
                    blk_end_request()

And here blk_end_request() does the bio unrolling and completes all parts of the bio's in the request, calls the registered bio completion notifier, and finally completes the request and frees it.

The NAPI prototype registered a private softirq and hooked in at the very top of the callgraph, altering it to look something like this:

IRQ triggers
    ahci_interrupt()
        blk_napi_sched_port() (arms BLOCK_NAPI_SOFTIRQ)

BLOCK_NAPI_SOFTIRQ triggers
    blk_napi_softirq()
        ahci_port_intr()
            ata_qc_complete_multiple()
                ata_qc_complete()
                    ...

and so on, follows the same calltrace as the original example. Then it hits blk_complete_request() and another softirq is then raised to handle the original bottom half. blk_napi_softirq() will call into ahci_port_intr() until it does no work, or its budget of work has been used up.

This is indeed suboptimal, since we now have to raise two softirqs for completions. The reason it's still a win is because we handle a lot more than 1 request per hard interrupt, but there's still some ground to be gained here. Next I'll work on making the top half really small and using just the single softirq, essentially making the top half the same as the NAPI one listed above, but eliminating the softirq from the NAPI bottom half and just completing everything from the BLOCK_NAPI_SOFTIRQ path.