Jens Axboe's blog - Buffered async IO
Jan. 19th, 2009
03:19 pm - Buffered async IO
We have this infrastructure in the kernel for doing asynchronous IO, which sits mainly in fs/aio.c and fs/direct-io.c. It works fine and it's pretty fast. From userspace you would link with -laio and use the functions there for hooking into the syscalls. However, it's only really async for direct uncached IO (files opened with O_DIRECT). This last little detail essentially means that it's largely useless to most people. Oracle uses it, and other databases may as well. But nobody uses aio on the desktop or elsewhere, simply because it isn't a good fit when O_DIRECT is required - you then need aligned IO buffers and transfer sizes, and you lose readahead. The alignment and size restrictions also make it difficult to convert existing apps to use libaio with O_DIRECT, since it requires more than just a straightforward conversion. It adds complexity.
Over the years, several alternative approaches to async IO have been proposed/developed. We've had several variants of uatom/syslet/fibril/threadlet type schemes, which all boil down to (just about) the same type of implementation - when a process is about to block inside the kernel, a cloned process/thread returns to userspace on behalf of the original process and informs it of the postponed work. The completed events can then later be retrieved through some sort of get_event/wait_event type interface, similar to how you reap completions with other types of async IO. Or the implementation provided a callback type scheme, similar to how eg a signal would be received in the process. This sort of setup performed acceptably, but suffered from a split-personality disorder, since the process identity changes on return from the kernel. Various approaches to "fix" or alleviate this problem weren't particularly pretty.
Recently, Zach Brown started playing with something he calls acall. The user interface is pretty neat, and (like the syslet type implementations above) it allows for performing all system calls in an async nature. Zach took a more simplified approach to making it async in the kernel, by punting every operation to a thread in the kernel. This obviously works and means that the submission interface is very fast, which is of course a good thing. It also means that some operations are going to be performed by a process other than the one that requested them, which has implications for IO scheduling in the system. Advanced IO schedulers like CFQ track IOs on a per-process basis, and they thus need the specific process context for both performance and accounting reasons. Last year I added a CLONE_IO flag for sharing the IO context across processes in situations like this, so this part is actually easily fixable by just using that clone flag when creating the kernel worker thread. Obviously, doing a fork() like operation for every async system call isn't going to scale, so some sort of thread pooling must be put in place to speed it up. Not sure what Zach has done there yet (I don't think a version with that feature has been released yet), but even with that in place, there's still going to be identity fiddling when a thread takes over work from a user space process. Apart from the overhead of juggling these threads, there's also going to be a substantial increase in context switch rates with this type of setup. And this last bit is mostly why I don't think the acall approach will end up performing acceptably for doing lots of IO, while it does seem quite nice for the more casual async system call requirements.
A third and final possibility also exists, and this is what I have been trying to beat into submission lately. Back in the very early 2.6 kernel days, Suparna Bhattacharya led an effort to add support for buffered IO to the normal fs/aio.c submission path. The patches sat in Andrew's -mm tree for some time, before they eventually got dropped due to Andrew spending too much time massaging them into every -mm release. They work by replacing the lock_page() call done to wait for IO completion on a page with another helper function - I call it lock_page_async() in the current patches. If the page is still locked when this function is called, we return -EIOCBRETRY to the caller, informing it that it should retry this call later on. When a process in the kernel wants to wait for some type of event, a wait queue is supplied. When that event occurs, the other end does a wake up call on that wait queue. This last operation invokes a callback in the wait queue, which normally just does a wake_up_process() like call to wake the process blocked on this event. With the async page waiting, we simply provide our own wait queue in the process structure, and the callback given from fs/aio.c then merely adds the completed event to the aio ringbuffer associated with the IO context for that IO and process.
This last approach means that we have to replace the lock_page() calls in the IO path with lock_page_async() and be able to handle an "error" return from that function. But apart from that, things generally just work. The original process is still the one that does the IO submission, thus we don't have to play tricks with identity theft or pay the large context switch fee for each IO. We also get readahead. In other words, it works just like you expect buffered IO to work. Additionally, the existing libaio interface works for this as well. Currently my diffstat looks like this:
19 files changed, 445 insertions(+), 153 deletions(-)
for adding support for the infrastructure, buffered async reads, and buffered async writes for ext2 and ext3. That isn't bad, imho.
Initial performance testing looks encouraging. I'll be back with more numbers and details, as progress and time permits!