To test the IOPS rate, I wrote a simple fio job file that does 4K random O_DIRECT reads at depth 32. Initial results were pretty shabby. I got ~200K IOPS with a number of devices, and adding more made it drop to ~150K. I was expecting 400K at least, so this was a worry. Some quick profiling didn't show much of interest: some locking overhead, but it is hard to quantify just how much. The test box has 32 cores / 64 threads, which is really nice for testing, but sometimes makes profiling a bit more difficult, since the high CPU count has a tendency to mask some issues.
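For reference, a job file along those lines could look roughly like the sketch below. It is not the exact file used for these numbers: the device paths, the runtime, and the choice of the libaio engine are assumptions made for illustration, and only the 4K block size, random reads, O_DIRECT, and queue depth of 32 come from the description above.

```ini
; Sketch of a fio job file: 4K random O_DIRECT reads at queue depth 32.
; Device paths, runtime, and ioengine are assumptions, not the original setup.
[global]
; random reads
rw=randread
; 4K block size
bs=4k
; O_DIRECT, bypassing the page cache
direct=1
; 32 I/Os in flight per job
iodepth=32
; an async engine is needed for iodepth > 1 to have any effect (assumed: libaio)
ioengine=libaio
; run for a fixed time instead of a fixed amount of data (assumed: 30 seconds)
time_based=1
runtime=30
; report aggregate IOPS across all jobs
group_reporting=1

; one job section per device under test (hypothetical device names)
[dev1]
filename=/dev/sdb

[dev2]
filename=/dev/sdc
```

Adding devices to the test is then just a matter of adding more per-device sections; everything in [global] applies to each of them.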
Booting with only 4 CPUs enabled was much better; I got 430K IOPS easily. Interestingly, the rq completion affinity knob (which I've blogged about before, merged in 2.6.29 and enabled by default in 2.6.32) makes a big difference. Disable it and the IOPS rate drops to ~250K.
Now to find out why we suck at 64 CPUs...