Triple M.2 Samsung 950 Pro Z170 PCIe NVMe RAID Tested - Why So Snappy?
Latency Distribution and Latency Percentile
To demonstrate the performance differences with these RAID configurations, I will be using an improved version of the Latency Distribution / Latency Percentile testing first introduced in our original 950 Pro review. I recommend briefly reviewing that page for an idea of how this testing works and its benefits in demonstrating large differences in latency (HDDs on the same chart as SSDs!). While the charts will be in a similar layout, they are now plotted at a higher resolution (600 points per line over the six-decade logarithmic scale), and there are a number of behind-the-scenes changes to help clear up the presentation of the data.
The above chart is a bit busy and hard to read, but I've included it for proper perspective on where the following charts are derived from. It shows where each IO falls with respect to the time it took to be serviced (latency). The ideal result here would be a narrow peak as far to the left as possible, meaning that all IOs are serviced very quickly, with no stragglers causing unwanted delays. Let's quickly move on to the Latency Percentile translation of the above data, as well as the Percentile plots for a RAID of two and three 950 Pros:
With Latency Percentile, we see a 'neater' representation of the Latency Distribution data, in that we get a profile that slopes from 0% to 100% of the total IOs serviced by the SSD / array being tested. The ideal here is as steep a slope as possible, with that slope shifted as far to the left as possible. If the slope tapers off, that means that a percentage of IOs are taking longer to be serviced, and where that percentage falls towards the right along the horizontal axis corresponds to the (higher) latency that those remaining IOs took to complete. You might have seen enterprise SSDs quote '99% latency' specs, meaning that 99% of the IOs fall at or below a specified latency - this is the exact type of chart from which such results would be derived.
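For those who want to see the mechanics, here is a minimal Python sketch of how a percentile curve can be built from a latency distribution and how a '99% latency' figure can be read off it. It assumes log-spaced bins (600 over six decades) starting at 1μs; the starting point, the function names, and the sample latencies are all made up for illustration and are not our actual test tooling or data.

```python
from bisect import bisect_left

def log_bin_edges(lo_us=1.0, decades=6, points=600):
    """Upper edges of `points` log-spaced bins covering `decades` decades above lo_us."""
    return [lo_us * 10 ** (decades * (i + 1) / points) for i in range(points)]

def latency_percentile(latencies_us, edges):
    """Cumulative percentage of IOs serviced at or below each bin edge."""
    counts = [0] * len(edges)
    for lat in latencies_us:
        # First bin whose upper edge is >= this latency; clamp outliers to the last bin.
        idx = min(bisect_left(edges, lat), len(edges) - 1)
        counts[idx] += 1
    total = len(latencies_us)
    running, curve = 0, []
    for c in counts:
        running += c
        curve.append(100.0 * running / total)
    return curve

def latency_at_percentile(edges, curve, pct):
    """Smallest bin edge at which at least `pct` percent of IOs have completed."""
    for edge, cum in zip(edges, curve):
        if cum >= pct:
            return edge
    return edges[-1]

# Hypothetical sample: mostly fast IOs with a few stragglers (latencies in microseconds).
samples = [20.0] * 90 + [500.0] * 9 + [8000.0]
edges = log_bin_edges()
curve = latency_percentile(samples, edges)
print("99% latency (us), rounded up to its bin edge:",
      latency_at_percentile(edges, curve, 99.0))
```

Note that the result is rounded up to the nearest bin edge, so the resolution of the bins limits the precision of any percentile figure pulled from the curve.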
As you’ll note in the legend of the above charts, the IOPS does scale up as drives are added. Random write IOPS ramps nicely from 81k, to 174k with a second, and finally to just over 300k with the third 950 Pro added to the array. One of the main points I wanted to bring up in this article was that the reason an SSD ‘feels faster’ is not just because of the sheer IOPS increases – it is also due in large part to how the decrease in load seen by each SSD individually results in a major drop in overall IO latency.
To better demonstrate this, here is what it looks like when we test at a constant queue depth (QD=16) with varying numbers of 950 Pros:
You’ll note that the step decreases in latency percentile look similar to the percentile shifts seen at the lower queue depths of our tests with fewer SSDs. As an additional data point, the 90th percentile point of that last chart showing the RAID spread at QD=16 shows latency dropping to 1/6 of the single SSD. Yes, you are reading that correctly: the single SSD is taking *six times as long* to service IO requests at the 90th percentile.
In simple terms, SSDs respond faster at lower ‘loads’, and running multiple SSDs in a RAID divides that load across them. You get the performance boost and IOPS scaling of a RAID, but you *also* get an overall reduction in latency for a given IO load. This effect is compounded further when you consider that the system has to do something with those IOs it is requesting, and it is likely that a given application / CPU will not be able to request the multiplied ultimate IOPS of the RAID, meaning the array as a whole will run at a lower queue depth than a single SSD would. Since the queue is divided across the array, each SSD will be running at a queue depth even lower than its fraction of the array. An example to clarify: An application that reached QD=16 on a single SSD might only reach QD=9 on the array, driving the individual SSDs of a triple SSD RAID down to QD=3 (where the straight math would have given you QD=5.3).
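The arithmetic in that example can be spelled out in a few lines of Python. This is only a sketch: it assumes the outstanding IOs are split evenly across the drives on average, and the function name is hypothetical. The QD values are the ones from the example above.

```python
def per_drive_qd(array_qd: float, drives: int) -> float:
    """Average queue depth each drive sees, assuming an even split across the array."""
    if drives < 1:
        raise ValueError("need at least one drive")
    return array_qd / drives

single_ssd_qd = 16  # QD the application reached on a single SSD
array_qd = 9        # QD the same application reached on the array
drives = 3          # triple SSD RAID

straight_math = per_drive_qd(single_ssd_qd, drives)  # 16 / 3
in_practice = per_drive_qd(array_qd, drives)         # 9 / 3

print(f"straight math: QD={straight_math:.1f} per SSD")  # straight math: QD=5.3 per SSD
print(f"in practice:   QD={in_practice:.0f} per SSD")    # in practice:   QD=3 per SSD
```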
Putting all of this together, running at an effectively lower Queue Depth (from the SSDs' perspective) results in faster overall response of the array. Looking at the individual SSD latency results, running at those lower queue depths makes an even larger difference in IO consistency. The end result is that a RAID of SSDs gives you a much greater chance of IOs being serviced as rapidly as possible, which accounts for that 'snappier' feeling experienced by veterans of SSD RAID.
Reads see the same type of effect, though it is less pronounced:
For the set of NVMe SSDs we were testing here, the spread on reads was not as wide as it was on writes, but realize that we are on a logarithmic scale, and linear differences become less apparent as you move to the right. There was still a 20% reduction in the 90th percentile latency across the spread at QD=16 in reads.
One additional point those with a keen eye will have noted is that in both cases (writes and reads, though less noticeable on reads due to the log scale), shifting from a single SSD to a RAID adds a ~6μs delay to *all* IO requests. This is the overhead cost of Intel's RAID implementation, and it represents things like the time taken to translate the IO addresses to the array. This added delay does have an impact at very low queue depths, but it is almost immediately outweighed by the increased 'acceleration' of the array once the queue depth climbs by just a single point.