The Micron 9100 MAX 2.4TB U.2 Enterprise SSD Review - P3700 Killer
PACED Latency Performance - Latency+IO Percentile and QoS
Required reading (some additional context for those unfamiliar with our Percentile testing):
- Introduction of Latency Distribution / Latency Percentile (now called IO Percentile)
- Introduction of Latency Weighted Percentile (now called Latency Percentile)
Intro to PACED workloads - 'It's not how fast you go, it's how well you go fast!'
All other reviewers (and even manufacturers) have historically measured performance based on a saturation load at a specified Queue Depth (QD). Unfortunately, that makes any direct latency or QoS comparison against competing SSDs invalid, as each SSD ends up operating at a different IOPS rate. The only way to level the playing field in such a comparison is to remove IOPS as a variable, which turns out to be an extremely difficult thing to do correctly. Fortunately, I was able to build upon the coding groundwork laid by our Percentile testing and apply a given workload in a paced manner. Some test applications (namely Iometer) have a crude form of pacing built in, but that is a ‘bursty’ test and is actually quite the opposite of what we are doing here. Our PACED workloads are issued as a steady stream of IOs to the SSD being tested. If we dial in 100,000 IOPS, the SSD receives a new IO request every 10 microseconds.
Since SSDs don’t typically service all IO requests within 10 microseconds, those requests will naturally ‘stack’, leading to an elevated operating QD. Our test is written in such a way that it allows the SSD to ‘float’ to whatever QD it needs to handle the load (which just so happens to be *exactly* what happens in real-world usage - funny how that works, isn’t it?). As an added bonus, we are able to measure the resulting QD, and will be providing it in the legend, replacing the spot previously occupied by IOPS.
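To illustrate how a paced stream lets QD float, here is a minimal simulation sketch (purely illustrative - this is not our actual test harness, and the fixed `service_time_us` is a simplifying assumption). With a new IO arriving every 1e6/IOPS microseconds and each IO taking longer than that to service, in-flight requests stack until the average QD settles near IOPS × average latency (Little's law):

```python
import heapq

def simulate_paced(iops, service_time_us, n_ios):
    # Issue one IO every (1e6 / iops) microseconds and let queue depth float.
    interval = 1_000_000 / iops           # 100,000 IOPS -> 10 us between IOs
    in_flight = []                        # min-heap of completion timestamps
    qd_samples = []
    for i in range(n_ios):
        now = i * interval                # issue time of this IO
        # Retire any IOs that completed before this new request arrives.
        while in_flight and in_flight[0] <= now:
            heapq.heappop(in_flight)
        # Simplification: every IO takes a fixed service_time_us to complete.
        heapq.heappush(in_flight, now + service_time_us)
        qd_samples.append(len(in_flight))
    return sum(qd_samples) / len(qd_samples)
```

With 100,000 IOPS and an assumed 80 µs service time, the simulated QD settles at roughly 8 - the IOPS × latency product - without the test ever dictating a QD.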
To further justify pacing, step back and consider that the typical enterprise storage server doesn't sit there all day pegging its SSDs to saturation at the highest possible QD. Data center managers keep an eye on their workloads and (smartly) deploy additional parallel storage to handle increases in total workload. They do this because racks full of saturated SSDs would consistently exceed the maximum latency specified in the Service Level Agreements with their customers. Some other data points:
- OLTP servers will be limited by the natural rate of incoming transactions, and if a server like that was completely saturating its storage, transactions would be timing out left and right.
- VMware's VSAN (used for VM storage needs) is self-limited to 10% CPU (page 56), meaning heavy IO loads will very likely be throttled before they reach the installed SSDs.
- File servers relying on flash storage will typically have a large array to achieve the desired total capacity, multiplying the potential SSD throughput far beyond 10Gbit or even 40Gbit network links.
One final note before we get into the results. Care must be taken when selecting the PACED configuration, as we must remain within the capability of the slowest SSD in the comparison. Oversaturating a given SSD is immediately apparent: its QD skyrockets, and it logs a lower-than-expected total IO count at the end of the run. The maximum total IO mismatch delta seen in all of the following tests came out to 0.018%, so it’s safe to say that both SSDs were loaded identically. With that out of the way, onto the show:
I have chosen to lead here with Latency Percentile (formerly ‘Latency Weighted’), and will cover IO Percentile after the QoS results. Latency Percentile is a translation of the latency distribution that takes into account the time spent servicing the IOs, meaning the above plots show the percentage of the total run time spent on those IOs. The results are effectively weighted by latency, where longer IOs have a larger impact on the percentile. These results are *not* used for QoS calculations, since QoS assumes the IOs are all independent, meaning if one IO stalls, the rest will just keep on going without waiting for it. I nearly omitted Latency Percentile plots from this review until I noted that they paint a much clearer picture of the performance deltas seen throughout this review. In the above four (Latency Percentile) plots, the Micron 9100 MAX is far more clearly the winner, with a much lower QD at load, yet the IO Percentile plots (bottom four on this page) make it look like a much closer battle than it actually was.
Commenting on the above results directly, the Micron 9100 MAX is a clear winner across the board. While the Intel P3700 has a lower minimum latency, the 9100 services a higher overall percentage of IOs faster than the Intel part.
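The difference between the two percentile views can be sketched numerically (a simplified illustration of the definitions above, not our actual tooling):

```python
import numpy as np

def io_percentile_curve(latencies_us):
    # IO Percentile: fraction of IOs completed at or below each latency.
    lat = np.sort(np.asarray(latencies_us, dtype=float))
    return lat, np.arange(1, lat.size + 1) / lat.size

def latency_percentile_curve(latencies_us):
    # Latency Percentile: fraction of total service TIME spent on IOs at
    # or below each latency - slow IOs carry proportionally more weight.
    lat = np.sort(np.asarray(latencies_us, dtype=float))
    return lat, np.cumsum(lat) / lat.sum()
```

For a hypothetical run of 99 IOs at 100 µs plus a single 10 ms straggler, IO Percentile reports 99% of IOs at or under 100 µs, while Latency Percentile reports only ~50% of the run time spent there - the straggler dominates the time-weighted view, which is why these plots exaggerate (correctly) the cost of tail latency.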
Quality of Service (QoS)
QoS is specified in percentages (99.9%, 99.99%, 99.999%), uniquely spoken as ‘three nines’, ‘four nines’, and ‘five nines’. It corresponds to the latency below which 99.x% of all recorded IOs in a run completed. Note that these comparative results are derived from IO Percentile data and *not* from Latency Percentile data. Thanks to the PACED application of the workload, we can now validly compare QoS between competing SSDs (without varying IOPS as a factor):
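As a sketch of what the ‘nines’ mean in practice (using numpy's stock percentile interpolation here rather than our custom binning):

```python
import numpy as np

def qos_nines(latencies_us, nines=(99.9, 99.99, 99.999)):
    # Latency at each 'nines' percentile of all recorded IOs in the run.
    lat = np.asarray(latencies_us, dtype=float)
    return {n: float(np.percentile(lat, n)) for n in nines}
```

Note that five nines of a run means only 1 in 100,000 IOs exceeded the reported latency - so a run needs a very large IO count before that figure is statistically meaningful.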
The 9100 looks great in mixed PACED workloads, but those 100% random write results are damn impressive, with the P3700's 99.999% QoS coming in at 242x the latency of the 9100 MAX (yes, you read that right).
Funny story - the high-resolution latency distribution binning employed back when I started this whole percentile effort was actually done as a means to get more accurate QoS figures. It took nearly a year to come full circle, but those simple charts are the result of a *lot* of back-end work to improve the accuracy of the results. A major player in the enterprise SSD market admitted to me that they derive their QoS specs from a 21-bin histogram, where interpolation must be used to deduce the latency at a given IO percentile. We use a 600-bin histogram (!), and while we still interpolate to further increase accuracy, our resolution is high enough that we really don’t need to. Our custom-built solution lets us dial the number of bins as high as available system memory allows, but accuracy gains beyond the 600 figure proved negligible in testing. Moral of the story: don't let the simplicity of the QoS charts fool you, given the complexity that went into producing those few numbers.
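A minimal sketch of deriving the latency at a given IO percentile from a binned histogram with linear interpolation (the bin edges and counts below are hypothetical; the actual bin layout in our tooling differs):

```python
def latency_at_percentile(bin_edges, counts, pct):
    # bin_edges has len(counts) + 1 entries; counts[i] is the number of IOs
    # whose latency fell within [bin_edges[i], bin_edges[i + 1]).
    total = sum(counts)
    target = total * pct / 100.0          # IO count at the requested percentile
    cum = 0
    for i, c in enumerate(counts):
        if c > 0 and cum + c >= target:
            # Linearly interpolate within the bin that crosses the target.
            frac = (target - cum) / c
            return bin_edges[i] + frac * (bin_edges[i + 1] - bin_edges[i])
        cum += c
    return bin_edges[-1]
```

With only 21 bins, the interpolation step carries most of the load; at 600 bins, the bin containing the target percentile is already so narrow that interpolation barely moves the answer.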
As promised, here are the IO percentile plots on which the above QoS data was based:
Looking at that last chart, you'd think the P3700 was the winner with so many IOs coming in faster than the 9100 MAX, but the key is how long those slower IOs (tail latency) took to complete.