Intel Optane SSD DC P4800X 750GB Review - In the Flesh
Enterprise SSD Testing and Jargon
While enterprise SSDs typically share controller and flash memory architecture with consumer products shipping from the same company, there are some important differences to note. Enterprise units are typically equipped with higher-grade, more stringently binned flash memory components. Additional flash is installed beyond the user-visible capacity (overprovisioning), allowing for improved random write performance and greater endurance. Controller firmware is developed, optimized, and tuned for the types of workloads the drive is expected to see, and enterprise parts go through more rigorous quality control testing.
If you think through how you would test an enterprise SSD, you must first cast off the idea of running consumer-style benchmarks, which are typically performed on a partially filled drive and apply their workload to only a fraction of the available space. That is not what an enterprise SSD is designed for, and it is worth considering if you want to purchase an enterprise SSD for a system that would only ever see consumer-style workloads: the firmware tuning of enterprise parts may actually result in poorer performance in some consumer workloads. Consumer SSDs lean toward combining bursts of random writes into large sequential blocks, but such operations cannot be sustained indefinitely without sacrificing long-term performance. Enterprise SSDs take a 'slow and steady' approach when subjected to random writes, forgoing heavy write-combining operations in the interest of maintaining more consistent IOPS and lower latencies over time. Lower sustained write latencies are vital to the datacenters employing these devices.
If you have ever combed through the various reviews of a given enterprise SSD, you will first note how 'generic' the data is. You won't often see specific applications used - instead you will see only a handful of small workloads applied. These workloads are common to the specifications seen across the industry, and typically consist of 4KB and 8KB transfer sizes for random operations and 128KB for sequential operations. 4KB and 8KB cover the vast majority of OLTP (on-line transaction processing) and database (typically 8KB) usage scenarios. 128KB became the default sequential transfer size because it meshes neatly with the maximum IO size that many OS kernels will issue to a storage device. Little-known fact: Windows Operating System kernels will not issue transfer sizes larger than 128KB to a storage device. If an application makes a single 1MB request (QD=1) through the Windows API, that request is broken up by the kernel into eight 128KB sequential requests that are issued to the storage device simultaneously (QD=8, or up to the queue depth limit for that device). I'm sorry to break it to you, but that means any benchmark apps you might have seen reporting results at block sizes >128KB were actually causing the kernel to issue 128KB requests at inflated queue depths.
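That kernel splitting behavior can be modeled with a few lines of Python. This is an illustrative sketch, not actual kernel code; the 128KB cap is the figure described above, and the function name is my own:

```python
def split_io(size_bytes, max_transfer=128 * 1024):
    """Model a kernel with a 128KB maximum transfer size breaking one
    large application request into multiple device-sized commands."""
    full, rem = divmod(size_bytes, max_transfer)
    return [max_transfer] * full + ([rem] if rem else [])

# A single 1MB request becomes 8 x 128KB commands issued together, so the
# device sees QD=8 even though the application issued one request (QD=1).
commands = split_io(1024 * 1024)
print(len(commands), commands[0])  # 8 commands of 131072 bytes each
```

This is why a benchmark reporting "1MB QD=4" results on Windows is really measuring 128KB transfers at QD=32 (or whatever the device's queue limit allows).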
Alright, with transfer sizes out of the way, we come to another extremely important factor in testing these devices: the Queue Depth (QD). Command queueing dates back to the early SCSI and ATA (pre-SATA) days. Hard disk drives supporting Native Command Queueing (NCQ) could coordinate with the host system, receive a short list of pending IO requests, and even fulfill those requests out of the order received. This made access to the relatively slow disk much more efficient, as the drive knew what was coming, in contrast to the old method of issuing IO requests one at a time. With optimized algorithms in the HDD firmware, NCQ can show boosts of up to 200% in random IOPS when compared to the same drive operating without a queue.

Fast forward to the introduction of SSDs. Instead of optimizing the read pattern of an HDD head pack, the queue lets an SSD controller address multiple flash dies across multiple internal data channels simultaneously, greatly improving the possible throughput (especially with smaller random transfers). ATA / SATA / AHCI devices are held to the legacy limit of 32 items in the queue (QD=32), but that is more than sufficient to saturate SATA's now relatively limited maximum bandwidth of 6Gbit/sec. PCIe (AHCI) devices can go higher, and the NVMe specification was engineered to allow queue depths as high as 65536 (2^16) - and it supports the same number of simultaneous queues! Having multiple queues is a powerful feature, as it helps minimize excessive context switching across processor cores. Present-day NVMe drivers typically assign one queue to each processor thread, avoiding the resource contention and context switching that would occur if all cores and threads had to share a single large queue.
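The HDD-era benefit of command queueing is easy to sketch: given a queue of pending requests, the drive can service them in an order that minimizes head travel rather than first-come-first-served. The LBA values below are made up for illustration, and real firmware uses more sophisticated elevator-style algorithms than a plain sort:

```python
def head_travel(start, lbas):
    """Total seek distance when requests are serviced in the given order."""
    travel, pos = 0, start
    for lba in lbas:
        travel += abs(lba - pos)
        pos = lba
    return travel

pending = [900, 100, 850, 150, 800]        # hypothetical queued request LBAs
fifo = head_travel(0, pending)             # service in arrival order
ncq = head_travel(0, sorted(pending))      # reorder to sweep across the platter
print(fifo, ncq)                           # reordering cuts total seek distance
```

An SSD has no head to move, but the same queue visibility lets its controller dispatch requests to many flash dies in parallel instead of one at a time.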
Realize that there are only so many flash dies and so much communication bandwidth available on a given SSD, so we won't see SSDs operating anywhere near these new, far higher queueing limits any time soon.
Before moving on, it's worth noting that while NVMe can handle multiple queues and very high depths, the majority of actual enterprise workloads are unlikely to exceed QD=64, and even that is a rarity. Another item to consider is that storage devices with higher performance at lower queue depths effectively 'shallow the queue', meaning the same workload applied to two devices will settle at vastly different queue depths if one of those devices can service requests faster than they arrive.
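This queue-shallowing effect follows directly from Little's law, which relates the average number of outstanding requests to arrival rate and service time. The rates and latencies below are hypothetical round numbers chosen for illustration:

```python
def effective_qd(arrival_rate_iops, service_time_s):
    """Little's law: average IOs outstanding = arrival rate x service time."""
    return arrival_rate_iops * service_time_s

workload = 100_000  # IOPS arriving from the host, identical for both devices

# A device servicing each IO in 100us must hold ~10 IOs in flight,
# while a 10us device drains the same workload at an average QD of ~1.
print(effective_qd(workload, 100e-6))  # slower device: QD around 10
print(effective_qd(workload, 10e-6))   # faster device: QD around 1
```

This is why a very low-latency device can post QD=1 figures under a workload that would drive a slower device to much deeper queues.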
% Read / Write
Alright, so we have transfer sizes and queue depths, but we are not done. Another important variable is the percentage of reads vs. writes being applied to the device. A typical figure thrown around for databases is 70/30, meaning just under 3/4 of the workload consists of read operations. Other specs imply the ratio (4KB random write = 0/100, or 0% reads). Another figure typically on this line is '100%', as in '100% 4KB random write'. In this context, '100%' is not talking about a read or write percentage; it refers to the fact that 100% of the drive span is being accessed during the test. The span of the drive represents the range of Logical Block Addresses (LBAs) presented to the host by the SSD. Remember that SSDs are overprovisioned and have more flash installed than they make available to the host. This is one of the tricks that enables an enterprise SSD to maintain higher sustained performance than a consumer SSD. Consumer SSDs typically have 5-7% OP, while enterprise SSDs tend to have higher values based on their intended purpose. 'ECO' units designed primarily for reads may run closer to consumer levels of OP, while units designed to handle sustained small random writes could run at 50% or higher OP. Some enterprise SSDs come with special tools that let the system builder dial in their own OP value based on the intended workload and the desired performance and endurance.
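The overprovisioning percentages above come from a simple relationship between installed flash and user-visible capacity. The capacities below are hypothetical examples, not figures for any specific drive:

```python
def overprovision_pct(raw_gb, usable_gb):
    """OP% = (installed flash - user-visible capacity) / user-visible capacity."""
    return 100 * (raw_gb - usable_gb) / usable_gb

# Hypothetical drives, both built on 512GB of raw flash:
print(round(overprovision_pct(512, 480), 1))  # ~6.7% OP, consumer-style
print(round(overprovision_pct(512, 400), 1))  # 28% OP, enterprise-style
```

The same math runs in reverse when a system builder uses a vendor tool to shrink the usable span: giving up visible capacity raises the OP percentage, trading space for sustained write performance and endurance.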
Latency is not a variable we put into our testing, but it is our most important result. IOPS alone does not tell the whole story, as many datacenter workloads are very sensitive to the latency of each IO request. Imagine a system that must first fetch one piece of data, then perform some mathematical work, and then save the result back to the flash. This sequential operation spends much of its time waiting on the storage subsystem, and latency represents the time waited for each of those IO requests. The revised testing and results covered in today's article are based on both average latency (next page) and a fine-grained analysis of Latency Percentiles at PACED workloads (two pages ahead).
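Percentiles matter because averages hide the tail. A minimal sketch of the nearest-rank percentile method, using made-up latency samples (in microseconds) where a handful of slow outliers barely move the average but dominate the upper percentiles:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical IO latencies in microseconds; mostly fast, with a slow tail.
latencies_us = [90, 95, 100, 105, 110, 120, 150, 300, 1000, 5000]
print(sum(latencies_us) / len(latencies_us))  # average, skewed by outliers
print(percentile(latencies_us, 50))           # median: what most IOs see
print(percentile(latencies_us, 99))           # the tail a datacenter fears
```

Here the median sits near 110us while the 99th percentile is 5000us, a gap the average alone would never reveal - which is exactly why the percentile pages ahead are worth a close read.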