Intel Optane SSD DC P4800X 375GB Review - Enterprise 3D XPoint

Subject: Storage
Manufacturer: Intel

Enterprise SSD Testing and Jargon

While enterprise SSDs typically share controller and flash memory architecture with consumer products shipping from the same company, there are some important differences to take note of. Enterprise units are typically equipped with high grade / more stringently binned flash memory components. Additional flash is installed proportional to the available capacity (overprovisioning) allowing for improved random write performance and greater endurance. Controller firmware is developed, optimized, and tuned for the type of workloads expected to be seen in its use. Enterprise parts go through more rigorous quality control testing.

If you think through the way you would test an enterprise SSD, you must first cast off the idea of running consumer-style benchmarks, which are typically performed on a partially filled drive and only apply their workload to a fraction of the available space. This is not what an enterprise SSD is designed for. It is also worth considering should you want to purchase an enterprise SSD for a system that would only ever see consumer-style workloads: the firmware tuning of enterprise parts may actually result in poorer performance in some consumer workloads. While consumer SSDs lean towards combining bursts of random writes into large sequential blocks, such operations cannot be sustained indefinitely without sacrificing long term performance. Enterprise SSDs take the ‘slow and steady’ approach when subjected to random writes, forgoing heavy write combination operations in the interest of maintaining more consistent IOPS and lower latencies over time. Lower sustained write latencies are vital to the datacenters employing these devices.

Transfer Size

If you have ever combed through the various reviews of a given enterprise SSD, you will first note how ‘generic’ the data is. You won’t see specific applications used very often; instead you will see only a handful of small workloads applied. These workloads are common to the specifications seen across the industry, and typically consist of 4KB and 8KB transfer sizes for random operations and 128KB sizes for sequential operations. 4KB and 8KB cover the vast majority of OLTP (on-line transaction processing) and database (typically 8KB) usage scenarios. 128KB became the default maximum transfer size as it meshes neatly with the maximum IO size that many OS kernels will issue to a storage device. Little known fact: Windows kernels will not issue transfer sizes larger than 128KB to a storage device. If an application makes a single 1MB request (QD=1) through the Windows API, that request is broken up by the kernel into eight 128KB sequential requests that are issued to the storage device simultaneously (QD=8, or up to the Queue Depth limit for that device). I’m sorry to break it to you, but that means any benchmark apps you might have seen reporting results at block sizes >128KB were actually causing the kernel to issue 128KB requests at inflated queue depths.
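The request splitting described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how any kernel is actually implemented; the function names and the 32-entry device limit in the usage lines are assumptions made for the example.

```python
KIB = 1024
MAX_TRANSFER = 128 * KIB  # per-request ceiling described in the text


def split_request(offset, length, max_transfer=MAX_TRANSFER):
    """Split one (offset, length) request into chunks no larger than max_transfer."""
    chunks = []
    end = offset + length
    while offset < end:
        size = min(max_transfer, end - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


def effective_queue_depth(app_qd, length, device_qd_limit, max_transfer=MAX_TRANSFER):
    """Queue depth seen by the device when app_qd requests of `length` bytes are outstanding."""
    per_request = len(split_request(0, length, max_transfer))
    return min(app_qd * per_request, device_qd_limit)


# A single 1MB request (QD=1) against a hypothetical device with a 32-entry queue.
print(len(split_request(0, 1024 * KIB)))            # 8 chunks of 128KB
print(effective_queue_depth(1, 1024 * KIB, 32))     # 8
print(effective_queue_depth(8, 1024 * KIB, 32))     # capped at 32
```

A benchmark that reports '1MB at QD=8' would, under these assumptions, really be exercising the device with 128KB requests at whatever depth the queue limit allows.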

Queue Depth

Alright, now with the transfer sizes out of the way, we come to another extremely important factor in testing these devices: Queue Depth (QD). Command queueing has been around since the early SCSI and ATA (before SATA) days. Hard Disk Drives that supported Native Command Queueing (NCQ) could coordinate with the host system, receive a short list of the pending IO requests, and even fulfill those requests out of the order received. This made access to the relatively slow disk much more efficient, as the drive knew what was coming, as opposed to the old method, which issued IO requests one at a time. With optimized algorithms in the HDD firmware, NCQ can show boosts of up to 200% in random IOPS performance when compared to the same drive operating without a queue.

Fast forward to the introduction of SSDs. With no HDD head pack whose read pattern needed optimizing, queueing was still useful, as an SSD controller could leverage the queue to address multiple flash dies across multiple internal data channels simultaneously, greatly improving the possible throughput (especially with smaller random transfers). ATA / SATA / AHCI devices are held to the legacy limit of 32 items in the queue (QD=32), but that is more than sufficient to saturate the now relatively limited maximum bandwidth of 6Gbit/sec. PCIe (AHCI) devices can go higher, and the NVMe specification was engineered to allow queue depths as high as 65536 (2^16), and can also support the same number of simultaneous queues! Having multiple queues is a powerful feature, as it helps to minimize excessive context switching across processor cores. Present day NVMe drivers typically assign one queue to each processor thread, avoiding the resource contention that would occur if all cores and threads had to share a single large queue.

Realize that there are only so many flash dies and so much communication bandwidth available on a given SSD, so we won’t see SSDs operating near these new, higher queueing limits any time soon.

Before moving on, it's worth noting that while NVMe can handle multiple queues and very high depths, the majority of actual enterprise workloads are unlikely to exceed QD=64 and even that is a rarity. Another item to consider is that storage devices with higher performance at lower queue depths will effectively 'shallow the queue', meaning the same workload applied to two devices will balance out to vastly different queue depths if one of those devices is able to service the requests faster than they are incoming.
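One way to put numbers on the 'shallow the queue' effect is Little's Law, which says the average number of outstanding requests equals the arrival rate multiplied by the time each request spends in the device. The sketch below applies it to two made-up devices; the IOPS and latency figures are hypothetical, and the calculation assumes a steady state in which each device keeps up with the offered load.

```python
def mean_queue_depth(offered_iops, mean_latency_us):
    """Little's Law: average outstanding IOs = arrival rate x time each IO spends in the device."""
    return offered_iops * mean_latency_us / 1_000_000


# Hypothetical figures chosen only for illustration, not measurements.
OFFERED_IOPS = 100_000
fast_device_qd = mean_queue_depth(OFFERED_IOPS, 10)    # 10 microseconds per IO
slow_device_qd = mean_queue_depth(OFFERED_IOPS, 100)   # 100 microseconds per IO

print(f"fast device settles near QD={fast_device_qd:.1f}")
print(f"slow device settles near QD={slow_device_qd:.1f}")
```

The same offered workload lands at a queue depth ten times lower on the device with one tenth the latency, which is why low-QD performance matters so much for a low latency part.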

% Read / Write

Alright, so we have transfer sizes and queue depths, but we are not done. Another important variable is the percentage of reads vs. writes being applied to the device. A typical figure thrown around for databases is 70/30, meaning just under 3/4 of the workload consists of read operations. Other specs imply the ratio in their name (4KB random write = 0/100, or 0% reads). Another spec typically on this line is ‘100%’, as in ‘100% 4KB random write’. In this context, ‘100%’ is not talking about a read or write percentage; it is referring to the fact that 100% of the drive span is being accessed during the test. The span of the drive represents the range of Logical Blocks (LBAs) presented to the host by the SSD. Remember that SSDs are overprovisioned and have more flash installed than they make available to the host. This is one of the tricks that enable an enterprise SSD to maintain higher sustained performance as compared to a consumer SSD. Consumer SSDs typically have 5-7% OP, while enterprise SSDs will tend to have higher values based on their intended purpose. ‘ECO’ units designed primarily for reads may run closer to consumer levels of OP, while other units designed to handle sustained small random writes could run at 50% or higher OP. Some enterprise SSDs come with special tools that enable the system builder to dial in their own OP values based on the intended workload and desired performance and endurance.
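As a rough illustration of how an OP figure is arrived at, the sketch below uses the common convention of expressing spare flash as a percentage of the user-visible capacity. The function name and the example capacities are assumptions for the example and do not describe any specific product.

```python
def overprovisioning_pct(raw_bytes, user_bytes):
    """Spare flash expressed as a percentage of the capacity exposed to the host."""
    return (raw_bytes - user_bytes) / user_bytes * 100


# Hypothetical consumer drive: 256 GiB of raw flash sold as 256 GB (decimal).
# The binary vs. decimal gap alone accounts for roughly 7% OP.
consumer = overprovisioning_pct(256 * 2**30, 256 * 10**9)

# Hypothetical write-oriented enterprise drive: 600 units of flash, 400 exposed.
enterprise = overprovisioning_pct(600, 400)

print(f"consumer OP:   {consumer:.2f}%")
print(f"enterprise OP: {enterprise:.2f}%")
```

Dialing OP up with a vendor tool amounts to shrinking `user_bytes` while `raw_bytes` stays fixed, which trades usable capacity for sustained write performance and endurance.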


Latency is not a variable we put into our testing, but it is our most important result. IOPS alone does not tell the whole story, as many datacenter workloads are very sensitive to the latency of each IO request. Imagine that the system first needs one piece of data in order to perform some mathematical work, and must then save the result back to the flash. This sequential operation spends much of its time waiting on the storage subsystem, and latency is the amount of time spent waiting on each of those IO requests. The revision of our testing covered in today's article reports both average latency (next page) and a fine-grained analysis of Latency Percentiles at PACED workloads (two pages ahead).
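To see why percentiles are reported alongside the average, consider the minimal sketch below. The latency samples are invented for illustration, and the nearest-rank percentile used here is just one simple definition, not necessarily the method behind the charts in this review.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-IO latencies in microseconds: mostly fast, with two slow outliers.
latencies_us = [10] * 98 + [1000, 5000]

print("average:", statistics.mean(latencies_us))   # pulled upward by the outliers
print("p50:    ", percentile(latencies_us, 50))
print("p99:    ", percentile(latencies_us, 99))
print("p99.9:  ", percentile(latencies_us, 99.9))
```

Here the average is several times the median even though 98% of requests completed at the median, and only the upper percentiles reveal how bad the slowest requests were.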

April 20, 2017 | 07:52 PM - Posted by Uniblaab2

Thanks for the review (pre-consumer) of Optane, which I had been waiting on for a while now. First none and now two, one on another site that I respect. Big thanks for the latency graphing from 1 clock cycle to a floppy drive. Very informative, and something I was wondering about after getting a picture of Intel placing the idea that it could be a go-between for storage and DIMMs. You test at very high queue depths but seem to state that some testing for a web server is not the best idea. Isn't it true that a web server is the only place where high queue depths are to be seen? If so, and queue depths normally seen are much lower, where would one expect to see such high queue depths? Or is it, as you seem to say, just a test to test?

Thanks for the article. I will have to wait for you to test again when you get one in your hands, and you will likely find that consumers are at the door of another exponential shift like the one where SSDs were used as boot drives once the price came down. We will more than likely start placing our OSes on Optane drives in our SSD systems to gain additional quickness.

When they become available, PCPer "must" see what it will take to boot a computer in a second with an Optane boot drive. With an SSD, 10 seconds is possible. Nuff said.

April 20, 2017 | 09:01 PM - Posted by Allyn Malventano

Regarding high QDs, there are some rare instances, and it is possible that a web server could be hitting the flash so hard that it reaches such a high QD, but if that happens I'd argue that the person speccing out that machine did not give it nearly enough RAM cache (and go figure, this new tech can actually help there as well, since it can supplement / increase effective RAM capacity if configured appropriately).

Regarding why I'm still testing NVMe parts to QD=256, it's mostly due to NVMe NAND part specs for some products stretching out that high. I have to at least match the workloads and depths that appear in product specs in order to confirm / verify performance to those levels.

I'm glad you saw benefit in the bridging the gap charts. Fortunately, my floppy drive still works and I was able to find a good disk! :). I had to go through three zip disks before finding one without the 'click of death'!

April 20, 2017 | 08:13 PM - Posted by 213NSX

Holy smokes!

Hey great work here A, as usual.

April 20, 2017 | 08:21 PM - Posted by MRFS

Ditto that, Allyn: you are THE BEST!

> In the future, a properly tuned driver could easily yield results matching our 'poll' figures but without the excessive CPU overhead incurred by our current method of constantly asking the device for an answer.


The question that arose for me from your statement above
is this:

With so many multi-core CPUs proliferating,
would it help at all if a sysadmin could
"lock" one or more cores to the task
of processing the driver for this device?

The OS would then effectively "quarantine"
i.e. isolate that dedicated core from scheduling
any other normally executing tasks.

Each modern core also has large integrated caches,
e.g. L2 cache.

As such, it occurred to me that the driver
for this device would migrate its way
into the L2 cache of such a "dedicated" core
and help reduce overall latency.

Is this worth consideration, or am I out to lunch here?

Again, G-R-E-A-T review.

April 20, 2017 | 09:04 PM - Posted by Allyn Malventano

Locking a core to storage purposes would sort of help, except you would then have to communicate across cores with each request, which may just be robbing Peter to pay Paul. The best solution is likely a hybrid between polling and IRQ, or polling that has waits pre-tuned to the device to minimize needlessly spinning the core. Server builders will likely not want to waste so many resources constantly polling the storage anyway, so the more efficient the better here.

April 20, 2017 | 08:30 PM - Posted by MRFS

for example:

"Whether you want to squeak out some extra Windows 7 performance on your multi-core processor or run older programs flawlessly, you can set programs to run on certain cores in your processor. In certain situations this process can dramatically speed up your computer’s performance."

April 20, 2017 | 09:07 PM - Posted by Allyn Malventano

I did some experimentation with setting of affinity on the server, and I was able to get latency improvements similar to polling, but there were other consequences such as not being able to reach the same IOPS levels per thread (typical IO requests can be processed by the kernel faster if the various related processes are allowed to span multiple threads). Room for improvement here but not as simple as an affinity tweak is all.

April 21, 2017 | 12:25 PM - Posted by MRFS announces the first ever CLONE AUCTION:

This auction will offer exact CLONES of Allyn Malventano,
complete with his entire computing experience intact.

Minimum starting bid is $1M USD. CASH ONLY.

Truly, Allyn, you are a treasure to the entire PC community.


April 21, 2017 | 10:26 AM - Posted by PCPerFan

I'm glad there was at least one comparison with the 960 pro, which is the most interesting graph in the article. I just wish there were more comparisons.

April 21, 2017 | 10:43 AM - Posted by Allyn Malventano

Your additional answers are coming soon!

April 21, 2017 | 01:57 PM - Posted by MRFS

Speaking of comparisons, I am now very curious to know if Intel plans to develop an M.2 Optane SSD that uses all x4 PCIe 3.0 lanes instead of x2 PCIe 3.0 lanes.

Also, we need to take out a life insurance policy on Allyn, because we want him around to do his expert comparisons when the 2.5" U.2 Optane SSD becomes available.

If Intel ultimately commits to manufacturing Optane in all of the following form factors, we should expect it to be nothing short of disruptive (pricing aside, for now):

(a) AIC (add-in card)
(b) M.2 NVMe
(c) U.2 2.5"
(d) DIMM

I would love to know that a modern OS can be hosted by the P4800X and all future successors!

PCIe 4.0 here we go!