Introduction and Specifications
XPoint. Optane. QuantX. We've been hearing these terms thrown around for two years now. A form of 3D stackable non-volatile memory that promised 10x the density of DRAM and 1000x the speed and endurance of NAND. These were bold statements, and over the following months, we would see them misunderstood and misconstrued by many in the industry. These misconceptions were further amplified by some poor demo choices on the part of Intel (fortunately countered by some better choices made by Micron). Fortunately cooler heads prevailed as Jim Handy and other industry analysts helped explain that a 1000x improvement at the die level does not translate to the same improvement at the device level, especially when the first round of devices must comply with what will soon become a legacy method of connecting a persistent storage device to a PC.
Did I just suggest that PCIe 3.0 and the NVMe protocol - developed just for high-speed storage, is already legacy tech? Well, sorta.
That 'Future NVM' bar at the bottom of that chart there was a 2-year old prototype iteration of what is now Optane. Note that while NVMe was able to shrink down the yellow bar a bit, as you introduce faster and faster storage, the rest of the equation (meaning software, including the OS kernel) starts to have a larger and larger impact on limiting the ultimate speed of the device.
NAND Flash simplified schematic (via Wikipedia)
Before getting into the first retail product to push all of these links in the storage chain to the limit, let's explain how XPoint works and what makes it faster. Taking random writes as an example, NAND Flash (above) must program cells in pages and erase cells in blocks. As modern flash has increased in capacity, the sizes of those pages and blocks have scaled up roughly proportionally. At present day we are at pages >4KB and block sizes in the megabytes. When it comes to randomly writing to an already full section of flash, simply changing the contents of one byte on one page requires the clearing and rewriting of the entire block. The difference between what you wanted to write and what the flash had to rewrite to accomplish that operation is called the write amplification factor. It's something that must be dealt with when it comes to flash memory management, but for XPoint it is a completely different story:
XPoint is bit addressible. The 'cross' structure means you can select very small groups of data via Wordlines, with the ultimate selection resolving down to a single bit.
Since the programmed element effectively acts as a resistor, its output is read directly and quickly. Even better - none of that write amplification nonsense mentioned above applies here at all. There are no pages or blocks. If you want to write a byte, go ahead. Even better is that the bits can be changed regardless of their former state, meaning no erase or clear cycle must take place before writing - you just overwrite directly over what was previously stored. Is that 1000x faster / 1000x more write endurance than NAND thing starting to make more sense now?
Ok, with all of the background out of the way, let's get into the meat of the story. I present the P4800X:
Subject: Storage | February 10, 2017 - 04:22 PM | Allyn Malventano
Tagged: Optane, XPoint, P4800X, 375GB
Over the past few hours, we have seen another Intel Optane SSD leak rise to the surface. While we previously saw a roadmap and specs for a mobile storage accelerator platform, this time we have some specs for an enterprise part:
The specs are certainly impressive. While they don't match the maximum theoretical figures we heard at the initial XPoint announcement, we do see an endurance rating of 30 DWPD (drive writes per day), which is impressive given competing NAND products typically run in the single digits for that same metric. The 12.3 PetaBytes Written (PBW) rating is even more impressive given the capacity point that rating is based on is only 375GB (compare with 2000+ GB of enterprise parts that still do not match that figure).
Now I could rattle off the rest of the performance figures, but those are just numbers, and fortunately we have ways of showing these specs in a more practical manner:
Assuming the P4800X at least meets its stated specifications (very likely given Intel's track record there), and also with the understanding that XPoint products typically reach their maximum IOPS at Queue Depths far below 16, we can compare the theoretical figures for this new Optane part to the measured results from the two most recent NAND-based enterprise launches. To say the random performance makes leaves those parts in the dust is an understatement. 500,000+ IOPS is one thing, but doing so at lower QD's (where actual real-world enterprise usage actually sits) just makes this more of an embarrassment to NAND parts. The added latency of NAND translates to far higher/impractical QD's (256+) to reach their maximum ratings.
Intel research on typical Queue Depths seen in various enterprise workloads. Note that a lower latency device running the same workload will further 'shallow the queue', meaning even lower QD.
Another big deal in the enterprise is QoS. High IOPS and low latency are great, but where the rubber meets the road here is consistency. Enterprise tests measure this in varying degrees of "9's", which exponentially approach 100% of all IO latencies seen during a test run. The plot method used below acts to 'zoom in' on the tail latency of these devices. While a given SSD might have very good average latency and IOPS, it's the outliers that lead to timeouts in time-critical applications, making tail latency an important item to detail.
I've taken some liberties in my approximations below the 99.999% point in these plots. Note that the spec sheet does claim typical latencies "<10us", which falls off to the left of the scale. Not only are the potential latencies great with Optane, the claimed consistency gains are even better. Translating what you see above, the highest percentile latency IOs of the P4800X should be 10x-100x (log scale above) faster than Intel's own SSD DC P3520. The P4800X should also easily beat the Micron 9100 MAX, even despite its IOPS being 5x higher than the P3520 at QD16. These lower latencies also mean we will have to add another decade to the low end of our Latency Percentile plots when we test these new products.
Well, there you have it. The cost/GB will naturally be higher for these new XPoint parts, but the expected performance improvements should make it well worth the additional cost for those who need blistering fast yet persistent storage.