A Closer Look at Intel's Optane SSD DC P4800X Enterprise SSD Performance

Subject: Storage | February 10, 2017 - 04:22 PM
Tagged: Optane, XPoint, P4800X, 375GB

Over the past few hours, we have seen another Intel Optane SSD leak rise to the surface. While we previously saw a roadmap and specs for a mobile storage accelerator platform, this time we have some specs for an enterprise part:

[Image: leaked P4800X spec sheet]

The specs are certainly impressive. While they don't match the maximum theoretical figures we heard at the initial XPoint announcement, we do see an endurance rating of 30 DWPD (drive writes per day), which is impressive given that competing NAND products typically run in the single digits for that same metric. The 12.3 petabytes written (PBW) rating is even more impressive given that it is based on a capacity of only 375GB (compare with 2000+ GB enterprise parts that still do not match that figure).
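Those endurance figures are self-consistent, and the usual DWPD-to-PBW conversion shows it. A quick sketch (the 3-year warranty period here is an assumption on my part; the leaked sheet does not state it):

```python
# Relationship between DWPD (drive writes per day) and total
# petabytes written (PBW), using the usual definition:
#   PBW = capacity * DWPD * warranty_days
# The 3-year warranty period is an assumption, not a leaked spec.

def pbw(capacity_gb: float, dwpd: float, warranty_years: float) -> float:
    """Total petabytes written over the warranty period."""
    bytes_per_day = capacity_gb * 1e9 * dwpd
    return bytes_per_day * warranty_years * 365 / 1e15

# 375 GB at 30 DWPD over ~3 years lands right at the rated 12.3 PBW:
print(round(pbw(375, 30, 3), 1))  # → 12.3
```

Working backwards like this is a handy sanity check on any endurance spec: if the PBW, DWPD, and capacity don't line up with a plausible warranty period, something in the sheet is off.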

Now I could rattle off the rest of the performance figures, but those are just numbers, and fortunately we have ways of showing these specs in a more practical manner:

[Image: IOPS vs. queue depth comparison chart]

Assuming the P4800X at least meets its stated specifications (very likely given Intel's track record there), and understanding that XPoint products typically reach their maximum IOPS at queue depths far below 16, we can compare the theoretical figures for this new Optane part to the measured results from the two most recent NAND-based enterprise launches. To say the random performance leaves those parts in the dust is an understatement. 500,000+ IOPS is one thing, but doing so at low QDs (where real-world enterprise usage actually sits) makes this even more of an embarrassment for NAND parts. The added latency of NAND means those drives need far higher, impractical QDs (256+) to reach their maximum ratings.

[Image: Intel data on enterprise workload queue depths]

Intel research on typical Queue Depths seen in various enterprise workloads. Note that a lower latency device running the same workload will further 'shallow the queue', meaning even lower QD.
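Little's Law makes the queue-shallowing effect concrete: queue depth is just IOPS times latency, so a lower-latency device sustains the same IOPS at a far shallower queue. A rough sketch (the latency figures below are illustrative assumptions, not measured values):

```python
# Little's Law ties IOPS, latency, and queue depth together:
#   QD = IOPS * latency
# A lower-latency device needs a much shallower queue to sustain
# its rated IOPS. The latencies here are illustrative assumptions.

def qd_needed(iops: float, latency_s: float) -> float:
    """Average outstanding IOs required to sustain the given IOPS."""
    return iops * latency_s

# XPoint-class device: 500,000 IOPS at an assumed ~10 us latency
print(round(qd_needed(500_000, 10e-6), 1))   # → 5.0
# NAND-class device: same IOPS at an assumed ~200 us latency
print(round(qd_needed(500_000, 200e-6), 1))  # → 100.0
```

This is exactly why the chart above matters: real enterprise workloads rarely queue deep enough for NAND drives to ever reach their rated figures.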

Another big deal in the enterprise is QoS. High IOPS and low latency are great, but where the rubber meets the road is consistency. Enterprise tests measure this in increasing numbers of "9's" (99%, 99.9%, and so on), each covering exponentially more of the IO latencies seen during a test run. The plot method used below acts to 'zoom in' on the tail latency of these devices. While a given SSD might have very good average latency and IOPS, it's the outliers that lead to timeouts in time-critical applications, making tail latency an important item to detail.
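The way each additional "9" zooms further into the tail can be sketched with synthetic data (a hypothetical nearest-rank `percentile` helper over a fake latency trace; this is not our actual test methodology):

```python
# How "9's" QoS figures come out of a latency trace: each extra 9
# looks at a smaller, slower slice of the tail. Data is synthetic.
import random

random.seed(0)
# Synthetic latency samples in microseconds, pre-sorted for ranking
samples = sorted(random.gauss(100, 10) for _ in range(100_000))

def percentile(data, p):
    """Nearest-rank percentile of pre-sorted data (0 < p < 100)."""
    k = min(len(data) - 1, int(len(data) * p / 100))
    return data[k]

for nines in (99, 99.9, 99.99, 99.999):
    print(f"{nines}th percentile: {percentile(samples, nines):.1f} us")
```

Note that with 100,000 samples the 99.999th percentile is literally the single slowest IO, which is why long test runs are needed before the highest "9's" mean anything.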

[Images: latency percentile (QoS) plots]

I've taken some liberties in my approximations below the 99.999% point in these plots. Note that the spec sheet does claim typical latencies of "<10us", which falls off the left edge of the scale. Not only are the potential latencies with Optane great, the claimed consistency gains are even better. Translating what you see above, the highest-percentile latency IOs of the P4800X should be 10x-100x (log scale above) faster than Intel's own SSD DC P3520. The P4800X should also easily beat the Micron 9100 MAX, despite the 9100 MAX's IOPS being 5x higher than the P3520's at QD16. These lower latencies also mean we will have to add another decade to the low end of our Latency Percentile plots when we test these new products.

Well, there you have it. The cost/GB will naturally be higher for these new XPoint parts, but the expected performance improvements should make it well worth the additional cost for those who need blisteringly fast yet persistent storage.


February 10, 2017 | 05:42 PM - Posted by mAxius

Nice data when can i buy one?

February 11, 2017 | 06:56 AM - Posted by Anonymous (not verified)

As soon as you are a billionaire ;-)
It is going to be expensive.

February 11, 2017 | 11:24 PM - Posted by Anonymous (not verified)

Okay, I'll wait for the AMD memory.

*skeleton emoji*

February 12, 2017 | 10:18 AM - Posted by Anonymous (not verified)

You QuantX some Optane on those DIMMs as a side order of NVM to go with the DRAM.

Intel's Optane or Micron's QuantX is all just XPoint to me! Now let the competition begin.

February 12, 2017 | 11:23 AM - Posted by Anonymous (not verified)

This news just in! Russia announces its own brand of XPoint:
Putintane.

When queried repeatedly about the branding, the Russian leader became agitated and stated: "Is PUTINTANE! Ask me again and I'll tell you the same!"

February 10, 2017 | 06:39 PM - Posted by Paul A. Mitchell (not verified)

Another GOOD ONE, Allyn!

You know my bias already:

I took careful note of this spec:

Form Factor: "PCIe 3.0 x4"

I can't help myself: what happens
if the edge connector is ramped up
to a full x16 PCIe 3.0?

Would it be realistic to scale upwards
by a factor close to 4X?

If not a perfect 4X, then how close?

I know I'm speculating on some future as yet unknown,
but that future is not "over the horizon" any longer!

KEEP UP THE GOOD WORK, as always!!

February 10, 2017 | 06:55 PM - Posted by jgstew

I don't think it would scale up at all unless they made a different one that was more parallel.

500,000 IOPS is ~2000 MB/s, so I think this thing is already at its limit and I don't think more bandwidth would help.

Theoretically there is nothing to prevent a vendor from having enough chips to spread reads/writes across to saturate a PCI-E x16 bus, but usually this would require multiple controllers, which would add latency and complexity, like the OCZ RevoDrive X2 that effectively had 4 SATA SSDs in a RAID 0 on the card.

The alternative would be to have a single "super controller" that would have enough interconnections to all of the chips directly, but that option doesn't scale up and down as easily with the same controller.

February 10, 2017 | 07:02 PM - Posted by Paul A. Mitchell (not verified)

Good points! Thanks for your interest.

February 12, 2017 | 12:44 PM - Posted by Anonymous (not verified)

Why do you need multiple controllers when RAIN (redundant array of independent NAND) and other such technology has been in widespread use for years?

I'm not being sarcastic, I just don't know the necessity. I know that some drives have a PLX bridge chip on them, but I believe that's for drives that require PCIe x8 where the CPU is only capable of addressing each drive at x4, or something like that.

February 10, 2017 | 07:04 PM - Posted by Allyn Malventano

This first gen card is not fully saturating x4, so there is room for improvement. Micron already showed results for a x8 AIC scaling to 1,800,000 IOPS. It's all really just down to having a PHY (controller) capable of moving that much data around. That said, going higher than 2 million IOPS via NVMe means you are throwing a lot of cycles away on overhead, so it's better to go with DIMM connections for lower latency / higher performance out of even smaller form factors without the need for PCIe, controllers, etc.

February 11, 2017 | 06:39 AM - Posted by Anonymous (not verified)

Sounds like NVMe needs some sort of update to make full use of Optane then.

I'd be surprised if AMD had some sort of Optane support for their processors memory controller anytime soon.

February 12, 2017 | 12:49 PM - Posted by Anonymous (not verified)

How are the XPoint DIMMs addressed? If I'm not mistaken, that will be the biggest use of XPoint, and if it's being directly addressed by the memory controller then NVMe won't come into play at all in many cases.

February 10, 2017 | 06:58 PM - Posted by Paul A. Mitchell (not verified)

Found this today:
http://www.tomshardware.com/news/optane-3d-xpoint-intel-p4800x-cold-stre...

[begin quote]

The SSD features the standard PCIe 3.0 x4 connection and offers up to 2,400/2,000 MB/s of sequential read/write throughput over the NVMe interface.

[end quote]

These numbers appear to fall short of MAX HEADROOM for
sequential READs, as follows:

x4 PCIe 3.0 @ 8 GHz per lane / 8.125 bits per byte = 3,938.4 MB/second

1.0 - (2,400 / 3,938.4) = ~ 39.0% aggregate overhead (sequential READs)

p.s. Rather than to choke a 2.5" Optane SSD
with a slow 6G SATA-III interface, Intel should
consider a 2.5" Optane SSD with U.2 connector -and-
a new option to vary the transmission rate
at least to equal 12G SAS, re-computing now:

x4 PCIe 3.0 @ 12 GHz per lane / 8.125 bits per byte = 5,907.7 MB/second

Latter also assumes a 128b/130b "jumbo frame"
(12G still uses an 8b/10b "legacy frame").
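The headroom arithmetic above generalizes neatly. A sketch (note that 12 GT/s is a hypothetical rate for the sake of argument, not an actual PCIe speed, and PCIe 3.0's 128b/130b encoding is where the 8.125 bits-per-byte figure comes from):

```python
# PCIe 3.0 raw bandwidth: lanes * transfer rate / encoding overhead.
# 128b/130b encoding means 8 * 130/128 = 8.125 raw bits per payload
# byte. The 12 GT/s rate below is hypothetical, not a real PCIe speed.

def pcie_bw_mb_s(lanes: int, gt_per_s: float,
                 enc_bits_per_byte: float = 8.125) -> float:
    """Theoretical link bandwidth in MB/s."""
    return lanes * gt_per_s * 1000 / enc_bits_per_byte

print(round(pcie_bw_mb_s(4, 8), 1))   # → 3938.5
print(round(pcie_bw_mb_s(4, 12), 1))  # → 5907.7
```

Against that ~3938 MB/s x4 ceiling, the quoted 2,400 MB/s sequential read figure does indeed leave roughly 39% on the table.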

February 10, 2017 | 07:06 PM - Posted by Allyn Malventano

Yup. Max seen from a 3.0 x4 XPoint device is here (top image), basically saturating the bus at 900,000 IOPS.

February 11, 2017 | 03:47 PM - Posted by Master Chen (not verified)

"Saturating the bus"? But that's just x4. Are you saying that there essentially will be no difference if, for example, they make a PCI-e drive that goes in a full-blown PCI-e x16 slot, with all the pins on-board (like a GPU) and thus fully utilizing the lanes?

February 16, 2017 | 01:21 PM - Posted by Allyn Malventano

No, it was saturating a x4 bus, not a x16 bus.

February 16, 2017 | 02:27 PM - Posted by Master Chen (not verified)

Naruhodou...

February 16, 2017 | 06:21 PM - Posted by Jeremy Hellstrom

Had to Google that; after your random capitalization recently, one wonders which kind of mad you are claiming not to be.

February 17, 2017 | 02:52 AM - Posted by Master Chen (not verified)

Learn to behave, Jeremiah.

February 10, 2017 | 07:30 PM - Posted by Paul A. Mitchell (not verified)

Thanks, Allyn. Another great article.

Can't wait to see your empirical benchmarks.

p.s. I'm not too disappointed that Intel's
earliest claims were exaggerated. If Intel
can deliver some of these current claims,
they'll have something worth investigating
thoroughly ...

e.g. I keep coming back to their
former triple-channel memory architecture:

If such a triple-channel chipset can be
modified to host an OS with Optane
installed in the third "bank",
that major development could usher in
an INSTANT-ON functionality.

Heck, partition that third bank
with TWIN OS partitions, enabling a truly
"Hot Spare OS" -- call it "HOT SAUCE! LOL!!

February 10, 2017 | 10:30 PM - Posted by Allyn Malventano

It's not that the claims were exaggerated as much as they were stating the raw speeds of the base level XPoint. Implementation is a different story, especially since controller tech isn't meant to deal with speeds approaching that of RAM.

February 11, 2017 | 01:48 AM - Posted by Tim Verry

That's true; as they got closer to launching products they kept walking back expectations, but that is just what is possible today. The original numbers, from what I can tell, are more what the XPoint technology is potentially capable of, even if we can't implement that right now. Not as mind-explodingly fast as originally teased, for sure, but even the gen 1 stuff is hella fast for what it is.

February 11, 2017 | 11:28 AM - Posted by Paul A. Mitchell (not verified)

> controller tech isn't meant to deal with speeds approaching that of RAM.

Allyn,

Good point about controller efficiencies:

Here is a parametric comparison of Optane's reported
READ speed with the Samsung 960 Pro's reported READ speed,
projecting into the future with 12G and 16G clock rates:

x4 @ 8G / 8.125 x 61% = 2,400 MB/second (39% overhead)

x4 @ 12G / 8.125 x 61% = 3,600 MB/second (39% overhead)

x4 @ 16G / 8.125 x 61% = 4,800 MB/second (39% overhead)

Vortez.net reports 3,355 / 3,938.4 = 85.1% (14.9% overhead) with Samsung 960 Pro (JBOD):

x4 @ 8G / 8.125 x 85% = 3,348 MB/second (15% overhead)

x4 @ 12G / 8.125 x 85% = 5,022 MB/second (15% overhead)

x4 @ 16G / 8.125 x 85% = 6,696 MB/second (15% overhead)

Source:
https://www.vortez.net/articles_pages/samsung_960_pro_raid_review,7.html

It will be very interesting to see if NVMe SSDs
will match their clock rates with the 16G clock
planned for PCIe 4.0 chipsets.

In the meantime, NVMe could catch up with
the 12G clock now available with SAS drives.

As the numbers show above, the combination of
efficient controllers and faster clock rates
makes a big difference in sequential throughput.

February 12, 2017 | 06:04 AM - Posted by Anonymous (not verified)

Paul, I don't think that's what he means by controller not being up to par.

What he is saying is that the current implementation of the Optane controller, in this case the P4800X's, doesn't have enough parallel channels to saturate the PCI Express interface it's on.

Let's just assume that the DC P4800X controller has 16 channels. That means each 3D XPoint die is capable of achieving:

(2400MB/s read)/16 = 150MB/s
(2000MB/s write)/16 = 125MB/s

If they were to increase the channels in the controller by 50% to 24 channels, it would then be able to reach 3600MB/s read and 3000MB/s write.

If they doubled it to 32 channels, it would reach 4800MB/s read and 4000MB/s write right?

Actually, with PCI Express x4 it would be slightly under 4GB/s in both read and write, because that's the theoretical peak of the interface.

If they took the 32-channel controller and put it on an x8 interface, we'd be at 4800MB/s read and 4000MB/s write again. Does that mean the efficiency is 4.8/8, or 60%? No, it means again that the controller doesn't have enough channels.

That's what it means. Not that the controller is so horrible that it only does 39% of peak.
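The channel-scaling argument above can be sketched with the same assumed numbers (16 channels and 150MB/s per die are the commenter's guesses, not confirmed specs):

```python
# Channel scaling vs. link ceiling: aggregate read bandwidth grows
# with controller channels until the PCIe link becomes the limit.
# 16 channels and 150 MB/s per die are assumptions, not known specs.

PCIE3_X4_MB_S = 3938  # approximate theoretical x4 link ceiling

def drive_read_bw(channels: int, per_die_mb_s: float,
                  link_mb_s: float) -> float:
    """Aggregate read bandwidth, capped by the PCIe link."""
    return min(channels * per_die_mb_s, link_mb_s)

print(drive_read_bw(16, 150, PCIE3_X4_MB_S))  # → 2400
print(drive_read_bw(24, 150, PCIE3_X4_MB_S))  # → 3600
print(drive_read_bw(32, 150, PCIE3_X4_MB_S))  # → 3938 (link-limited)
```

The `min()` is the whole point: below the link ceiling the controller is the bottleneck, above it the bus is, and neither regime says anything about the media being "only 39% efficient".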

February 17, 2017 | 10:41 AM - Posted by Paul A. Mitchell (not verified)

Many thanks!

February 17, 2017 | 11:06 AM - Posted by Paul A. Mitchell (not verified)

You explained that very well in your recent podcast:

https://www.pcper.com/news/Editorial/Podcast-437-EVGA-iCX-Zen-Architectu...

0:35:03

Thanks!

February 11, 2017 | 08:09 AM - Posted by Master Chen (not verified)

Wake me up when a "500/512/525GB, etc" model will be available. Considering those queue depths and IOPS, along with the so-heavily-boasted-about level of XP's reliability, having 500GB of such space would pretty much equal my currently owned top-tier enterprise HDDs (Seagate's GODLIKE ST8000NM0055, six of them in a NAS, set in RAID 6) of 8TB capacity.

I'm planning to upgrade to Seagate's 16TB enterprise offerings sometime in the near future, when they become available, but if an Optane of 500GB or more becomes available before that, I'll probably build a new separate home server based strictly off of that, disregarding that it has way less space than my enterprise HDDs.

What matters most to me is reliability over anything else. I don't care much about write/read speeds or capacity; those are just added bonuses for me personally. I need insane reliability first; everything else comes second. This is because I mainly build long-term storage servers for myself, configurations that aren't accessed every day (actually roughly only once every 6 months). Reliability of the device itself and integrity of the information stored is the first priority to me, because I usually write something to my home server and then leave not just the house where it's stored but the country, for more than 4 months straight. I work overseas most of the time and rarely visit my home these days, so it's very important that all of the information I've stored is preserved in a perfect state by the time I return. TL;DR, whatever.

February 11, 2017 | 01:39 PM - Posted by Allyn Malventano

The answer to absolute reliability is redundancy, not Optane, and especially not for a home NAS. You are better off sticking with your RAID-6, but add a second with a mostly offline mirrored copy of everything and you'll be way more safe. It's what I do.

Also, Optane is going to run close to the cost of RAM, and won't even be available in high enough capacities to fit as much as you want in a single server for years to come.

February 11, 2017 | 02:36 PM - Posted by Anonymous (not verified)

Is that the current inflated cost of RAM, or will that reflect RAM prices after the new RAM processes reach production levels high enough for prices to return to their historical long-term norms? Also, what level of competition will there be from Micron's QuantX versus Intel's Optane brand of XPoint? What are Micron's current QuantX time-to-market estimates now that Intel's Optane has begun to ship?

In relative terms, how much denser is XPoint relative to DRAM in mm², and is XPoint going 3D also? I'm more interested in XPoint's durability figures (for XPoint itself) and how XPoint will be used on hybrid DRAM/XPoint DIMM modules, or even XPoint on the HBM2 stacks along with the DRAM die stacks for maybe a revised JEDEC HBM#/NVM/XPoint standard.

February 11, 2017 | 04:16 PM - Posted by Master Chen (not verified)

They said so far that it's "x21 times more durable than modern MLC", and that would've been just fine for me personally (that's quite a big jump in durability, in all honesty), or so I thought at first, but...I'm not so sure anymore. I'm one of those guys who writes a ton of stuff to a drive, then completely unplugs it and puts it in a safe/on a shelf for 4~6 months at the very least. For me personally, SSDs were a complete "no-go" so far simply due to the possibility of them leaking charge while I'm away from home; that's why I really hoped that XP could change all of that. Because I clearly remember reading/hearing back when XP was first introduced that XP's cells are a "write once - store forever" type of solution, never leaking at all even if completely unplugged for years. Looks like my hopes have gone down the drain...well, that sucks, if so.

February 11, 2017 | 04:08 PM - Posted by Master Chen (not verified)

What I've tried to say there is that, according to the info which was surfacing back when XP was first introduced as a concept, XP's cells (unlike current-day SLC/MLC/TLC etc) were supposed to store information forevermore, regardless of power outages and electric spikes. Basically, once you've written anything on XP, it stays there indefinitely, even if you unplug the drive completely and put it in a closet/safe/on a shelf, or something, and return to it many years later. Has that information been refuted by now, or does it still hold up? That was quite a while ago, so I don't remember 100% clearly, but wasn't it Intel themselves who were proclaiming such "absolutely nonvolatile data retention and reliability levels"? Was all of that just PR and marketing BS?

If my memory is right, current modern-day SLC (which is technically the best type of memory cell you can currently get in a solid state storage device) keeps written data in an absolutely perfect state for roughly ~18 months (in a "perfect environmental conditions" scenario) while the device is completely unplugged from any power source and stored away, and MLC holds it up to ~8 months tops. Either way, no matter how high-quality/"strong" a memory cell might be, it still leaks charge due to the inevitable degradation of the "cage" in which the charge is held. But I seriously thought that XP (again, according to the info from a while ago) wasn't supposed to leak anything at all. Write once - keep forevermore. Was all of that BS, and was I hoping for too much?

February 12, 2017 | 12:59 PM - Posted by BlackDove (not verified)

Bit flips happen from cosmic ray air showers, so you need to worry about properly implemented ECC and scrubbing to get rid of silent data corruption too.

Perhaps Allyn can weigh in, but I believe that there is quite a bit of ECC on any decent NAND SSD.

I would like to know how well silent errors are handled with consumer level SSDs, prosumer and enterprise.

I would actually really like a video about silent data corruption in modern SSDs and how bad normal DDR4 is compared to ECC DDR4.

February 12, 2017 | 03:51 PM - Posted by Master Chen (not verified)

Well OBVIOUSLY cosmic radiation and other natural factors take their toll, but that matters way less than simple memory cell discharge, which happens inevitably with the passage of time. After all, I'm not planning to go to space anytime soon, and Terra's/Gaia's magnetic field dissipates the absolute majority of harmful Sol radiation...considering that I'm not going to hang out anywhere near the ozone hole anytime soon, I think I'll be fine with regard to that specific set of parameters, lol.

February 12, 2017 | 12:55 PM - Posted by BlackDove (not verified)

Need rad hard and EMP hardened electronics, including the CPU, RAM and storage with mainframe design principles for insane reliability. Maybe triple modular redundancy with voting and silent error scrubbing too.

It could be done, but burning a DVD and putting it in a fireproof safe works too.

February 12, 2017 | 03:38 PM - Posted by Master Chen (not verified)

>Burning a DVD
Unless it's something akin to Millenniata's M-DISC (which is basically like a thin stone disc rather than a typical piece of plastic from which standard discs are made), "disc rot" would eventually happen nonetheless, regardless of it being a CD, DVD, Blu-ray or whatever else. If it's a piece of plastic, it rots with time due to oxidation at any temperature above freezing (so room temps, pretty much). Just like with VHS and cassette tape, it rots inevitably due to the surrounding atmosphere taking its toll. M-DISCs do not, simply due to them essentially being stone tablets that get information recorded onto them by making small punctures. Yes, you're reading that right: it's punched cards all over again, except this time it's done with a piece of stone rather than paper, and the punched holes themselves are way smaller.

February 11, 2017 | 02:51 PM - Posted by Paul A. Mitchell (not verified)

FYI: here's a nice CAD drawing of M.2 Optane:

http://www.legitreviews.com/wp-content/uploads/2017/01/Stoneybeach-Intel...

February 12, 2017 | 04:06 AM - Posted by Anonymous (not verified)

"To say the random performance makes leaves those parts in the dust is an understatement. 500,000+ IOPS is one thing, but doing so at lower QD's (where actual real-world enterprise usage actually sits) just makes this more of an embarrassment to NAND parts. "

Makes leaves...

February 13, 2017 | 10:30 PM - Posted by Paul A. Mitchell (not verified)

OK, an x4 NVMe PCIe 3.0 Add-In Card ("AIC")
should have exactly the same raw bandwidth
as a single NVMe PCIe 3.0 M.2 SSD:

x4 PCIe 3.0 lanes @ 8 GHz / 8.125 = 3,938.4 MB/second

So, can anybody explain to me why
the Samsung 960 Pro is already so far ahead
of the Optane specs "leaked" above?

I honestly feel as if I am missing something important,
but I can't put my finger on it.

Allyn wisely observes that all controllers
are not created equal: do controller differences
explain these big gaps that are being reported
in objective reviews?

I was of the (skewed?) opinion that Optane memory is
a technology that is essentially much faster
than all Nand Flash currently being manufactured
anywhere on the planet.

Was I wrong to trust Intel's projections?

Do I need to look beyond sequential READ speeds?

p.s. I won't take offense if any readers here
answer: YES, Paul, you were wrong.

Fire away: my skin is thick enough after falling
in love with computers in June 1971 :)

February 13, 2017 | 10:43 PM - Posted by Paul A. Mitchell (not verified)

Case in point: I just went browsing for a
recently reviewed NVMe M.2 SSD:

Corsair Force MP500 NVMe SSD review
http://jggmkz4ba1-flywheel.netdna-ssl.com/wp-content/images_reviews/revi...

MAX observed: 2,989.99 MB/second ATTO READ

February 14, 2017 | 03:32 PM - Posted by Paul A. Mitchell (not verified)

another "lower side" NVMe M.2 just announced:

http://www.legitreviews.com/wd-black-512gb-m-2-pcie-nvme-ssd-review_191242

[begin quote]

The WD Black PCIe SSDs have sequential read/write speeds of up to 2050MB/s read and up to 800MB/s write with the backing of a nice 5-year limited warranty. Those speeds are pretty impressive, but believe it or not they are on the lower side compared to many of the other NVMe drives on the market, so this is something we’d actually consider an entry-level PCIe NVMe SSD.

[end quote]

February 14, 2017 | 03:33 PM - Posted by Paul A. Mitchell (not verified)

http://www.legitreviews.com/wp-content/uploads/2017/02/atto-wd-black-nvm...

February 14, 2017 | 07:00 PM - Posted by Paul A. Mitchell (not verified)

:-]

https://www.youtube.com/watch?v=mNkve5XJlX0

confirming the DMI 3.0 bottleneck (again)

installing the correct NVMe driver helps too

February 17, 2017 | 10:39 AM - Posted by Paul A. Mitchell (not verified)

Allyn,

Can you do a quick check of my calculation here,
please?

Specs = Random 4kB IOPS up to 550K (READ)

550,000 IOPS @ 4,096 bytes = 2,252,800,000 bytes/second

Is that the correct way to compute random throughput?
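For reference, that is the right way to compute random throughput (assuming "4kB" means 4096 bytes, as spec sheets typically do); the product is bytes per second, about 2.25 GB/s:

```python
# Random throughput is just IOPS times transfer size.
# Assumes 4kB in the spec means 4096 bytes, per usual convention.

def random_throughput_mb_s(iops: int, block_bytes: int = 4096) -> float:
    """Sustained random throughput in MB/s."""
    return iops * block_bytes / 1e6

print(random_throughput_mb_s(550_000))  # → 2252.8
```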
