Intel's Optane DC Persistent Memory DIMMs Push Latency Closer to DRAM

Subject: Storage | December 12, 2018 - 09:17 AM |
Tagged: ssd, Optane, Intel, DIMM, 3D XPoint

Intel's architecture day press release contains the following storage goodness mixed within all of the talk about 3D chip packaging:

Memory and Storage: Intel discussed updates on Intel® Optane™ technology and the products based upon that technology. Intel® Optane™ DC persistent memory is a new product that converges memory-like performance with the data persistence and large capacity of storage. The revolutionary technology brings more data closer to the CPU for faster processing of bigger data sets like those used in AI and large databases. Its large capacity and data persistence reduces the need to make time-consuming trips to storage, which can improve workload performance. Intel Optane DC persistent memory delivers cache line (64B) reads to the CPU. On average, the average idle read latency with Optane persistent memory is expected to be about 350 nanoseconds when applications direct the read operation to Optane persistent memory, or when the requested data is not cached in DRAM. For scale, an Optane DC SSD has an average idle read latency of about 10,000 nanoseconds (10 microseconds), a remarkable improvement. In cases where requested data is in DRAM, either cached by the CPU’s memory controller or directed by the application, memory sub-system responsiveness is expected to be identical to DRAM (<100 nanoseconds).
The company also showed how SSDs based on Intel’s 1 Terabit QLC NAND die move more bulk data from HDDs to SSDs, allowing faster access to that data.

Did you catch that? 3D XPoint memory in DIMM form factor is expected to have an access latency of 350 nanoseconds! That's down from 10 microseconds of the PCIe-based Optane products like Optane Memory and the P4800X. I realize those are just numbers, and showing a nearly 30x latency improvement may be easier visually, so here:

[Chart: "Bridging the Gap" latency comparison, with Optane DC persistent memory added in purple]

Above is an edit to my Bridging the Gap chart from the P4800X review, showing where this new tech would fall in purple. That's all we have to go on for now, but these are certainly exciting times. Consider that non-volatile storage latencies have improved by nearly 100,000x over the last decade, and are now within striking distance (less than 10x) of DRAM! Before you get too excited, realize that Optane DIMMs will be showing up in enterprise servers first, as they require specialized configurations to treat DIMM slots as persistent storage instead of DRAM. That said, I'm sure the tech will eventually trickle down to desktops in some form or fashion. If you're hungry for more details on what makes 3D XPoint tick, check out how 3D XPoint works in my prior article.
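For anyone who wants to check the ratios, here is a quick sketch using only the figures quoted in the press release. The DRAM value is the 100 ns upper bound Intel cites, so the real gap to DRAM is at least what this prints. The function and constant names are my own.

```python
# Latency figures quoted in the press release, in nanoseconds.
OPTANE_SSD_NS = 10_000   # Optane DC SSD, average idle read latency
OPTANE_DIMM_NS = 350     # Optane DC persistent memory, expected idle read latency
DRAM_NS = 100            # upper bound quoted for DRAM (<100 ns)


def speedup(slower_ns: float, faster_ns: float) -> float:
    """Return how many times lower the faster latency is."""
    return slower_ns / faster_ns


ssd_to_dimm = speedup(OPTANE_SSD_NS, OPTANE_DIMM_NS)
dimm_to_dram = speedup(OPTANE_DIMM_NS, DRAM_NS)

print(f"Optane SSD -> Optane DIMM: {ssd_to_dimm:.1f}x lower latency")
print(f"Optane DIMM vs DRAM: at least {dimm_to_dram:.1f}x slower")
```

The first ratio is where the "nearly 30x" figure comes from. The second stays under 10x as long as real DRAM latency is above 35 ns.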

December 12, 2018 | 06:38 PM - Posted by Paul A. Mitchell (not verified)


Can we anticipate a capability to do a
fresh OS install to a region of NVDIMMs
like the ones you describe?

Such a region seems like an ideal place
to host an OS e.g. by enhancing a BIOS/UEFI
subsystem to support a "Format RAM" option.

BEST WAY is to implement this option so that
it's entirely transparent to Windows install logic:
Windows install would NOT even know that the
target C: partition is a set of Non-Volatile DIMMs.

Perhaps a group like JEDEC would consider
setting standards for hosting any OS
in such an NVDIMM "region".

December 13, 2018 | 02:02 PM - Posted by Allyn Malventano

I suspect Optane DIMMs (in current enterprise form) are meant to be managed by the software layer that is accessing them, similar to SMR HDDs. I suppose it would be possible to create a block storage device driver and then install an OS to it, but that's not the intended use case at present, and OS hosting duties are served reasonably well on modern NAND / Optane PCIe devices. We're already approaching diminishing performance returns at those levels anyway. Taking real advantage of what you describe would require a complete restructuring of how an OS accesses storage. Otherwise, you are just throwing away much of the latency benefit in the translation to relatively high-overhead methods of accessing the media.

December 13, 2018 | 05:47 PM - Posted by Jim Reitz (not verified)

"Taking real advantage of what you describe would require a complete restructuring of how an OS accesses storage"

Agreed. That's probably also evident from the latency percentile graph above, which shows RamDisk latency significantly worse than DRAM. Since RamDisks use DRAM, the gap shows the extra latency involved in doing storage I/O versus just reading main memory.
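To make the two access models concrete, here is a rough sketch in Python. It uses an ordinary temporary file, not real persistent memory, and it measures nothing. It only contrasts a block-style read (fetch a whole 4 KiB block through a read call, then slice out 64 bytes) with a memory-style read (index straight into a mapped region). The 4 KiB block size and all function names are assumptions for illustration.

```python
import mmap
import tempfile
from typing import BinaryIO

BLOCK_SIZE = 4096   # assumed block-device transfer unit
CACHE_LINE = 64     # cache line size quoted in the press release


def read_via_block_io(f: BinaryIO, offset: int, length: int) -> tuple[bytes, int]:
    """Block-style read: fetch the whole enclosing block, then slice it.

    Assumes the request does not straddle a block boundary.
    Returns the requested bytes and the number of bytes transferred.
    """
    block_start = (offset // BLOCK_SIZE) * BLOCK_SIZE
    f.seek(block_start)
    block = f.read(BLOCK_SIZE)          # goes through the file I/O path
    start = offset - block_start
    return block[start:start + length], len(block)


def read_via_mapping(mapping: mmap.mmap, offset: int, length: int) -> bytes:
    """Memory-style read: index directly into the mapped region."""
    return mapping[offset:offset + length]


def demo() -> tuple[bytes, bytes, int]:
    payload = bytes(range(256)) * 64    # 16 KiB of sample data
    with tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()
        offset = 5 * CACHE_LINE         # arbitrary cache-line-aligned offset
        via_block, transferred = read_via_block_io(f, offset, CACHE_LINE)
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            via_map = read_via_mapping(m, offset, CACHE_LINE)
    return via_block, via_map, transferred


via_block, via_map, transferred = demo()
print(f"same data: {via_block == via_map}")
print(f"block path moved {transferred} bytes to deliver {CACHE_LINE}")
```

Both paths return the same 64 bytes, but the block path moves a full 4096-byte block to get them. On real persistent memory the mapped path would presumably be a direct-access (DAX) mapping, which is the kind of restructuring being discussed.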

December 15, 2018 | 01:45 PM - Posted by Paul A. Mitchell (not verified)

How about this sequence:

(1) configure a ramdisk in the uppermost 64GB of DRAM;
(2) run "Migrate OS" using Partition Wizard;
(3) re-boot into the migrated OS resident in the ramdisk.

The only other BIG changes are modifications
to the motherboard BIOS/UEFI subsystem,
to detect and boot from this ramdisk OS; and,
a general purpose device driver like the
one that supports RamDisk Plus from
(my favorite, as you know).

Wendell describes 128GB in a Threadripper system here:

The original Provisional Patent Application assumed
volatile DRAM, which had its own special problems
of course e.g. at SHUTDOWN.

It seems to me that the availability of non-volatile
memory on a DDR4 bus is the really BIG CHANGE
that obtains with Optane DIMMs.

MAX HEADROOM with 4 x M.2 SSDs in RAID-0
installed on the ASRock Ultra Quad AIC is 15,753.6 MBps.
By comparison, DDR4-3200 x 8 = 25,600 MBps, and
even faster DDR4 have been announced by G.SKILL et al.
DDR4-4000 X 8 = 32,000 MBps.
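The figures above can be reproduced with a short sketch. It assumes the 15,753.6 MBps number is the theoretical PCIe 3.0 x16 ceiling (8 GT/s per lane, 128b/130b encoding) and that the DDR4 numbers are peak rates for a single 64-bit channel. The function names are made up for illustration.

```python
def pcie3_bandwidth_mbps(lanes: int) -> float:
    """Peak PCIe 3.0 throughput in MB/s: 8 GT/s per lane, 128b/130b encoding."""
    per_lane = 8_000 * 128 / 130 / 8    # Mb/s after encoding overhead, then MB/s
    return per_lane * lanes


def ddr4_bandwidth_mbps(transfers_mt_s: int, bus_bytes: int = 8) -> int:
    """Peak DDR4 throughput in MB/s for one 64-bit (8-byte) channel."""
    return transfers_mt_s * bus_bytes


pcie_x16 = pcie3_bandwidth_mbps(16)
ddr4_3200 = ddr4_bandwidth_mbps(3200)
ddr4_4000 = ddr4_bandwidth_mbps(4000)

print(f"PCIe 3.0 x16: {pcie_x16:,.1f} MB/s")
print(f"DDR4-3200:    {ddr4_3200:,} MB/s per channel")
print(f"DDR4-4000:    {ddr4_4000:,} MB/s per channel")
```

The unrounded PCIe result is about 15,753.8 MB/s. Rounding the per-lane rate to 984.6 MB/s before multiplying by 16 gives the 15,753.6 quoted above. The DDR4 numbers are per channel, so a multi-channel platform would scale them accordingly.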

However, I don't really see the need for a
"complete restructuring of how an OS accesses storage".

Honestly, without doing the appropriate experiments,
I suspect that the latter characterization may be
closer to a "straw man".

Commercial device drivers are already available
in software like DATARAM:

In the interests of computer science (if nothing else),
I would certainly like to see this experiment
performed on a TR system like Wendell's.

Of course, we'd need to have access to the BIOS/UEFI
code, in order to compile and flash an experimental version
that recognizes the new location of the bootstrap loader, etc.

Maybe we could submit a proposal to Ryan after he
starts working at Intel; it certainly has the
resources necessary to do this experiment. And,
such an experiment seems to fit his job description.

If not Intel, then maybe AMD?

Thanks for listening! /s/ Paul

December 15, 2018 | 02:18 PM - Posted by Paul A. Mitchell (not verified)

On a TR system with LOTSA DDR4, LOTSA possibilities
come to my mind e.g.:

(a) it may be possible to dedicate one or more CPU cores
to the ramdisk device driver: that way, the raw code
would "migrate" into each core's internal caches
for extra computational speed;

(b) starting with 128GB, the lowest 4GB of DRAM addresses
could serve as "initialization" RAM; after doing the
"Migrate OS" step, the entire 124GB remaining could be
formatted as a single C: partition (which is a very
common practice on many PCs);

(c) after the "Migrate OS" step is completed successfully,
there are 2 copies of the OS i.e. "mirrored" --
one of which is hosted on conventional Nand Flash SSDs and
one of which is hosted in the ramdisk C: partition;

(d) if/when the ramdisk OS develops problems e.g. virus,
then simply boot from the conventional SSDs and restore a
valid drive image to the ramdisk C: partition.

Our main workstation now has 4 copies of our Windows OS,
restored to the primary NTFS partition on 4 different drives.
With this setup, it has been very easy to
change the boot device in the BIOS whenever we need
to restore a drive image to the main C: partition.

We developed the latter setup chiefly because
the CD-ROM software for restoring a drive image
is terribly slow and time-consuming.

December 15, 2018 | 02:43 PM - Posted by Paul A. Mitchell (not verified)

Re: (b) starting with 128GB, the lowest 4GB of DRAM addresses
could serve as "initialization" RAM;

That is an extreme case:
the amount of conventional RAM
that is NOT assigned to the ramdisk
is a design decision dictated
by the intended use case(s).

Clearly, the amount of RAM
assigned to the ramdisk and
the amount of RAM NOT assigned
to the ramdisk are in a
"zero sum" relationship.

December 15, 2018 | 04:16 PM - Posted by Paul A. Mitchell (not verified)

FYI: a summary page published by computer scientists
at North Carolina State University, Computer Science Dept.:

December 15, 2018 | 04:44 PM - Posted by Paul A. Mitchell (not verified)

This research article is a little dated, but
it does cover a number of related issues with NVM:

December 18, 2018 | 08:10 AM - Posted by Wes Baggerly (not verified)

I believe Microsoft has designed Server 2019 to use these as Cache Drives for Storage Spaces Direct. They called it Persistent Memory or NVDIMM-N. And they talked about it at this year's Ignite.

December 19, 2018 | 04:52 PM - Posted by Paul A. Mitchell (not verified)

Allyn, Your chart shows DRAM ~17M IOPS.
Off the top of your head, how much does
that latency measure vary, in your experience?
The blue line appears to be rather constant
i.e. no tapering off at the top of that chart
(between 90% and 100%: compare the RAMDISK
green line).

December 20, 2018 | 02:54 PM - Posted by Allyn Malventano

The '~' figures are approximated based on the documented latency of those parts to the far left of the chart. I was basing this off of the number of clock cycles to access various levels of cache, etc.
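As a rough illustration of how a documented latency maps to an IOPS figure, here is a sketch that assumes one request in flight at a time, so IOPS is simply the reciprocal of latency. That queue-depth-1 relationship is an assumption for illustration, not a statement of how the chart was actually built, and the function names are hypothetical.

```python
def iops_from_latency(latency_ns: float) -> float:
    """Operations per second with one request in flight at a time."""
    return 1e9 / latency_ns


def latency_from_iops(iops: float) -> float:
    """Per-operation latency in nanoseconds implied by a QD1 IOPS figure."""
    return 1e9 / iops


dram_latency_ns = latency_from_iops(17e6)   # the ~17M IOPS DRAM figure
optane_dimm_iops = iops_from_latency(350)   # 350 ns from the press release
optane_ssd_iops = iops_from_latency(10_000) # 10 microseconds

print(f"~17M IOPS implies about {dram_latency_ns:.1f} ns per access")
print(f"350 ns implies about {optane_dimm_iops / 1e6:.2f}M IOPS")
print(f"10 us implies about {optane_ssd_iops:,.0f} IOPS")
```

Under that assumption, ~17M IOPS corresponds to roughly 59 ns per access, which sits under the <100 ns DRAM figure in the press release.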
