Leaked Intel Roadmap Details Upcoming Optane XPoint SSDs and Storage Accelerators

Subject: Storage | June 13, 2016 - 03:46 AM |
Tagged: XPoint, tlc, Stony Beach, ssd, pcie, Optane, NVMe, mlc, Mansion Beach, M.2, kaby lake, Intel, imft, Brighton Beach, 3DNAND, 3d nand

A recent post over at benchlife.info included a slide of some significant interest to those who have been drooling over XPoint technology:


For those unaware, XPoint (spoken 'cross-point') is a new type of storage technology that is persistent like NAND Flash but with speeds closer to that of RAM. Optane is Intel's brand name for devices implementing XPoint.

Starting at the bottom of the slide, we see a new 'System Acceleration' segment with a 'Stony Beach PCIe/NVMe m.2 System Accelerator'. This is likely a new take on Larson Creek, which was a 20GB SLC SSD launched in 2011. This small yet very fast SLC flash was tied into the storage subsystem via Intel's Rapid Storage Technology and acted as a caching tier for HDDs, which comprised most of the storage market at that time. Since Optane excels at random access, even a PCIe 3.0 x2 part could outmaneuver the fastest available NAND, meaning these new System Accelerators could act as a caching tier for Flash-based SSDs or even HDDs. These accelerators can also be good for boosting the performance of mobile products, potentially enabling the use of cheaper / lower performing Flash / HDD for bulk storage.


Skipping past the mainstream parts for now, enthusiasts can expect to see Brighton Beach and Mansion Beach, which are Optane SSDs linked via PCIe 3.0 x2 or x4, respectively. Not just accelerators, these products should have considerably more storage capacity, which may bring costs fairly high unless XPoint production is very efficient or there is also NAND Flash present on those parts for bulk storage (think XPoint cache for NAND Flash all in one product).

We're not sure if or how the recent delays to Kaby Lake will impact the other blocks on the above slide, but we do know that many of the other blocks present are on-track. The SSD 540s and 5400s were in fact announced in Q2, and are Intel's first shipping products using IMFT 3D NAND. Parts not yet seen announced are the Pro 6000p and 600p, which are long overdue m.2 SSDs that may compete against Samsung's 950 Pro. Do note that those are marked as TLC products (purple), though I suspect they may actually be a hybrid TLC+SLC cache solution.


Going further out on the timeline we naturally see refreshes to all of the Optane parts, but we also see the first mention of second-generation IMFT 3DNAND. As I hinted at in an article back in February, second-gen 3D NAND will very likely *double* the per-die capacity to 512Gbit (64GB) for MLC and 768Gbit (96GB) for TLC. While die counts will be cut in half for a given total SSD capacity, speed reductions will be partially mitigated by this flash having at least four planes per die (most previous flash was double-plane). A plane is an effective partitioning of flash within the die, with each section having its own buffer. Each plane can perform erase/program/read operations independently, and for operations where the Flash is more limiting than the interface (writes), doubling the number of planes also doubles the throughput. In short, doubling planes roughly negates the speed drop caused by halving the die count on an SSD (until you reach the point where controller-to-NAND channels become the bottleneck, of course).
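The die-count versus plane-count trade-off described above can be sketched with a toy model. The per-plane program rate and the channel ceiling below are made-up illustrative figures, not IMFT specifications, and the function name is hypothetical:

```python
def ssd_write_throughput(dies, planes_per_die, mb_s_per_plane, channel_limit_mb_s):
    """Toy model: write speed scales with dies x planes until the
    controller-to-NAND channels become the bottleneck."""
    flash_limited = dies * planes_per_die * mb_s_per_plane
    return min(flash_limited, channel_limit_mb_s)

# Assumed figures, for illustration only.
PER_PLANE_MB_S = 10.0        # program throughput per plane (hypothetical)
CHANNEL_LIMIT_MB_S = 1600.0  # aggregate channel ceiling (hypothetical)

# Same total SSD capacity: 16 smaller dual-plane dies vs 8 larger quad-plane dies.
gen1 = ssd_write_throughput(16, 2, PER_PLANE_MB_S, CHANNEL_LIMIT_MB_S)
gen2 = ssd_write_throughput(8, 4, PER_PLANE_MB_S, CHANNEL_LIMIT_MB_S)
print(gen1, gen2)
```

Under these assumptions both configurations land on the same flash-limited write speed, which is the "roughly negates" effect; once dies x planes exceeds what the channels can carry, the `min()` takes over.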


IMFT XPoint Die shot I caught at the Intel / Micron launch event.

Well, that's all I have for now. I'm excited to see that XPoint is making its way into consumer products (and Storage Accelerators) within the next year's time. I certainly look forward to testing these products, and I hope to show them running faster than they did back at that IDF demo...


June 13, 2016 | 05:21 AM - Posted by John H (not verified)

Allyn - does this point towards no dimm version of xpoint for KabyLake?

June 13, 2016 | 08:16 AM - Posted by CB (not verified)

Don't we still have an interface problem?

NVME flash is already close to saturating Pci-e x4. And I don't see a huge benefit with making DIMM versions of X-point. The speed would be fast, but capacity is limited to what, 128GB or 64GB on 8 and 4 dimm boards? And you'll have to give up memory capacity to use it.

We need Intel to make interface improvements along with X-point. A new chipset that does more than add 4 pci-e lanes is needed. Maybe special extra dimm slots that support extra large capacities?

June 13, 2016 | 10:15 AM - Posted by MRFS (not verified)

> Don't we still have an interface problem?

Yes!

At a minimum, future chipsets should adopt a policy
of "syncing" chipset clocks with storage transmission
clocks.

Thus, SATA-IV could have been oscillating
at a minimum 8 GHz by now -- to "sync" with PCIe 3.0.

Secondly, USB 3.1 set an excellent example by
also adopting a "jumbo frame" i.e. 128b/132b.

PCIe 3.0 adopted a 128b/130b jumbo frame
(130 bits / 16 bytes = 8.125 bits per byte).

We've been advocating "overclocked storage" for
a long time now, e.g. at least "pre-sets" like
8G, 12G and 16G, maybe with jumpers.

June 13, 2016 | 11:02 AM - Posted by John H (not verified)

Yes and No -- NVMe flash is close to saturating PCI-E x4 (32Gbps?) in sequential speeds, but still has headroom for smaller chunks of data (i.e. Random I/O or 4KB - the kind of stuff that makes the OS feel responsive).

6 Gbps SATA was saturated first at 550 MB/sec.. but IOPs went from 45K to about 100-110K before finally hitting its own limit on 6 Gbps.

XPoint though should saturate 4 lanes of PCI-E with even the small operations.. The DIMM version would allow for much higher bandwidth -- improving both sequential performance and also the smaller 4KB / IOPs (responsive) performance because of that max bandwidth increase.
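As a rough sanity check on those numbers, here is a minimal sketch of where the interface ceilings sit for 4KB transfers. It only accounts for line encoding (8b/10b for SATA, 128b/130b for PCIe 3.0) and ignores protocol and command overhead, so real devices top out lower; the function names are my own:

```python
def link_bytes_per_s(gt_per_s, lanes, payload_bits, encoded_bits):
    """Usable bytes/s on a serial link after line-encoding overhead."""
    return gt_per_s * 1e9 * lanes * payload_bits / encoded_bits / 8

def iops_ceiling(bytes_per_s, block_bytes=4096):
    """Upper bound on IOPS if every operation moves one block."""
    return bytes_per_s / block_bytes

sata3 = link_bytes_per_s(6, 1, 8, 10)        # 6 Gbps SATA, 8b/10b
pcie3_x4 = link_bytes_per_s(8, 4, 128, 130)  # PCIe 3.0 x4, 128b/130b

for name, bw in (("SATA 6 Gbps", sata3), ("PCIe 3.0 x4", pcie3_x4)):
    print(f"{name}: {bw / 1e6:.0f} MB/s, 4KB ceiling ~{iops_ceiling(bw):,.0f} IOPS")
```

The SATA ceiling comes out in the same neighborhood as the 100-110K IOPS that SATA SSDs eventually reached, while PCIe 3.0 x4 leaves close to a million 4KB IOPS of headroom.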

I'm sure Allyn can explain better than this..

June 13, 2016 | 11:28 AM - Posted by Allyn Malventano

John here has nailed it. Even with only a pair of PCIe 3.0 lanes, random access on Optane should easily outperform any flash-based stuff. Seven years ago, a PCIe 2.0 x1 RAM-based SSD was able to pull off nearly 300,000 IOPS in *single sector (512B) random reads*, and that was without NVMe and with things like pegging CPU threads happening on the other older hardware in the system. Don't underestimate the speed gains to be had from this tech, even over slower interfaces. Not having to do block erases of 6+ MB every time you write 4K (worst case) is a huge bonus.

The other issue at hand is that you have to have something to do with those IOs. Even with NVMe, a relatively simple workload can peg its application thread at 200k IOPS. Factor in that you actually have to *do* something with that data that's coming in or going out, and you realize that systems probably aren't going to hit the storage with that much total load. The key, however, is getting the latency of those IOs as low as possible, and that's where XPoint is going to clean house.

Switching over to DIMM-based is going to take a while. You're talking motherboard / UEFI / CPU level changes here to completely do it right, and making such a drastic change across so many components of a modern PC is going to take a while to implement. Combine that with what I said earlier and you'll realize there just isn't that much of a need to rush it. It's very telling when 2/3 of the first round Optane products are only on a PCIe 3.0 x2 link - Intel has obviously evaluated what bandwidth is really needed to see a benefit here, and if two lanes are good enough for large gains to be had, there's just no need to sacrifice a DRAM channel to that particular cause.

June 13, 2016 | 11:35 AM - Posted by Anonymous (not verified)

"Optane should easily outperform any flash-based stuff."

That's funny people still think Optane TM is not flash-based while this is only a marketing brand... :o)

June 13, 2016 | 12:03 PM - Posted by Allyn Malventano

Citation needed. You can't make a flash chip with byte-addressable writes and physically fit that much capacity onto a single die. Everything points to XPoint being a form of phase change memory, including the guy I know who worked on their development team.

June 13, 2016 | 12:19 PM - Posted by Anonymous (not verified)

Citation needed for what? Something that Intel never said officially?

Assuming Optane TM is PCM is pure speculation and not fact based.

Everything about Optane TM points to marketing, nothing more nothing less.

I know also some guys that know some other guys etc ;-)

June 13, 2016 | 01:16 PM - Posted by Fubar (not verified)

So we are not expecting "New 3D XPoint™ technology brings non-volatile memory speeds up to 1,000 times faster than NAND, the most popular non-volatile memory in the marketplace today." ?

https://newsroom.intel.com/news-releases/intel-and-micron-produce-breakt...

Or are you discussing about every Optane branded product instead of the ones that use Xpoint memory ?

June 13, 2016 | 02:28 PM - Posted by Allyn Malventano

Optane is Intel's XPoint brand. XPoint is not NAND. Optane SSDs could probably also include NAND (as I indicated in the above article), but NAND is not what Optane is all about.

June 13, 2016 | 04:43 PM - Posted by Anonymous (not verified)

"XPoint is not NAND"

Intel never communicated about what is (or not) XPoint technology (NAND, PCM, etc). That is a FACT! ;-)

June 13, 2016 | 06:04 PM - Posted by Allyn Malventano

Fine, but they also showed what it looked like, and that you can program and read from the same lines, which is *not* how Flash works. I used to teach electronics / electronics troubleshooting. I know just a little bit about this stuff...

June 13, 2016 | 05:08 PM - Posted by Anonymous (not verified)

Actually the fact that Intel's marketing used the word "SSD" for their new Optane TM products with XPoint technology suggests that the Optane TM range will likely include (2D and 3D) NAND based products. If Intel's XPoint technology were PCM based, I would expect them to give it a new device name like PCM Chip Drive (PCD).

That's my own opinion based on the current data available.

June 13, 2016 | 06:05 PM - Posted by Allyn Malventano

"Based on XPoint Technology"

SSD = Solid State Drive = no moving parts, which does not require that it be NAND.

June 13, 2016 | 06:25 PM - Posted by Anonymous (not verified)

Since Intel never clearly defined the XPoint technology (NAND Flash over PCI-E for example), this could be a NAND based technology. ;-)

Indeed, SSD doesn't necessarily imply NAND, but I expect Intel's marketing to be clever enough to distinguish a NAND based drive marketed as "SSD" from any potential PCM based drive in the mind of consumers.

After all why did Philips name their optical disc "Compact Disc" instead of "Record"?

June 14, 2016 | 01:02 PM - Posted by djayjp (not verified)

Intel has clearly and explicitly stated that XPoint *does not use transistors*, thus it is absolutely not NAND flash.

June 14, 2016 | 05:43 PM - Posted by Anonymous (not verified)

How could you make an array of PCM (or any other tech) bits without switching transistors? Do you really know how memory chips work?

I would be very curious to see your official source for that statement...

June 15, 2016 | 10:55 PM - Posted by Allyn Malventano

The transistors are at the edges of the array. Source = I have a background in electronics, and that's just how something like this would work. Still not NAND.

June 16, 2016 | 05:53 AM - Posted by Anonymous (not verified)

So there are transistors to address bits in the array, but you still can't prove the XPoint technology is not NAND based since Intel never explained how it works. Your electronics background won't change this fact anyway... ;-)

June 13, 2016 | 01:26 PM - Posted by MRFS (not verified)

Allyn, You are THE BEST expert to answer this
loaded question:

http://supremelaw.org/systems/nvme/HP.Z.Turbo.x16.version.jpg

Assuming we begin with HP's AOC with x16 edge connector
and 4 x NVMe M.2 slots, and assuming that an upgraded
PLX chip permits this AOC to support a full range of
bootable RAID modes, what would be your BEST EDUCATED GUESS
as to the storage density Optane provides for each
M.2 port on such an AOC?

Dell's initial design is very similar, differing
only with shorter "gum sticks":

http://supremelaw.org/systems/nvme/Dell.4x.M.2.PCIe.x16.version.jpg

(Somewhere inside Intel their engineers must already
be anticipating PCIe 4.0's 16GHz clock and 128b/130b
jumbo frame.)

If future AOCs' bandwidth is 31.5 GB/second -AND-
they support huge capacity increases e.g. 8TB total,
we need only factor in the aggregate controller
overheads to foresee storage performance increases
an order of magnitude better than what we have now.

Some people thought I was dreaming to foresee
non-volatile storage running at bandwidths
similar to DDR2-800 ramdisks.

Well, let's do the math:

DDR2-800 x 8 bytes = 6.4 GB per second ("PC2-6400")

Let's say aggregate Optane controller overhead
is a conservative 20% in RAID-0, after upgrading M.2 lanes
to 16GHz per lane.

Then, 31.5 GB/s x 0.80 = 25.2 GB/second
(4 x M.2 sticks in RAID-0 mode, PCIe 4.0)

The latter number -- 25.2 -- is already nearly FOUR TIMES the
raw bandwidth of DDR2-800 (25.2 / 6.4 = 3.94x).
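The same arithmetic as a short script, for anyone who wants to plug in other values. The 20% controller overhead is the guess stated above, not a measured figure, and the variable names are only for illustration:

```python
# PCIe 4.0: 16 GT/s per lane with 128b/130b encoding,
# i.e. 130 bits on the wire for every 16 bytes of payload.
PCIE4_GT_PER_S = 16.0
LANES = 16
WIRE_BITS_PER_BYTE = 130 / 16    # 8.125

RAID_OVERHEAD = 0.20             # assumed aggregate controller overhead

raw_gb_s = PCIE4_GT_PER_S * LANES / WIRE_BITS_PER_BYTE
usable_gb_s = raw_gb_s * (1 - RAID_OVERHEAD)

# DDR2-800: 800 MT/s x 8 bytes per transfer on one 64-bit channel.
ddr2_800_gb_s = 800e6 * 8 / 1e9

ratio = usable_gb_s / ddr2_800_gb_s
print(f"raw {raw_gb_s:.1f} GB/s, usable {usable_gb_s:.1f} GB/s, "
      f"{ratio:.2f}x single-channel DDR2-800")
```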

June 13, 2016 | 03:08 PM - Posted by John H (not verified)

M.2 SSDs are already faster than the main memory bus of the Pentium 3.. :)

Most DDR2 systems though were 128-bit (dual channel), so that's 12.8 GB/sec (DDR2-800 = PC2-6400, or 6.4GB/sec per channel - look at a diagram for Intel P35 chipset). But you are right overall.

Very interesting though that PCI-E x16 is that close to main memory speed.. PCI-E 3.0 x16 (16GB/sec) = ~ DDR3-1066 dual channel. PCI-E 4.0 x16 (32GB/sec) = ~ DDR4-2133 dual channel.. No wonder we don't see PCI-Express in x32 format anywhere..
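A quick way to line those figures up side by side, as a sketch using peak theoretical rates only (64-bit DRAM channels, 128b/130b PCIe encoding, no protocol overhead); the helper names are made up for this example:

```python
def dram_gb_s(mt_per_s, channels=1, bus_bytes=8):
    """Peak DRAM bandwidth in GB/s: transfers/s x 8 bytes per 64-bit channel."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

def pcie_gb_s(gt_per_s, lanes):
    """Peak PCIe 3.0/4.0 bandwidth in GB/s after 128b/130b encoding."""
    return gt_per_s * lanes * (128 / 130) / 8

rows = [
    ("DDR2-800 dual channel", dram_gb_s(800, channels=2)),
    ("PCIe 3.0 x16", pcie_gb_s(8, 16)),
    ("DDR3-1066 dual channel", dram_gb_s(1066, channels=2)),
    ("PCIe 4.0 x16", pcie_gb_s(16, 16)),
    ("DDR4-2133 dual channel", dram_gb_s(2133, channels=2)),
]
for name, gb_s in rows:
    print(f"{name:24s} {gb_s:6.2f} GB/s")
```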

June 13, 2016 | 04:54 PM - Posted by MRFS (not verified)

What comes to mind immediately is a product sold by
Highpoint which bundles 2 x RocketRAID 2720SGL controllers
called "RocketRaid 2720C2":

http://www.newegg.com/Product/Product.aspx?Item=9SIA67S46V0318&Tpk=9SIA6...

(See the one Customer Review, in particular.)

Thus, assuming x16 NVMe RAID controllers do happen
with full support for modern RAID modes,
would a pair of such controllers perform
in a manner similar to dual-channel DRAM?

I'm not totally sure on this point, but I do know
that PCI-Express was originally designed to overcome
the concurrency limit imposed by the old PCI bus:

only one device could be using that old PCI bus
at any given moment in time e.g. Gigabit Ethernet NIC
(32 bits @ 33 MHz).

There may be a certain amount of parallelism
to be achieved by configuring 2 such NVMe RAID controllers
in the same PCIe 3.0 / 4.0 motherboard.

If one can do that with 2 x Highpoint 2720SGL controllers,
it should be possible with two x16 NVMe RAID controllers.

June 13, 2016 | 04:38 PM - Posted by Allyn Malventano

Sorry to bust your bubble, but last I heard, the HP Z-Turbo does not support bootable RAID. You'd have to boot from one of the four devices and could use the other three for a software RAID once in the OS. 

PCIe 3.0 already uses 128b/130b frames, and it is unlikely we will see rapid adoption of 4.0 until folks run out of lanes / processing power to handle that many more IOPS.

 

June 13, 2016 | 04:57 PM - Posted by MRFS (not verified)

> HP Z-Turbo does not support bootable RAID

Forum users have also reported a "BIOS Lock"
which limits that card to certain HP workstations:

http://supremelaw.org/systems/nvme/HP.Z.Turbo.x16.version.jpg

The huge fan assembly also says something about
a need to overcome thermal throttling.

Thanks, Allyn!

June 13, 2016 | 06:08 PM - Posted by Allyn Malventano

I'm more inclined to think their fan is for the PCIe switch, as even with four m.2's next to each other, a simple heat spreader on that large of a housing would be sufficient for the 24W (6x4) dissipated by Samsung SSDs at max load.

June 13, 2016 | 06:48 PM - Posted by MRFS (not verified)

Good point. HP should modify that AOC
to allow the new Intel M.2 adapter to work with it.
See your review here:

http://www.pcper.com/news/Storage/Intel-Adds-M2-Adapter-Option-SSD-750-S...

June 13, 2016 | 09:44 AM - Posted by MRFS (not verified)

> A new chipset that does more than add 4 pci-e lanes is needed.

Indeed e.g. a new DMI link with x16 PCIe 4.0 lanes
will have an upstream bandwidth of 31.5 GB/second
(x16 @ 16GHz / 8.125 bits per byte).

and/or

an NVMe RAID controller with x16 edge connector
and four U.2 ports.

I'm assuming that HP, Dell and Kingston are making
progress with their PCIe cards w/ four M.2 slots:

http://supremelaw.org/systems/nvme/HP.Z.Turbo.x16.version.jpg
http://supremelaw.org/systems/nvme/Dell.4x.M.2.PCIe.x16.version.jpg

June 13, 2016 | 10:02 AM - Posted by MRFS (not verified)

Here's an article about Kingston's AOC:

http://www.tomsitpro.com/articles/kingston-e1000-ssd-nvme-liqid,1-3098.html

June 13, 2016 | 10:09 AM - Posted by MRFS (not verified)

Here's Intel's AOC with x16 edge connector and 4 x U.2 ports,
but it only works with a specific riser card
that only works in certain Intel motherboards:

http://supremelaw.org/systems/nvme/Intel.A2U44X25NVMEDK.jpg

June 13, 2016 | 09:49 AM - Posted by MRFS (not verified)

> And I don't see a huge benefit with making DIMM versions of X-point.

A triple-channel chipset could allocate one of those three
"banks" to Optane.

Then, by adding a "Format RAM" feature to the UEFI/BIOS,
the OS can be installed in persistent memory.

"Re-booting" takes on a whole new meaning when the
OS is already memory-resident, read "INSTANT ON".

June 13, 2016 | 11:07 AM - Posted by Anonymous (not verified)

Put XPoint on the DRAM module with its own background bus from DRAM to XPoint, for NVM write-through from the DRAM to XPoint for system crash recovery, or even for some extra memory for paging files on the DRAM/XPoint card, with the combined DRAM/XPoint controller filling paging requests directly from the XPoint NVM next to the DRAM. Each DRAM memory address could have a corresponding address mapped virtually to the XPoint on the same module, with the XPoint holding some multiple of extra storage for paging and even often-used applications.

And this could all be done in a background manner from the CPU with very little need for extra memory management activity on the part of the CPU/OS. So say each DRAM memory location gets its own backup to XPoint in the background and Each XPoint die having extra storage above the amount of actual DRAM available at a 10(XPoint) to 1(DRAM) or higher ratio with the OS having page tables that map into the XPoint for even faster paging responsiveness. Hell the Memory controller on the DRAM/XPoint combination memory could have its own ability to snoop the system BUS for page fault interrupts and be that much quicker to transfer the data/code from XPoint directly to the DRAM without the extra overhead from the OS having to handle the transfer request.

So XPoint NVM could become like a faster buffer for paging storage, and if the DRAM/XPoint memory were actually part of the same die or die stack then whole blocks could be transferred from XPoint to DRAM, for some future HBM-like solutions for HBM with its own NVM stores. Micron will be making XPoint based memory also, so imagine Micron making its own HBM2 with a little extra XPoint memory on the HBM2 stacks. I'm sure that JEDEC could extend the standard for adding NVM XPoint/other NVM to the HBM standards. So just think of some future GPU able to store all of the game's textures into some XPoint added to the HBM stacks and the GPU's memory controller able to manage the on HBM stack's NVM for lower latency VR gaming at 4k. Both Intel and Micron should try to add some XPoint enabled NVM storage right on the HBM die stack standard to the JEDEC standard for HBM with its own NVM storage capabilities.

June 13, 2016 | 11:31 AM - Posted by Allyn Malventano

These are all sorts of things that could happen down the road, but major changes take time. Had XPoint come about say 10 years ago, some of this might have already been happening.

June 13, 2016 | 12:47 PM - Posted by Anonymous (not verified)

Well then XPoint should be integrated faster into some of the Newer memory Standards like a JEDEC HBM to XPoint, for a by the memory block transfers, because there are going to be big AMD made HPC/Workstation/server APU's on an Interposer systems coming online over the next few years that will make use of HBM2/newer technologies. And U-Sam is throwing lots of exascale research dollars around. The potential for Technologies like XPoint from Micron and Intel to save on power usage for computing systems, especially for any systems using HBM2, stacked memory, is going to get U-Sam's exascale attention.

Micron especially could benefit by trying to integrate its XPoint NVM with an HBM2/modified JEDEC standard and trying to get with AMD's Exascale APU on an interposer module system/proposals for AMD's government funded exascale offerings. Those exascale systems are already delayed for excess power usage/metrics reasons using the older technologies, so those AMD APUs on an interposer package will have the highest available effective bandwidth at the lowest power usage owing to HBM's wide parallel interface and the potential to have a similar CPU to GPU wide parallel power saving interface etched into the interposer's silicon substrate.

Having a Bulk NVM Memory storage solution on the HBM memory stacks along with the HBM's DRAM directly addressable by the block via some wide parallel TSV types of connections directly from DRAM to NVM XPoint will save on power usage. Remember U-Sam is serious about funding and getting an exaflop capable computer system up and running and doing so under the exascale initiative's power usage regimen for exascale systems, so if the power savings potential of XPoint is there then the funding can be applied for to get XPoint integrated into the HBM standard much faster than the market would normally allow via the Government funding of R&D for its exascale program.

It's also not just for AMD's exascale offerings, but for IBM/Nvidia's and Others offerings also, as Nvidia's GPU accelerators will be using HBM2 also, so having a JEDEC standard that includes XPoint and other Bulk NVM Memory technologies on the HBM stack will get the power savings needed for the Nvidia/Power9 based exascale systems to meet the strict exascale total power usage guidelines. Those exaflop rated computing platforms are all going to be using in memory storage solutions to cut down on the need to constantly transfer terabytes of data so keeping data in localized non volatile storage pools close to where the data is needed can save tremendous amounts of power.

June 13, 2016 | 05:05 PM - Posted by Anonymous (not verified)

While X-point has possibly significantly higher durability than flash based solutions, you still wouldn't want to use it exactly like DRAM. DRAM doesn't implement any kind of wear leveling (it isn't needed) and it could be written thousands if not millions of times per second. Even if it can handle millions of write cycles, it could wear out rather quickly. I don't want such non-volatile devices even soldered on the main board, much less integrated into a very expensive graphics device. With memory stacking techniques we are going to have plenty of DRAM available. The non-volatile stuff needs to remain as secondary storage. With super fast secondary storage implementations, you can very quickly save and restore DRAM from secondary storage which should provide near instant on capabilities for mobile devices. They can also implement hybrid devices that can speed up loading a saved state back into DRAM by allowing a faster connection between devices on the same PCB.

For consumer level systems, it will be interesting to see what the architecture looks like. We haven't had a major change in memory topology in a long time. We have mostly just had evolutionary improvements in speed. Moving the memory controller directly onto the CPU die was the last major change really, and that was minor in comparison. We will be getting both HSA, which will unify graphics and system memory and HBM based APUs also. With even 8 GB of HBM2 (one stack) connected to an APU, the need for extra system memory is reduced significantly. HSA enabled devices will allow much more efficient use of memory capacity since you don't have to have two copies around anymore. This pushes off package memory out farther in the memory hierarchy. If the HBM acts as graphics memory and cache, then the off package DRAM starts to look more like secondary storage. You would access it in larger chunks, although possibly not as large as current page sizes for swapping to disk. I am thinking something like the m.2 form factor with PCI-e 4.0 which can have varying types of memory technology on board. With stacked DRAM, an m.2 sized device could hold a huge amount of memory. Such a standard could include DRAM, flash, X-point, or other future memory technologies.

June 13, 2016 | 05:49 PM - Posted by Anonymous (not verified)

"Even if it can handle millions of write cycles, it could wear out rather quickly."

Don't give good ideas to Intel! :-D

Brian Krzanich would smoke cuban cigar with Raja Koduri at Pattayal brothel...

June 13, 2016 | 07:03 PM - Posted by Anonymous (not verified)

No the XPoint on the HBM stack would be managed by a specialized controller in the bottom logic Die in addition to that which manages the HBM's DRAM, so the extra NVM logic would be there to manage the load/wear leveling in an NVM/XPoint friendly way, while also managing the DRAM in its specialized way. Any DRAM write-through to XPoint would be done in the background on the HBM DIE with DRAM to XPoint managed like a tiered storage solution, with algorithms to manage the actual movement of data to and from the HBM's DRAM die to the HBM's XPoint DIE/s on the stack using the HBM's controllers(bottom Die).

Look at what Samsung put on a single DIE, both NVM(NAND)/controller/DRAM cache all on one die. It would be better to engineer a wider data path solution to allow for entire blocks to be written from the HBM's DRAM dies stack to the HBM's NVM bulk XPoint memory with the XPoint's capacity set at many times the available HBM's DRAM size so there can be a level of over-provisioning and extra memory for paging files to be stored right on the same memory stack where the paging swaps can be made to the respective corresponding HBM DRAM. This could be handled by the HBM's controller under command from the OS for page table swaps with any DRAM write-through memory mirroring handled by the HBM controller/s without any need for OS intervention, in the background for system crash/fault tolerance purposes.

The goal is to use the XPoint on the HBM stack for its NVM ability and for XPoint's higher than NAND flash speeds with which to complement the HBM's DRAM with some in memory storage for much faster system responsiveness via having plenty of NVM storage right in the HBM memory stacks, or on a DRAM DIMM module, etc. There would be plenty of XPoint memory, at a greater than 10 to 1 ratio of XPoint to DRAM, for enough over-provisioning to maintain much more than 5 to 10 years of durability for most computing uses. And maybe an even greater than 10 to 1 XPoint/DRAM ratio for some very in-memory-heavy types of server workloads that may tax XPoint to a greater level.

The great thing about XPoint is that it is denser than DRAM, and with the right algorithms could save on the need for so much of the more costly DRAM HBM dies for some lower cost extra XPoint dies right on the HBM stacks to complement things in a NVM memory way. There are the exascale systems that will need to have some NVM memory right in the DRAM dies/HBM stacks to keep the data inter core/module transfers between CPU/other processors on separate core/modules to a minimum as the data sets are too large to be transferred over a network, so that In Memory NVM will be a great power saver, and time/latency saver for any system, consumer systems also(Think 4k/8k VR gaming)!

June 13, 2016 | 09:00 PM - Posted by Anonymous (not verified)

"Look at what Samsung put on a single DIE, both NVM(NAND)/controller/DRAM cache all on one die."

It isn't a single die, it is just a single package. The controller and DRAM cache are almost certainly on separate die. The Flash memory is definitely separate dies. I don't know if it actually uses TSV interconnect though. The number of connections to each flash die is actually quite small. Flash die have been stacked without using TSVs for a long time. They may have just added the controller and DRAM to the stack.

As for placing Non-volatile memory on an HBM die stack, I only see one reason to do it, and that is a small reduction in power consumption by having a wide, short run, low clock, and low voltage link. It takes more power to go off die with a narrow, long run, high clock, and higher voltage link. That isn't a very good reason though. We don't need that high of speed connection to the non-volatile storage. The whole system is designed to access secondary storage as little as possible. I think you underestimate how much it would be written in such a use case. That is one reason for integrating it, but quite a few reasons for not integrating it. I think a 32 GB HBM2 device will handle 4K gaming fine. By the time there is any installed base of 8k displays, we will probably have more than 32 GB available. The amount of memory per die is still going up.

For HPC and large scale compute, they will most likely use high speed links rather than trying to integrate non-volatile storage. Silicon interposer technology allows such links to be made on separate dies and placed on the interposer with a very high speed link to the rest of the die. Die stacking technology allows for mixing devices made on different process in a single stack. This allows the bottom logic die to be made on a logic optimized process while the actual DRAM die are made on a process optimized for DRAM. The silicon interposer also allows mixed die. This could allow for placing optical interface die (not easily made on standard CMOS) directly on the interposer. This will provide super low latency interconnect for building large scale systems. Such technology will most likely make integrating secondary storage onto the interposer unnecessary. There is a place for large X-point devices, but there are a lot of reasons why they probably shouldn't be integrated into the package. It could be very useful for holding large data sets or databases which need to be accessed fast, but will not fit in DRAM. The amount of DRAM on an HPC system can already be in the terabyte range, and that isn't even a large scale system. You are not going to want to integrate multiple terabytes of X-point onto the processor package.

For the generation beyond HBM2, AMD just says "scalable, next gen memory". I think this could actually be a distributed memory system where the GPUs are actually placed in the stack with the memory. The amount of connectivity possible in the stack is very large, probably larger than is available through the interposer. Current HBM2 die stacks are 92 square millimeters, so 4 of them will take up close to 400 square millimeters on the interposer. This places limits on the GPU(s) die sizes; interposer size can be increased, but it may not be cost effective. If you wanted 2 GPUs on an interposer, they would be limited to around 200 square millimeters each, and would be limited to 4 memory die total. For the generation beyond HBM2, they could place the GPU in the die stack for true 3D stacking. For thermal reasons, it would be best to place the GPU on top of the stack rather than on the bottom, unless some other method of cooling 3D chip stacks is created. The interposer would then be used for GPU-to-GPU (or APU-to-APU) interconnect rather than routing to separate memory stacks. A low-end model could just be a single stack.

June 13, 2016 | 09:54 PM - Posted by Anonymous (not verified)

An interposer bridge chip could connect one interposer module to another, and they could also make larger interposers. There will be no stacking of high-heat-generating processors, just memory like DRAM and HBM dies on the same stack. AMD wants to put some limited FPGA compute into the HBM stacks, so maybe adding some NVM to the HBM stacks as well would make for some interesting in-memory compute and NVM/XPoint storage. But with HBM2 memory in such short supply, maybe adding some NVM/XPoint dies on top of the HBM DRAM dies will make for a nice compromise until the HBM2 supply chain catches up. Texture storage would be a nice usage example for adding XPoint to the HBM stacks: with both DRAM and NVM/XPoint dies in the same stack, textures could be swapped over to the DRAM portion in whole blocks through TSV connections between the DRAM and XPoint dies, quickly enough that any latencies are properly hidden to keep things going smoothly.

Games could pre-stage an entire game's worth of textures in the NVM/XPoint part of the HBM stacks, ready to be transferred over by the memory block to the DRAM part of the HBM, and probably get the job done with a smaller total amount of DRAM until the supplies of DRAM/HBM can be assured. Even afterwards, having the NVM capacity right on the HBM stacks will allow for a lot of interesting processing if there is also some programmable FPGA compute on the stacks to do in-memory compute from the NVM or DRAM dies.

The available HBM2 stock is going to go towards the high-margin HPC/workstation/server SKUs simply because of the extra money that market can pay, so maybe the consumer SKUs can get some of that lower cost NVM/XPoint on the HBM stacks in place of the more costly DRAM. Both markets will want the extra large NVM/XPoint storage on the HBM stacks for the other obvious advantages of having volatile DRAM right next to a larger NVM/XPoint data/code store on the HBM stack, DIMM module, etc.

The JEDEC standard could probably use an addition covering NVM in HBM. There will definitely be a need for some form of stacked HBM DRAM/NVM (XPoint, other) JEDEC standard that allows whole blocks to be transferred in a single write cycle between the DRAM and NVM/XPoint dies on the same stack, for uses like rapid texture transfer storage for GPU/APU type systems on an interposer in the consumer markets, and in the HPC/workstation/server markets also.

June 13, 2016 | 02:06 PM - Posted by BlackDove (not verified)

Skylake Purley CPUs will support Optane XPoint DIMMs and have built in Omnipath fabric.

And why would Micron make HBM when they went to such lengths to develop HMC?

June 13, 2016 | 02:48 PM - Posted by Anonymous (not verified)

Because of the market for HBM. HMC is nowhere to be found in as large a usage as HBM, what with everybody going for the JEDEC standard HBM way of stacked memory. Micron is mainly in the memory business, so they need to go where the money is, and HBM will eventually be everywhere for AMD's APU based systems. Micron needs to get its XPoint out there because Intel will jump on that action too if there is enough potential to make money. HBM is already in use, even more so with HBM2 for both AMD and Nvidia, and others will be coming online with HBM based systems on an interposer! It's not restricted to just AMD's APUs on an interposer; there will be others making SOIs (systems on an interposer) with HBM. NVM (XPoint/others) mixed in with or integrated into DRAM/HBM stacks is going to be a killer hardware application for the HPC/server/workstation markets and for consumer APUs/SOCs (SOIs)/GPUs, because it can be located with any type of DRAM for lower latency in-memory NVM storage. Just you wait for that market and see; it will become popular very quickly.

June 13, 2016 | 07:56 PM - Posted by Anonymous (not verified)

Don't know if Micron will make HBM, but it is a JEDEC standard so I guess it is possible. HBM and HMC don't really compete directly with each other anyway. To match HBM bandwidth, you would need a huge number of HMC channels, which would be too expensive for consumer products. HMC is a more expensive solution overall, but it can provide much larger capacity compared to HBM. HMC supports chaining many stacks, so it can reach very large capacity, but it would also be incredibly expensive to implement such a solution. Such large capacities are not needed in the consumer market, but can be useful in HPC where they may need fast access to very large data sets.

I see enthusiasts talking about HMC, but it is not suited to the consumer market. You could replace system memory with a few HMC channels, but as far as I know, HMC is not meant to run through a socket, so it would be limited to soldered-on solutions. It might be competitive against GDDR5, but it does not compete with HBM. HBM can provide lower power and significantly more bandwidth. With HBM APUs, "system memory" may be a thing of the past anyway. If we have an 8 or 16 GB APU with fast non-volatile storage (several GB/s) for swap, then do we need any extra off-package DRAM? Probably not for most people. For those that need more, they could implement essentially a RAM disk with some stacks of DRAM on an m.2 size device. If the memory is just swap space, then it doesn't need to be that fast.

June 13, 2016 | 11:05 AM - Posted by Anonymous (not verified)

Too much blah blah in the tech industry. The tech bubble is about to blow very soon... a first step being Microsoft overpaying 26.2 billion dollars for LinkedIn. Did someone tell them that's only a web site with no revenue? Even worse, it had a loss of 166 million dollars for the last fiscal year. That's the result of too much liquidity in the stock market thanks to the US Federal Reserve's quantitative easing and stupid shareholders.

June 13, 2016 | 02:15 PM - Posted by MRFS (not verified)

Allyn,

Almost forgot to say:

THANKS! for staying on top of this story.

June 13, 2016 | 03:14 PM - Posted by Anonymous (not verified)

Hot News!

"AMD will showcase Zen running Doom at E3’s PC Gaming Show"

http://www.extremetech.com/gaming/230127-amd-will-showcase-zen-running-d...

June 14, 2016 | 09:56 PM - Posted by MRFS (not verified)

If anyone is interested, here is a summary of our two
primary workstations:

12GB ramdisk on both (RamDisk Plus software)
4GB for Windows 32-bit version on both
----
16GB DRAM installed in both

OS1 on RAID-0 of 4 x Samsung 840 Pro SSDs
OS2 on RAID-0 of 4 x SanDisk Extreme Pro SSDs

Both RAID arrays have 2 partitions:
C: for Windows + Application Software e.g. MS Office
E: for databases

Browser caches were moved to E: e.g. Firefox, Chrome etc.

Both ramdisks are saved at SHUTDOWN
and restored at STARTUP to/from E: .

Both CPUs are Intel quad-cores.

This build does take some time to STARTUP
and SHUTDOWN, because the system must
READ the ramdisk's contents from E:,
WRITE those contents to the ramdisk at STARTUP,
and reverse those steps at SHUTDOWN.

Next build will require at least 32GB of DRAM
because the 12GB ramdisk is almost full,
and we recently increased it to 13GB
with no ill effects on the remaining 3GB
available to Windows 32-bit.

p.s. A non-volatile DRAM subsystem should
eliminate the need to restore the ramdisk
at each STARTUP and to save the ramdisk
at each SHUTDOWN, hence our interest in
Intel's claims and progress with Optane.
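The memory split described in this post can be checked with a few lines of Python. This is only a simple model that treats whatever DRAM the ramdisk does not occupy as left over for Windows, which matches the figures quoted above; the function name is made up for illustration.

```python
# Simple model of the DRAM split quoted above: installed DRAM minus the
# ramdisk is what remains for Windows. The function name is illustrative.
INSTALLED_DRAM_GB = 16


def left_for_windows_gb(ramdisk_gb, installed_gb=INSTALLED_DRAM_GB):
    """Return the DRAM (in GB) not occupied by the ramdisk."""
    if ramdisk_gb > installed_gb:
        raise ValueError("ramdisk is larger than the installed DRAM")
    return installed_gb - ramdisk_gb


for ramdisk_gb in (12, 13):
    print(ramdisk_gb, "GB ramdisk leaves",
          left_for_windows_gb(ramdisk_gb), "GB for Windows")
```

The 12GB and 13GB cases give the 4GB and 3GB mentioned in the post.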

June 15, 2016 | 08:15 AM - Posted by Anonymous (not verified)

Are you expecting people to buy Optane(TM) as a luxury browser cache? That is completely insane!

However, it could be worth using as a file cache on servers, like any other non-volatile memory.

June 15, 2016 | 10:46 AM - Posted by MRFS (not verified)

No. I would prefer to use it primarily in 2 ways
in our workstations:

(a) a memory-resident operating system
that does not need "booting"; and,
(b) memory-resident databases
that do not need "loading" from storage.

For the browser(s) that I use most often,
I'd keep their caches in a modern ramdisk.

For desktop search engines like COPERNIC,
both their databases and index files should
do well if stored in an Optane subsystem.

(We don't design servers here, so I am at best
a spectator when it comes to server design.)

Right now, we "mirror" our 12GB database
by keeping one identical copy in a ramdisk
and one identical copy in a RAID-0 array of SSDs:
XCOPY makes that very easy, using a batch file:

xcopy folder R:\folder /s/e/v/d
R:
xcopy folder E:\folder /s/e/v/d
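For readers not on Windows, the batch file above can be approximated in Python. With no date given, xcopy's /d switch copies only files whose source is newer than the destination, and /s with /e walks subfolders including empty ones. The sketch below mimics just that behavior; it does not reproduce the /v verification step, and the function name is hypothetical.

```python
import os
import shutil


def mirror_newer(src, dst):
    """Copy files from src to dst when the destination copy is missing or older.

    Returns the sorted relative paths of the files that were copied.
    Approximates `xcopy src dst /s /e /d`; there is no /v style verification.
    """
    copied = []
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target_dir = dst if rel == "." else os.path.join(dst, rel)
        # Create every folder, including empty ones (the /e behavior).
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(target_dir, name)
            # Copy only if the destination is missing or older (the /d behavior).
            if not os.path.exists(d) or os.path.getmtime(s) > os.path.getmtime(d):
                shutil.copy2(s, d)  # copy2 also preserves the modification time
                copied.append(os.path.normpath(os.path.join(rel, name)))
    return sorted(copied)
```

Calling `mirror_newer("folder", r"R:\folder")` and then the reverse direction toward `E:\folder` would correspond to the two xcopy lines; the drive letters are the ones from the post.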

August 28, 2016 | 09:39 AM - Posted by alexboyd

These are giant storage devices. Configuring them requires some experts. Currently, I'm running this ( http://www.spectra.com/netapp/ ), a medium-size unit that works well with multiple systems, which is all I need. But this one is absolutely impressive.