AMD Introduces Radeon Pro SSG: A Professional GPU Paired With Low Latency Flash Storage (Updated)

Subject: Graphics Cards | July 27, 2016 - 01:56 AM |
Tagged: solid state, radeon pro, Polaris, gpgpu, amd

UPDATE (July 27th, 1am ET): More information on the Radeon Pro SSG has surfaced since the original article. According to AnandTech, the prototype graphics card actually uses an AMD Fiji GPU. The Fiji GPU is paired with onboard PCI-E-based storage using the same PEX8747 bridge chip found in the Radeon Pro Duo. Storage is handled by two PCI-E 3.0 x4 M.2 slots that can accommodate up to 1TB of NAND flash storage. As I mentioned below, having the storage on board the graphics card vastly reduces latency by cutting the number of hops and avoiding requests out to the rest of the system. AMD had more numbers to share following their demo, however.

From the 8K video editing demo, the dual Samsung 950 Pro PCI-E SSDs (in RAID 0) on board the Radeon Pro SSG hit 4GB/s while scrubbing through the video. The same video source stored on a Samsung 950 Pro attached to the motherboard managed only 900MB/s. In theory, reaching out to system RAM still has raw throughput advantages (DDR4 @ 3200 MHz on a Haswell-E platform is theoretically capable of 62 GB/s reads and 47 GB/s writes), though that would be bottlenecked by the graphics card having to go over the PCI-E 3.0 x16 link and its maximum of 15.754 GB/s. Of course, if you can hold the data in (much smaller) GDDR5 (300+ GB/s depending on clocks and memory bus width) or HBM (1TB/s) and not go out to any other storage tier at all, that is ideal, but it is not always feasible, especially in the HPC world.
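For readers who want to check that 15.754 GB/s figure, here is a quick back-of-the-envelope calculation; the only assumption beyond the article's numbers is PCIe 3.0's 8 GT/s signaling with 128b/130b encoding:

```python
# Sanity check of the PCIe 3.0 bandwidth figures quoted above.
# PCIe 3.0 runs at 8 GT/s per lane with 128b/130b line encoding.
GT_PER_LANE = 8e9          # transfers per second per lane
ENCODING = 128 / 130       # 128b/130b encoding efficiency

# Full x16 slot (GPU to system):
pcie_x16_bytes = GT_PER_LANE * ENCODING * 16 / 8   # bits -> bytes
print(f"PCIe 3.0 x16: {pcie_x16_bytes / 1e9:.3f} GB/s")   # ~15.754 GB/s

# Two onboard M.2 drives at x4 each (the SSG prototype's storage):
pcie_x8_bytes = GT_PER_LANE * ENCODING * 8 / 8
print(f"2x PCIe 3.0 x4: {pcie_x8_bytes / 1e9:.3f} GB/s")  # ~7.877 GB/s
```

So the onboard storage link is narrower than the x16 slot; the win AMD is claiming is latency and locality, not raw width.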

However, having storage on the same board as the GPU, only a single "hop" away, vastly reduces latency and offers much more total capacity than most systems have in DDR3 or DDR4. In essence, the solid state storage on the graphics card (which developers will need to specifically code for) acts as a massive cache for streaming in assets for data sets and workloads that are highly impacted by latency. This storage is not the fastest, but it is the next best thing for holding active data outside of GDDR5/X or HBM. For throughput-intensive workloads, reaching out to system RAM will be better. Finally, reaching out to system-attached storage should be the last resort, as it will be the slowest and highest latency. Several commenters mentioned using a PCI-E based SSD in a second slot on the motherboard, accessed much like GPUs in CrossFire communicate now (DMA over the PCI-E bus), which is an interesting idea that I had not considered.
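To make the tiering idea concrete, here is a hypothetical sketch of how an application might pick a storage tier for a latency-sensitive working set. The throughput and capacity numbers come from the article where given; the tier ordering, capacities, and the `best_tier` helper are illustrative assumptions, not AMD's API:

```python
# Storage tiers in the article's preference order for latency-sensitive
# work (note system RAM beats the SSG SSD on raw throughput but sits an
# extra hop away). Capacities are illustrative assumptions.
tiers = [
    # (name,                      approx GB/s, capacity GB)
    ("HBM (on package)",          1000.0,         4),
    ("GDDR5 (on board)",           300.0,        32),
    ("SSG onboard SSD",              4.0,      1024),
    ("System RAM over PCIe x16",    15.754,      64),
    ("Motherboard SSD",              0.9,      1024),
]

def best_tier(working_set_gb):
    """Return the most preferred tier the working set fits in."""
    for name, _gbps, capacity in tiers:
        if working_set_gb <= capacity:
            return name
    return "System storage (last resort)"

print(best_tier(2))    # small set fits in HBM
print(best_tier(500))  # too big for the DRAM tiers -> onboard SSD
```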

Per my understanding of the situation, the on board SSG storage would still be slightly more beneficial than that setup, but it would get you close (I am assuming the GPU is able to directly interact with and request data from the SSD controller rather than relying on the system CPU to do that work, but I may well be mistaken; I will have to look into this further and ask the experts). On the prototype Radeon Pro SSG, the M.2 slots can actually be seen as drives by the system and OS, so it essentially acts as if a PCI-E adapter card in a motherboard slot were holding those drives, though that may not be the case should this product actually hit the market. I do question the choice of Fiji rather than Polaris, but it sounds like they built the prototype off of the Radeon Pro Duo platform, so it makes sense in that light.

Hopefully the final versions in 2017 or beyond use at least Vega though :).

Alongside the launch of the new Radeon Pro WX (workstation) series graphics cards, AMD teased an interesting new Radeon Pro product: the Radeon Pro SSG. This new professional graphics card pairs a Polaris GPU with up to a terabyte of on board solid state storage and seeks to solve one of the biggest hurdles in GPGPU performance when dealing with extremely large datasets: latency.


One of the core focuses of AMD's HSA (Heterogeneous System Architecture) is unified memory: the ability of various processors (CPU, GPU, specialized co-processors, et al) to work together efficiently by accessing and manipulating data in the same memory pool without having to copy it back and forth between CPU-accessible memory and GPU-accessible memory. The Radeon Pro SSG does not fully realize this idea (it is more of a sidestep), but it does move performance forward. It does not eliminate the need to copy data to the GPU before it can work on it, but once copied, the GPU will be able to work on data stored in what AMD describes as a one terabyte frame buffer. This memory will be solid state and very fast, but more importantly the GPU will be able to get at the data with much lower latency than previous methods. AMD says the solid state storage (likely NAND, though they have not confirmed it) links to the GPU over a dedicated PCI-E bus. I suppose that if you can't bring the GPU to the data, you bring the data to the GPU!

Considering AMD's previous memory champ – the Radeon W9100 – maxed out at 32GB of GDDR5, the teased Radeon Pro SSG with its 1TB of purportedly low latency onboard flash storage opens up a slew of new possibilities for researchers and professionals in media, medical, and scientific roles working with massive datasets for imaging, creation, and simulations! I expect that there are many professionals out there eager to get their hands on one of these cards! They will be able to as well thanks to a beta program launching shortly, so long as they have $10,000 for the hardware!

AMD gave a couple of examples in its PR of the potential benefits of "solid state graphics," including imaging a patient's beating heart in real time so medical professionals can examine and spot issues as early as possible, and using the Radeon Pro SSG to edit and scrub through 8K video in real time at 90 FPS versus 17 FPS with current offerings. On the scientific side, being able to load entire models into the new graphics memory (certainly not as low latency as GDDR5 or HBM) will be a boon, as will getting data sets as close to the GPU as possible in servers running GPU-accelerated databases that power websites accessed by millions of users.

It is not exactly the HSA future I have been waiting for ever so impatiently, but it is a nice advancement and an intriguing idea, and I am very curious to see how well it pans out and whether developers and researchers will truly take advantage of it to further their projects. I suspect something like this could be great for deep learning tasks as well (such as powering the "clouds" behind self driving cars, perhaps).

Stay tuned to PC Perspective for more information as it develops.

This is definitely a product that I will be watching and I hope that it does well. I am curious what Nvidia's and Intel's plans are here as well! What are your thoughts on AMD's "Solid State Graphics" card? All hype or something promising?

Source: AMD

July 26, 2016 | 02:26 AM - Posted by Tim Verry

I'm still not sure why the new cards are all blue, so strange hehe :).

July 26, 2016 | 03:46 AM - Posted by JohnGR

With people willing to pay $1000 for a single core Celeron if it comes with Intel's logo on it, I guess AMD had to make them feel a little better by adopting Intel's color. It will also be easier for the engineers to convince their clueless superiors that the cards are Intel hardware and not AMD. "Look at the color. It is Intel's blue."

July 27, 2016 | 04:00 AM - Posted by Nedjo (not verified)

It is NOT Intel's color! It is a completely new shade of blue that was invented a few weeks ago:
http://www.npr.org/2016/07/16/485696248/a-chemist-accidentally-creates-a...

July 27, 2016 | 12:05 PM - Posted by BlackDove (not verified)

If that's YInMn blue, none of these sRGB-gamut pictures will show it properly, as it's outside that gamut. It would maybe look accurate as a RAW displayed in Adobe RGB, DCI-P3, or Rec. 2020.

July 26, 2016 | 09:57 AM - Posted by Anonymous (not verified)

All the more reason to get an NVM addition to the JEDEC HBM standard. Just imagine an HBM stack with a Micron XPoint NVM die right on the stack alongside the HBM DRAM dies. Everything could be connected from the DRAM dies straight to the NVM die(s) with TSVs for even better speeds without having to leave the HBM die stack. XPoint is denser than DRAM but still a little slower, though much faster than NAND. With NVM included in the HBM stacks and over-provisioned enough to last for over a decade, there would be a very fast and large amount of NVM storage for large textures and other uses for GPUs and APUs. The memory transfers would take place in the background on the HBM die stack(s) and not take up any main-processor-to-HBM bandwidth, with the HBM bottom logic die managing the tiered DRAM-to-NVM storage.

P.S. There are rumors of Intel/others trying to aggressively buy out Micron, but Intel should never be allowed to get Micron, because that would reduce the suppliers of XPoint to one player (Intel)! XPoint has to remain outside of Intel's total control or XPoint costs will go up for lack of competition. Don't let Intel corner the XPoint market.

July 26, 2016 | 06:26 PM - Posted by Anonymous (not verified)

While you could get very good speed to a stacked device, if that device is far out in the memory hierarchy, then that kind of speed is probably unnecessary. This is like current system architectures with DRAM swapped out to disk. There are orders of magnitude differences between a hard drive and an SSD, so it matters whether you have a hard drive or an SSD; it doesn't make much of a difference which SSD you have, though. Just about any modern SSD will seem very similar in actual use. There is still an order of magnitude difference between DRAM and SSDs, so it is still worthwhile to organize the system to touch the SSD as little as possible. This means changing the speed of the SSD has little actual effect on performance except in certain special cases. We have even reached the point where DRAM speed isn't that important: since caching and prefetching are so effective, small changes in memory speed generally make little difference. I would still get a DDR4 based system, but I wouldn't worry much about the clock speed of that memory. I wouldn't pay much for faster RAM either; it isn't going to make much difference.

X-point is supposed to have much lower latency than flash, but it still isn't DRAM. It still has limited write endurance, and using it like DRAM would probably still wear it out. With a graphics card, you would generally be re-writing all of the memory at least every frame, if not more frequently. This would lead to it being used in a similar manner as flash memory. Once you have HBM with all of the memory connections internal to the package, you can afford to run a lot of external connections. An AMD Fury card has very few off-die connections: the PCI-e x16 link, which is a high speed serial connection and therefore low pin count; the display connections, which are also high speed serial, low pin count connections; and the rest is mostly power and ground pins. They could easily run many extra PCI-e links for non-volatile memory. From what I have read, x-point can achieve very low latencies even via the PCI-e link with NVMe.

July 26, 2016 | 07:04 PM - Posted by Anonymous (not verified)

Yes, but the post is talking about XPoint right on the HBM stacks, possibly linked up to the HBM2 DRAM dies by TSVs: a much wider connection than PCIe and very power saving, while also not using up the available PCIe bandwidth. PCIe is clocked high and takes more power. Also, with some in-memory NVM there are plenty of background methods to move data/code between the DRAM dies and the on-stack NVM die without clogging the system memory bus with traffic. I mean for the HBM/NVM JEDEC standard to offer plenty of wide parallel memory block transfers from the HBM memory dies to the HBM/NVM XPoint die via the TSVs.

This is NVM in memory, and the connections could not be any shorter: dies stacked vertically with TSVs passing through both the DRAM dies and the NVM XPoint die. Talk about fast, for NVM that is. With many hundreds of TSVs and several independently operating on-HBM channels, there would be plenty of HBM DRAM-to-XPoint bandwidth to keep things running with as little latency as possible. You are not going to get that through any off-interposer NVM or DRAM way out on the motherboard outskirts. XPoint is faster and more durable than NAND, and with over-provisioning XPoint could probably last the useful life of the device, around 5-7 years.

July 26, 2016 | 08:25 PM - Posted by Anonymous (not verified)

The useful life would depend on how you access it. You would still need to access it infrequently compared to the DRAM, which could be written a huge number of times per second. If you aren't going to access it that much, then reading it via an external connection isn't that costly on power. To get useful capacity to hold a large data set, you would still need a stack of multiple dies, taking valuable space that could be dedicated to DRAM. For a 32 GB HBM GPU or APU, would you want to replace half of that HBM with non-volatile memory? That probably isn't a good trade-off vs. keeping the non-volatile memory external. X-point is supposed to be denser than DRAM, but I don't think it is as dense as flash. Would it be a good idea to replace 16 GB of HBM DRAM with perhaps 64 or 128 GB of x-point? For a few applications, maybe, but for most applications, probably not. You can use the HBM as cache with terabytes of external non-volatile memory backing it up.

Anyway, HBM is an AMD/SK-Hynix thing, even if it is a JEDEC standard. X-Point is a Micron/Intel proprietary technology. Intel or Micron would need to develop a special x-point die with TSVs along with their own HBM die for a mixed stack. The JEDEC standard does not specify anything internal to the stack; they would need to develop their own or license the tech from Samsung or AMD/SK-Hynix. Even if they are going to do that, it would be years into the future. Other companies are working on PCM memory also, but those are probably also years out. Even if stacking non-volatile memory in the device is a good idea (I am not convinced), you are probably looking at 5 years or so in the future, if ever.

July 27, 2016 | 12:06 PM - Posted by Anonymous (not verified)

So Micron sells AMD/SK-Hynix an XPoint die engineered to be added to HBM/HBM2 die stacks, and the same to Samsung/others for their HBM/HBM2 products. Samsung, AMD, SK-Hynix, Micron, Intel, everybody in the memory industry, are all members of JEDEC to f-ing begin with. And I'll bet that even Intel would sell a JEDEC HBM/HBM2/NVM-compatible XPoint die if it meant profits. Micron will be selling XPoint under the Micron brand, just like Intel has its Optane brand of XPoint. So it would not be hard for the JEDEC members to make an updated standard for an NVM die (XPoint or NAND) added to the HBM die stacks. The controller for HBM/HBM2 is in the bottom logic die of the stack anyway, so just add an NVM controller alongside the DRAM controller that is already there. Wire them up with TSVs and plenty of (over-provisioned) XPoint to last for at least 5-7 years, and XPoint memory is very dense! XPoint has bulk material properties (phase-change-like), like the magnetic core memory of old, so data is stored in the bulk atoms/molecules and can be packed more densely than NAND, which requires more space.

This type of in-memory NVM (NAND currently) product is already being done in some DDR DIMM packages, so it will probably be added to a JEDEC HBM standard (with NVM on the HBM die stack) at some point as well. The sooner the better for the entire GPU, CPU, and other processor markets that will be using HBM/HBM2. JEDEC is probably already looking at this. Hell, AMD already wants to add FPGAs to the HBM/HBM2 die stacks for some in-memory compute, so why not add some NVM dies to the HBM stacks too, to complement the HBM DRAM dies.

July 27, 2016 | 01:57 AM - Posted by BlackDove (not verified)

What you're saying is only true of desktops, not supercomputers or large servers.

Skylake Purley CPUs and Knights Hill will both have XPoint interfaces and will use XPoint DIMMs.

July 27, 2016 | 12:08 PM - Posted by BlackDove (not verified)

Intel and Micron developed XPoint and HMC together, and they're part of the HMC consortium along with big players like Fujitsu. Good luck.

July 26, 2016 | 03:19 AM - Posted by Anonymous (not verified)

"It is not exactly the HSA future I have been waiting for"

I hear you, hopefully AMD has some interesting HSA features on at least the coming Opteron platform.

July 26, 2016 | 05:01 AM - Posted by Anonymous (not verified)

So how many write cycles can this flash handle?
Isn't it the case that after heavy processing/writing action over many hours a day (hospital use), the flash goes bad rather soon?

July 26, 2016 | 05:02 AM - Posted by Anonymous (not verified)

Or let's say after some time... which is kind of a reliability problem for a card that costs this much. Strange but also interesting idea for a card; looking forward to some testing, or seeing if it ever changes something in the field.

July 26, 2016 | 12:57 PM - Posted by Anonymous (not verified)

I think it was Tom's Hardware that did a continuous write/rewrite stress test on a 240GB SSD.

I believe the Samsung 840 Pro just kept going and going and going.

A Pro variant of the 850 or later would be fine, with 3D NAND that is even more robust than the 840 Pro's.

July 26, 2016 | 12:57 PM - Posted by Anonymous (not verified)

That is, if you use a 1TB drive, you should be fine.

July 27, 2016 | 02:33 PM - Posted by nanoflower (not verified)

The TechReport tested a number of SSDs and found they could get over a petabyte of data written and read from SSDs before seeing any serious issues. There's enough of a safety margin with current SSDs that wearing out the memory isn't a serious issue.

July 26, 2016 | 05:10 AM - Posted by Jann5s

Tim, does the card still have some GDDR5(X) directly connected to the GPU? or is the NAND completely replacing the GPU memory?

July 26, 2016 | 08:38 AM - Posted by Tim (not verified)

I think that it will have some amount of GDDR5/X or HBM, but no info on that to share yet! I don't think the SSD is replacing the memory completely. I'm hoping to get more details soon.

July 26, 2016 | 03:35 PM - Posted by StephanS

The NAND read speed is about 50 times slower than GDDR5,
and NAND can't be written to efficiently, while GPUs do a MASSIVE amount of writes.

NAND would die quickly...

You are looking at SSDs on the same card as the GPU,
with the GPU using its PCIe logic to control the SSDs.

July 26, 2016 | 05:26 PM - Posted by Anonymous (not verified)

It is almost certainly set up like system RAM which is swapped out to disk. For something like high end rendering, you would write all of the data for an entire scene, or many scenes with shared data, into the SSDs. Very little data would need to be written back to the SSDs. For rendering, and many other tasks, you read a huge amount of data but the results are generally very small. Most of the writes that GPUs do while rendering games involve writing new data for each frame into memory before processing it. With that much memory available, a lot of data can just be left in the frame buffer and evicted when space is needed.

July 26, 2016 | 05:18 AM - Posted by Anonymous (not verified)

1TB frame buffer?!?
If that memory is really being used like RAM, it is going to get thrashed quite heavily.
Given that there is a PCIe interface in the way, it is probably going to be used more as a destination for dumping large datasets, which then get fed into some GDDR, at a guess.
Still, it will be interesting to see how this gets used.

July 26, 2016 | 05:39 PM - Posted by Anonymous (not verified)

It is probably set up like system memory which is swapped out to disk. Pages in active use would remain resident in the DRAM. GPUs already had a form of virtual memory that allowed them to swap data out to system RAM and back in as needed. I don't know if that was used much for gaming code, since any swapping would be detrimental to performance; it is the same as when the active set exceeds the amount of system memory and starts swapping to disk constantly. I would hope that they have implemented a similar system for this memory setup such that it can look like a flat frame buffer to user code. Hooks to allow user code to micro-manage the memory would also be good for HPC.
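The swap-like scheme this comment describes can be illustrated with a toy model (purely illustrative, not AMD's actual implementation): a small DRAM-resident page set backed by the onboard SSD, with least-recently-used eviction:

```python
from collections import OrderedDict

class PagedBuffer:
    """Toy model of DRAM-resident pages backed by a large onboard SSD,
    with LRU eviction. A 'fault' stands in for a fetch from the SSD."""
    def __init__(self, dram_pages):
        self.dram_pages = dram_pages
        self.resident = OrderedDict()   # page_id -> True, in LRU order
        self.faults = 0

    def touch(self, page_id):
        if page_id in self.resident:
            self.resident.move_to_end(page_id)   # hit: refresh LRU position
            return "hit"
        self.faults += 1                          # miss: fetch from SSD
        if len(self.resident) >= self.dram_pages:
            self.resident.popitem(last=False)     # evict least recently used
        self.resident[page_id] = True
        return "fault"

buf = PagedBuffer(dram_pages=2)
accesses = ["A", "B", "A", "C", "B"]
results = [buf.touch(p) for p in accesses]
print(results, buf.faults)
```

The point of the comment stands out in the model: as long as the active set fits in the DRAM pages, the SSD is rarely touched; thrash only starts once the working set exceeds resident capacity.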

July 26, 2016 | 05:43 AM - Posted by Anonymous (not verified)

Cool! Can i install Fallout 4 on my GPU then?

July 26, 2016 | 11:36 AM - Posted by Anonymous (not verified)

Maybe AMD should add a Zen core die to the interposer-based GPU and HBM2 pair and call it an APU on an interposer! I'm sure AMD could fit an 8-core Zen die nicely beside the HBM2 dies that will surround the big Vega flagship GPU die. And maybe AMD could get some XPoint NVM dies from Micron and get SK Hynix to add them to the HBM2 die stacks along with the DRAM dies, using TSVs to connect the HBM2 stack's DRAM dies to the 3D XPoint die for some very fast NVM texture storage.

So users could be getting an APU on an interposer on a PCI card to add some extra CPU power to go along with the big Vega GPU and HBM2 stacks with their own XPoint NVM capacity, and move the entire game/gaming engine and gaming OS (Linux and Vulkan API based) to the PCI-card-based APU. That would be a complete gaming system on a PCI card! The APU on an interposer is coming; AMD is already planning some APU-on-an-interposer designs for the HPC/server/workstation markets, so consumer variants will follow!

That PCI SSD M.2/whatever NVM is a start, but a JEDEC HBM/NVM standard (lots of 3D XPoint NVM, please) would be the next logical step in the evolution of AMD's GPU-on-an-interposer, or APU-on-an-interposer, designs. Let that NVM be XPoint added to HBM2 stacks and things will get interesting!

July 26, 2016 | 06:57 PM - Posted by Anonymous (not verified)

I have seen you (or someone else; I don't know) post this idea before. I doubt that we are headed in that direction. Flash has too low a write endurance to want to use it in this manner. X-point is supposed to have significantly higher write endurance and lower latency, but it still seems a bad idea to integrate such a device into the package. I would think it would still be preferable to use this as an external device, since it will still be farther out in the memory hierarchy than DRAM. I don't know how high PCIe latency is, but I suspect it is low enough to still make good use of an x-point device.

HMC actually could be complementary to HBM, since HMC can provide more off-package bandwidth than DDR memory variants at lower power. It can also provide the necessary capacity; HBM is obviously capacity limited. For HPC applications using large amounts of data, they could connect large amounts of HMC to back up the HBM, with the HBM acting like a large cache. Unfortunately, HMC is Intel/Micron and not a JEDEC standard, so AMD does not have access. Intel could use HBM though, and actually might be; I thought the FPGA company that they bought was using HBM.

For the consumer space, I would like to see some new m.2 form factor standards that allow an m.2 device to contain different types of memory, both volatile and non-volatile. If PCIe is too high latency, then perhaps they should look at something similar to the HMC interconnect: a serialized interface that is probably quite similar to PCIe electrically. It would be great to be able to connect something with flash, x-point, or other new memory types. Other companies are working on PCM variants and other next gen memory technologies. Also, it would be good to allow volatile stacked DRAM; you could have a large amount of memory on an m.2 sized device with stacked DRAM. This would allow easy, compact memory expansion for HBM based APUs. The DRAM would still be used as swap space though. This seems like a better solution than trying to integrate non-volatile memory types into the package. A large number of pins can be made available for off package interconnect with the HBM interconnect all internal to the package. You could have hundreds of PCIe links on a high end device.

July 26, 2016 | 07:21 PM - Posted by Anonymous (not verified)

The server variants with APUs on an interposer are already in the design stages, and servers with PCI based processor cards are already in use, so AMD's GPUs on an interposer are already 2/3rds of the way there; it would be very easy to add some Zen cores onto the interposer and call the whole thing an APU (on an interposer). I'd like to see a Mac Pro get some of those workstation APUs on an interposer with 8/16 Zen cores, HBM2, and a big Vega pro GPU die. There would be no need for any big CPU SKU on the motherboard as the 2 cards would already have their own complement of Zen cores. The system would essentially be a workstation cluster in a can.

July 26, 2016 | 09:25 PM - Posted by Anonymous (not verified)

There are still constraints on the space available on the interposer. Four stacks of HBM2 will take around 400 square mm. If Vega is a large GPU, there may not be much space left for separate CPUs.

July 27, 2016 | 01:50 AM - Posted by Tim Verry

Yes, that was my thought as well, that there just would not be room for CPU + HBM2 + Vega. That would need a massive interposer, and I think I remember Josh saying that right now they are a bit limited in how big they can reasonably make it...

July 27, 2016 | 12:44 PM - Posted by Anonymous (not verified)

Have you looked at the die shots of the populated Fury X interposer and seen the unoccupied interposer space? There is enough space for more than 4 HBM stacks, or probably a 4-core Zen die per side for at least 8 Zen cores. I'll bet that 2 Zen cores don't take up as much area as one HBM2 stack. And the interposer itself is made of silicon, so maybe some logic could be etched onto the interposer's silicon. They could also probably splice interposers together to make designs for larger APUs on an interposer. If you look at any of Chipworks' die shots of GPUs, GPUs are just bunches of modular units networked on a single monolithic die, so that could be broken down into individual dies with the network etched into the silicon interposer instead (Navi may be made modular this way).

July 27, 2016 | 11:39 PM - Posted by Anonymous (not verified)

HBM1 is only about 40 square mm per die stack; HBM2 is about 92 square mm per die stack. There are ways to make the interposer larger, but interposers are already an expensive tech for the consumer space, and the techniques that make them larger than the reticle are probably too expensive for a consumer product. The Nvidia GP100 (or whatever it is called) seems to be using over 1000 square mm of space with a 600+ square mm GPU and 4 HBM2 stacks (probably close to 400 square mm). It is not cheap. It is unclear what the actual reticle size limit is at this point; for the process AMD was using, I had heard the reticle size quoted as somewhere around 840 square mm, though Nvidia's part is larger than this. We will have to wait and see how big Vega is. There may still be room for some CPU cores, but there are limitations, both technological and economical, and there will be trade-offs between the number of CPU cores and the size of the GPU. I think silicon interposers are going to be one of the biggest changes in system architecture in a very long time, but when something like this comes out, people speculate wildly with little knowledge of the actual trade-offs and limitations involved. We only have one consumer level product so far, so we don't even know if interposer based devices are economical in the consumer space yet.
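A rough area budget, using the numbers quoted in this thread (the mid-size GPU and CPU die areas below are hypothetical placeholders), shows why fitting CPU cores next to a big GPU and four HBM2 stacks is tight:

```python
# Interposer area budget sketch; all figures in mm^2.
HBM2_STACK = 92          # ~92 mm^2 footprint per HBM2 stack (from the thread)
RETICLE_LIMIT = 840      # approximate reticle limit quoted in the thread

# A GP100-style layout: 600+ mm^2 GPU plus 4 HBM2 stacks.
gp100_style = 610 + 4 * HBM2_STACK
print(gp100_style)       # 978 mm^2 -- already past the quoted limit

# Hypothetical: mid-size GPU + 4 stacks + a ~200 mm^2 8-core CPU die.
mid_gpu = 480 + 4 * HBM2_STACK + 200
print(mid_gpu, mid_gpu <= RETICLE_LIMIT)
```

Under these (assumed) die sizes, even a mid-size GPU with a CPU die blows past the quoted reticle limit, which is the commenter's point about trade-offs.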

July 28, 2016 | 10:06 AM - Posted by Anonymous (not verified)

"The techniques that make them larger than the reticle are probably too expensive for a consumer product" - no, not with the server/HPC market buying loads of interposer based GPUs currently, and AMD's HPC/server/workstation interposer based APUs when they come to market. The high costs of R&D and the capital equipment to build even larger interposer sizes will be very quickly amortized by the HPC/server/workstation market's higher margins for these interposer based processor systems! AMD is already 2/3rds of the way there with its Fury X design to making a very powerful consumer gaming APU on an interposer by simply adding some Zen cores to any Polaris/Vega based variant of a system like the Fury X! Those die shots of the populated Fury X interposer showed plenty of remaining space for some Zen core dies, and remember that the Fury X's GPU die was fabricated at 28nm, so any Polaris/Vega silicon is going to be smaller while still having more ACEs and CUs!

The very reason HBM2 is not in the consumer realm yet, from either AMD or Nvidia, is that the HPC/server/workstation market's money screams much louder than the consumer market's limited discretionary budgets. The server/HPC/workstation market will pay top dollar for HBM2 based GPU/APU accelerator and SOC parts, and those economies of scale will eventually bring interposer/HBM2 costs down very quickly once the R&D and new capital/physical plant expenditures are fully amortized. Once that market's appetite for HBM2/interposer parts is satiated, the consumer market with its lower margins will get access to the supply, and with those R&D and plant costs fully amortized, the parts will become ubiquitous in the consumer market in short order.

That professional business market's interposer/HBM usage will pay for the R&D/new physical plant costs, with the consumer market's overall contribution to those initial costs pretty much nil in comparison.

July 26, 2016 | 06:28 AM - Posted by K0MEPCA (not verified)

OK, a few clarifications are needed here. Yes, the card will still have GDDR/HBM, because that huge "SSD", although low latency, will not even come close to the performance of RAM (duh!). Also, don't get your hopes up on getting the card even if you have $10,000 burning a hole in your pocket (and no, that is NOT the retail price). AMD stated they will sell this developer version (a very important distinction) only to people/companies that they believe will be able to contribute to polishing the technology they are trying to implement here. So in short, if you want to score a card for testing/review you're out of luck, because that's not contributing to the technology and AMD will not provide/sell you the card.

July 26, 2016 | 08:41 AM - Posted by Tim (not verified)

Good point that developer kits are limited to those professionals that can help them test and QA it with proper workloads. AMD did say they expect full availability in 2017.

July 26, 2016 | 10:18 AM - Posted by Anonymous (not verified)

Developer kits always cost more because they are custom bits of technology for developers to get their software certified to run using the new technology. So the pricing of the developer SKUs will not reflect the final pricing of the mass produced SKUs that use this technology! AMD needs to Partner with Micron(3D XPoint/other NVM) and get the JEDEC adoption of an HBM NVM standard so NVM dies could be directly integrated into the HBM Die stacks.

July 26, 2016 | 07:19 AM - Posted by Anonymous (not verified)

It's an interesting idea, but even dual M.2 slots amount to only an x8-wide PCIe 3.0 link. For pure throughput, the PCIe x16 slot is still going to win out, so this only appears to make sense for workloads where the latency of accessing data over the main PCIe bus via the PCH is the bottleneck, rather than pure throughput (if throughput were your limit, then one or more PCIe x8 SSDs sitting in an adjacent slot would be faster, accessed directly via the PCH through DMA rather than via the CPU).
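As a sanity check on those bus-width numbers, here is a quick sketch of theoretical one-direction PCIe 3.0 bandwidth (128b/130b line encoding; protocol overhead ignored):

```python
GT_PER_S = 8e9        # 8 GT/s per PCIe 3.0 lane
ENCODING = 128 / 130  # 128b/130b line encoding
BITS_PER_BYTE = 8

def pcie3_bandwidth_gbs(lanes):
    """Theoretical one-direction bandwidth in GB/s for a PCIe 3.0 link."""
    return lanes * GT_PER_S * ENCODING / BITS_PER_BYTE / 1e9

print(pcie3_bandwidth_gbs(4))   # x4  (one M.2 slot)
print(pcie3_bandwidth_gbs(8))   # x8  (dual M.2)
print(pcie3_bandwidth_gbs(16))  # x16 (GPU slot)
```

This gives roughly 3.94, 7.88, and 15.75 GB/s respectively, matching the x16 figure quoted in the article and the commenter's point that dual M.2 drives tops out at about half of what the x16 slot can move.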

July 26, 2016 | 08:51 AM - Posted by Anonymous (not verified)

This is a godsend for video production. Very cool that it's expandable by the user.

July 26, 2016 | 09:05 AM - Posted by Anonymous (not verified)

4,500+ MB/s is awe inspiring!

Pair it with some nice SLC?

July 26, 2016 | 10:52 AM - Posted by Anonymous (not verified)

Put the NVM on the HBM die stack/s and make that NVM XPoint wired up to the HBM DRAM dies using TSVs and that would put any PCI connected SLC NAND to shame! It's time for a new JEDEC HBM DRAM with NVM mashup standard and some really fast NVM texture/other storage on the HBM die stack/s!

Make this happen AMD, Micron(XPoint), SK Hynix, Samsung, others, and JEDEC!

July 26, 2016 | 02:01 PM - Posted by StephanS

The very slow flash is accessed over a PCIe bus using the same method as a normal SSD over PCIe.

Yes, technically you might have minor advantages,
but that should be around 10% versus using the main PCIe 3.0 interface.

This sounds like a total gimmick to me for the price.
If the cost were the same as adding a 1TB SSD I would say BRAVO.
But $9000 for a 1TB SSD... count me out.
I will use a high-end PCIe SSD and get 99% of the performance.

Let's recall that the board connector (PCIe 3.0 x16) can transfer ~15GB/s.
4K 4:4:4 at 30fps is 711MB/s.
So under 5% of the PCIe 3.0 connector per uncompressed channel.

(And it's bidirectional, so during rendering, sending the final rendered frame back to the SSD is free.)

Also, most 4K/8K video is compressed (storage capacity is otherwise a problem), and the H.264 and H.265 decoders on the GPU are 100% standard compliant, meaning they decode video binary-identical to a software decoder. In this case the bandwidth drops to nearly nothing.
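For what it's worth, the uncompressed-rate figure above can be reproduced with a quick sketch (assuming 8-bit 4:4:4 UHD and binary megabytes, which is apparently how the commenter counted):

```python
# Uncompressed 4K (UHD) 4:4:4 8-bit at 30 fps:
width, height = 3840, 2160
bytes_per_pixel = 3   # 4:4:4 chroma, 8 bits per channel
fps = 30

rate = width * height * bytes_per_pixel * fps  # bytes per second
print(rate / 2**20)                            # ~711.9 MiB/s

# As a fraction of a PCIe 3.0 x16 link (~15.754 GB/s usable):
print(rate / 15.754e9 * 100)                   # under 5% per uncompressed stream
```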

July 26, 2016 | 05:39 PM - Posted by Anonymous (not verified)

If you think it's not a big deal, then it's safe to say it's not you.

Oh, and you edit raw uncompressed video, not compressed obviously.

July 28, 2016 | 01:57 PM - Posted by StephanS

"all" source material is compressed. Even with 1080p its rare to find any device that capture material raw.

So when you edit, in your timeline all the tracks are loaded with compressed media.

The GPU can easily decompress many channels of h.264 video.

etc... I worked on a few projects that did just that.

BTW, H.264 can be visually lossless, so it's very rare that people bother doing raw output.

Also, we are talking about an SSD on an x8 PCIe bus, while NVMe SSDs can sit on x16 PCIe and do 10GB/s.

So you possibly gain a little latency, but you lose on transfer time.
It might actually be faster to fill 1GB of VRAM via an NVMe SSD than via the onboard attached SSD.

My guess is that this is more of a special-case card designed for a datacenter like Google, for some very specialized database-optimized system.

July 26, 2016 | 06:49 PM - Posted by Anonymous (not verified)

Gimmick was the word people used when they referred to Mantle, saying it's not possible to use software trickery to get that much extra performance... Don't be one of those people.

July 26, 2016 | 07:41 PM - Posted by Anonymous (not verified)

This might be a very compelling product for some applications. At this point it seems like mostly a development platform for later products, though. With HBM you have very high memory bandwidth, but it is very limited in capacity. It will still top out at 32 GB for HBM2-based devices, and that is probably pushing the current limits of interposer size for 4 stacks of HBM2 and a large GPU. It is interesting that SSDs seem to be considered fast enough as a second level of off-die memory. I am surprised that they didn't do off-package DRAM instead, even if it was standard DDR4 registered ECC memory modules. With smart management of the memory it could be very effective, but it will involve some software development. Although, just implementing a virtual-memory-type system with page swapping could be very effective for many applications.

I don't know how high the latency is to access system memory through the PCIe bus, but I could imagine accessing a local PCIe SSD being very low latency. The hardware and/or software can be very latency-optimized: there isn't much need to include all of the levels of abstraction that are required for accessing an SSD under an OS. The NVMe software stack is much lower latency than the old AHCI stack, but I don't know if this would even need to implement an NVMe software stack; it could be implemented mostly in hardware. I don't know how you would measure the absolute latency in either case, though. The bandwidth would be limited to <8 GB/s for PCIe v3 x8, I believe, but perhaps that is enough. It should take some load off the rest of the system also. It should be interesting, especially if future Intel X-Point drives can be used.
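To put the tiers being discussed in perspective, here is a rough sketch of how long a 1 GB working set takes to move at each tier's peak bandwidth, using the figures quoted in this article (latency, which is the real differentiator here, is not modeled):

```python
# Peak-bandwidth figures as quoted in the article (GB/s); illustrative only.
tiers_gbs = {
    "HBM (on package)":              1000.0,
    "GDDR5 (on card)":                300.0,
    "System RAM via PCIe 3.0 x16":     15.754,
    "Onboard SSDs (RAID-0, x8)":        4.0,
    "Chipset-attached M.2 SSD":         0.9,
}

# Time to move a 1 GB working set at each tier's peak rate.
for tier, gbs in tiers_gbs.items():
    print(f"{tier:32s} {1.0 / gbs * 1000:8.2f} ms")
```

Even at peak rates, the onboard SSDs take roughly 4x longer per gigabyte than going over the x16 link to system RAM, which is why the onboard storage only wins on latency and capacity, not raw throughput.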

July 26, 2016 | 10:13 PM - Posted by Anonymous (not verified)

The question is:
Why is the demo that AMD showed so much faster with the SSG compared to using an NVMe SSD that is just a regular part of the system?
Where's the trick?

In the demo it's mentioned that the transfer rate reached 4.5 GB/s. A very good value for M.2 SSDs in RAID-0 mode. But why can't this be done over the PCIe bus?

And at this rate the 1TB will be sucked dry in 3.8 minutes.
Assuming 90fps as shown in the demo, that gives you exactly 13.6 minutes of 25fps video you can edit.
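The back-of-envelope arithmetic checks out; a quick sketch (counting 1 TB as 1024 GB, as the commenter apparently did):

```python
# How long 1 TB lasts at the demoed 4.5 GB/s scrub rate,
# and how much 25 fps footage that is if the demo scrubbed at 90 fps.
capacity_gb = 1024
rate_gbs = 4.5

drain_min = capacity_gb / rate_gbs / 60
print(drain_min)               # ~3.8 minutes at the full 90 fps scrub rate
print(drain_min * 90 / 25)     # ~13.6 minutes of 25 fps footage
```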

July 27, 2016 | 01:54 AM - Posted by Tim Verry

Ryan Smith @ AnandTech guessed that the low 900 MB/s throughput on the system-attached Samsung 950 may have something to do with it being attached to the motherboard via the chipset, with the eventual PCI-E x4 link being shared with all other chipset-attached devices (like other storage drives). I'm not sure why it did not do very well in raw transfer speeds; even if it was higher latency, read throughput should have been higher, but I'm not sure. The Samsung 950 Pro in our review did better than 900 MB/s: http://www.pcper.com/reviews/Storage/Samsung-950-PRO-256GB-and-512GB-M2-NVMe-PCIe-SSD-Review/Sequential-Performance-HDTac

July 27, 2016 | 03:16 AM - Posted by Anonymous (not verified)

AMD has some marketing names for various DMA capabilities; DirectGMA and XDMA. I would assume that they are doing direct DMA transfers between the SSDs and the GPU. I thought AMD crossfire worked even if the PCIe slot for one of the graphics cards is off the chipset. When the destination is the video card rather than system memory with a chipset connected PCIe SSD, does it need to do two DMA operations; one to bring data into system memory and another to write it back out to GPU memory? All of the numbers we have are to get the data into system memory. Perhaps there is a software (or SSD hardware) limitation that prevents direct DMA between the devices if it is connected via the chipset.

July 27, 2016 | 03:41 AM - Posted by Anonymous (not verified)

Thanks for the Information.
So AMD actually puts a highend 10K US$ device into a regular cosumer board with m.2 Slot SSD and compares the transferates of the two... ^_^

So the questions we have to ask now are:
- Why is the performance of the onboard M.2 socket of that particular mainboard so slow?
- Why does AMD compare an SSG with RAID-0 storage against a single M.2 socket with obviously no RAID capability?

To eliminate this unfairness, we could instead put a PCIe x8 card with two M.2 sockets into a PCIe x8 slot (resp. PCIe x16), then populate the two M.2 sockets and let them also run in RAID-0 mode.
-> Equality restored!
Actually even a PCIe x4 card would be enough to reach nearly 4GB/s throughput.

Concluding, it seems that AMD wants to reduce the complexity of configuring ultrafast SSD storage and close the gap until the new U.2 standard is widely spread and we have U.2 connectors on mainboards and storage controllers as standard (or the industry might even come up with something different that reduces the size and cost of this huge U.2 connector and cabling, who knows?).
So this SSG might not be intended for high-end systems, but rather for something like easy-to-build workstations.
And as they say the minimum storage configuration is 1TB, there is room for bigger configurations.

July 27, 2016 | 02:43 AM - Posted by Anonymous (not verified)

I would expect future HBM-based compute cards to include extra memory, although I would expect it to be closer to Intel's Knights Landing system rather than this SSD-based system. Knights Landing uses a small amount of memory, 8 to 16 GB (according to Wikipedia), along with up to 384 GB of DDR4. I don't know exactly what the small, fast memory is. It is apparently stacked memory with many channels, but it is not on a silicon interposer and is not as fast as HBM. This SSD-based system seems to mostly be a development platform for making use of "far" memory. Perhaps they can use the same software interface when they have a product with extra DRAM. I don't know if there will be an actual non-dev-kit product with this specific configuration.

July 27, 2016 | 09:15 AM - Posted by YTech

HBM-based components will increase in memory size. However, for the Radeon Pro SSG, it's a matter of volatile versus non-volatile memory.

For big data on the GPU, it is important to have the ability to store the data locally and to have random access memory to compute on it toward the desired goal, with results then written back to the stored data.

And when the task is completed, to return the results to primary storage for CPU-related tasks.

Think of it as video editing in a film studio. Most of the work will be on the GPU and when it's ready to produce the final consumer product, the CPU does the rest.

Access to both volatile and non-volatile memory for both processors is important and they should both have their own share.

July 27, 2016 | 11:52 AM - Posted by BlackDove (not verified)

Knights Landing uses Micron HMC RAM as its near memory, using Intel's EMIB instead of a silicon interposer. Knights Landing has three addressing modes for the near and far memory.

Knights Hill is reportedly getting rid of the DDR4, reducing the core count, and having a higher byte/FLOP ratio, since pre-exascale and exascale systems require more memory and interconnect bandwidth than petascale systems. It will also feature what Skylake's converged E5/E7 Xeons will: XPoint nonvolatile memory.

The next big systems from Cray(Shasta) will use Knights Hill and XPoint.

This GPU with a built-in SSD confuses me a bit, as it has the limitations of flash memory. As others have pointed out, I'm not sure that it offers much benefit over NVMe SSDs paired with normal GPUs.

Intel and Micron are working together and directly competing with Nvidia and AMD, so I doubt you'll see XPoint on any GPU. HMC and XPoint will likely be seen in Fujitsu's Post-K exascale ARM system around 2020, and they're aiming for 100x the application performance of K. I believe they're also collaborating with Intel on their silicon photonic interconnect fabric for Tofu 3 (Tofu 2 in the PrimeHPC FX100 is already optical).

If I were to create a GPU with lots of nonvolatile storage, I'd probably just use something like NVLink on a GP100 to talk to a flash controller and address a large SLC SSD.

July 27, 2016 | 02:50 PM - Posted by Tim (not verified)

Interesting about Knights Hill!

July 27, 2016 | 03:20 PM - Posted by Anonymous (not verified)

I didn't know that they had the EMIB interconnect tech available yet. This is basically a competing technology to silicon interposers. They may be able to scale it up to HBM level bandwidths; we will see.

This product allows the GPU to directly talk to the SSD via DMA. This should reduce latency significantly and increase the achievable bandwidth. If you connect this via NVLink, it would still need to bounce packets through the NVLink Host controller to the PCIe host controller, which would increase latency. NVLink is faster than two m.2 but so is a 16x PCIe link to the CPU.

Also, x-point could be used on these GPUs via an Optane m.2 device. It wouldn't be anywhere near as good as using Optane DIMMs, but it still could provide quite low latency scratch memory for big data sets.

July 28, 2016 | 01:50 AM - Posted by BlackDove (not verified)

EMIB is also currently in use on Altera FPGA SiPs and the bandwidth seems comparable to interposers. HMC has been in use for a while with Fujitsu SPARC XIfx and those have 480GB/s.

What I was suggesting with NVLink is to make the GPU itself capable of directly addressing the nonvolatile memory, rather than needing to go through multiple hops. I don't know enough about NVLink's protocols or capabilities to know how difficult it would be to add a controller to the GPU die and have it communicate directly with some nonvolatile storage.

Perhaps what Nvidia does with their ARM integration could be used similarly, to make a GPU with direct access to TBs of storage.

July 27, 2016 | 09:03 AM - Posted by YTech

Response to the July 27th update:

The reason AMD didn't design a separate PCI-E SSD storage card is that an additional card on a PCI-E lane would have to pass through the CPU in order to return data to the GPU. This is what AMD is trying to skip, to reduce latency with big data.

Being on the same card as the GPU allows the GPU to communicate directly with the storage and not count on the CPU to negotiate the data transmission.

This is why nVidia has SLI bridges for their GPU cards: to cut data latency as much as possible when communicating with the CPU through the PCI-E controllers, whereas CrossFire counts on the CPU to give priority to the GPU requests.

July 27, 2016 | 01:10 PM - Posted by Anonymous (not verified)

Yes, there are fewer transmission hops, so less latency, with the NVM on the same PCI card as the GPU. The NVM on the GPU card means fewer transmission encoding/decoding steps and less usage of motherboard PCIe bandwidth, since the GPU-to-NVM PCIe signals never have to leave the GPU's PCB. There will be more latency for the CPU to get at the NVM on the GPU card, but for the GPU, having that NVM closer and more direct makes things much faster. The final evolution would be to have some NVM dies (XPoint or NAND) on the HBM stacks, wired up with TSVs, if they can make sure that whatever NVM goes on the HBM stacks can last for 5-7 years minimum at heavy usage. AMD wants to put FPGAs on the HBM2 stacks for some in-memory compute, if you look at some of AMD's exascale computing APU-on-an-interposer proposals for the government exascale initiative.

July 27, 2016 | 02:55 PM - Posted by Tim (not verified)

Thanks for clarifying. That's how I thought it worked (gpu could work with ssd and not have to send requests out to system cpu to get data) but I was not 100% sure my thinking was accurate heh.

July 27, 2016 | 06:03 PM - Posted by Anonymous (not verified)

It sounds like it's similar to their XDMA technology, except they are doing DMA transfers directly between the GPU and the SSDs rather than between two GPUs. The data flow should be directly between the two; I don't know if there is any communication back to the CPU to initiate the transfers. It is unclear whether the same thing could be done with both devices connected to the CPU: PCIe is not a shared bus but point-to-point links, so if you run an x8 connection to the GPU and the other x8 to M.2 cards, traffic would still go through the CPU.
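As an illustration of why the hop count matters, here is a toy model (all numbers assumed, purely illustrative) comparing a single peer-to-peer DMA against bouncing the data through system memory in two DMA operations:

```python
def transfer_ms(n_bytes, hop_bandwidths_gbs, hop_latency_us=10):
    """Each hop adds an assumed setup latency plus a full copy at that hop's bandwidth."""
    copy_ms = sum(n_bytes / (gbs * 1e9) * 1000 for gbs in hop_bandwidths_gbs)
    return copy_ms + len(hop_bandwidths_gbs) * hop_latency_us / 1000

one_gb = 1e9
p2p    = transfer_ms(one_gb, [4.0])           # SSD -> GPU directly, one DMA
bounce = transfer_ms(one_gb, [4.0, 15.754])   # SSD -> system RAM -> GPU, two DMAs

print(p2p, bounce)  # the bounce pays a second full copy plus extra setup latency
```

The exact figures are made up, but the structure of the argument holds: the bounced path can never be faster than the direct path, because it performs the direct path's copy plus an additional one.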

July 27, 2016 | 08:04 PM - Posted by Tim (not verified)

Hmm, yeah, it looks like direct CPU-free transfers would be possible using their XDMA engine technology. I would think that it should be possible to pair a graphics card with a standalone adapter card hosting PCI-E SSDs so long as the card/controller also had XDMA support (is this engine firmware code and not dedicated hardware, or am I wrong? The article I was looking at seemed to say it was not dedicated hardware?).

If possible, I would think the standalone adapter approach would get very close to the same performance, but the SSG having its own little PCI-E bus island I think would give it a bit of a latency advantage, since it would not have to go out to the PCIe host controller on the mobo.

hmm...

Well, my battery is dying so I'll send this before I lose it hehe. Hopefully it's a coherent thought as I'm trying to type fast :)

July 27, 2016 | 08:05 PM - Posted by Tim (not verified)

http://www.anandtech.com/show/7457/the-radeon-r9-290x-review/4 is the article I referred to

July 27, 2016 | 11:11 PM - Posted by Anonymous (not verified)

It seems like it should work no matter where the devices are connected, as long as both devices are on the same PCIe root complex. More hops means higher latency though. I think most, if not all, of the current single-CPU systems have a single PCIe root complex in the CPU; the chipset links are an extension of the root complex in the CPU. This is presumably what allows AMD CrossFire to work even if a card is connected via the chipset links. Multi-socket systems or AMD's Opteron processors that place two CPUs on the same package probably have multiple PCIe root complexes. Perhaps there are other limitations with this set-up since the SSDs do not have any special DMA units; I don't know.

July 28, 2016 | 10:36 AM - Posted by Anonymous (not verified)

Don't forget the Hot Chips Symposium this year! At the Flint Center, Cupertino, CA, Sunday-Tuesday, August 21-23, 2016.

From NVIDIA, "Tegra-Next System-on-Chip" will be discussed, along with Nvidia's "Ultra-Performance Pascal GPU and NVLink Interconnect", on 8/22, conference day 1. From AMD, "A New, High Performance x86 Core Design" on conference day 2, 8/23. IBM's Power9 will be discussed on 8/23, conference day 2. Plenty of other big names will be there, including ARM Holdings talking about "ARMv8-A Next Generation Vector Architecture for HPC" and their ARM "Bifrost, the new GPU architecture and its initial implementation, Mali-G71"!

"Hot Chips: A Symposium on High Performance Chips"

http://www.hotchips.org/program/
