NVIDIA Discusses Multi-Die GPUs

Manufacturer: NVIDIA

Performance not two-die four.

When designing an integrated circuit, you are attempting to fit as much complexity as possible within your budget of space, power, and so forth. One harsh limitation for GPUs is that, while your workloads could theoretically benefit from more and more processing units, the number of usable chips from a batch shrinks as designs grow, and the reticle limit of a fab’s manufacturing node is basically a brick wall.

What’s one way around it? Split your design across multiple dies!


NVIDIA published a research paper discussing just that. In their figure, they show two examples. In the first, the GPU is a single, typical die surrounded by four stacks of HBM, like GP100; the second configuration breaks the GPU into five dies: four GPU modules and an I/O controller, with each GPU module attached to a pair of HBM stacks.

NVIDIA ran simulations to determine how this chip would perform and, across various workloads, found that it out-performed the largest possible single-chip GPU by about 45.5%. They also scaled up the single-chip design until it had the same number of compute units as the multi-die design, even though this wouldn't work in the real world because no fab could actually lithograph it. Regardless, that hypothetical, impossible design was only ~10% faster than the actually-possible multi-chip one, showing that the overhead of splitting the design is only around that much, according to their simulation. The multi-die design was also faster than the multi-card equivalent by 26.8%.
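For a rough sense of how those three figures relate to each other, here's a quick sketch. The normalized throughput baselines are hypothetical illustrations; only the percentage deltas come from the reported results:

```python
# Normalize the largest buildable single-die GPU to 1.0 throughput.
buildable_monolithic = 1.0

# The multi-die (MCM) design is reported as ~45.5% faster than that.
mcm = buildable_monolithic * 1.455

# The unbuildable monolithic design with the same compute-unit count
# is ~10% faster than the MCM, so splitting the die costs roughly 10%.
ideal_monolithic = mcm * 1.10

# The MCM is also ~26.8% faster than a multi-card setup with the same
# total compute, which implies a multi-card baseline of:
multi_card = mcm / 1.268

print(f"MCM vs buildable die:  {mcm / buildable_monolithic - 1:.1%} faster")
print(f"Ideal vs MCM overhead: {ideal_monolithic / mcm - 1:.1%}")
print(f"MCM vs multi-card:     {mcm / multi_card - 1:.1%} faster")
```

In other words, of the three options that can actually be built, the multi-die design sits closest to the unattainable ideal.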

NVIDIA's simulations, run on 48 different benchmarks, account for cross-module traffic, but I still can't visualize how this would work in an automated way. I don't know how the design would automatically account for fetching data that's associated with other GPU modules, as this would probably cause a huge stall. That said, they spent quite a bit of time discussing how much bandwidth is required within the package, and figures of 768 GB/s to 3 TB/s were mentioned, so it's possible that it's just the same tricks as fetching from global memory. The paper touches on the topic several times, but I didn't see anything explicit about what they were doing.
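As a back-of-the-envelope illustration of why inter-module bandwidth matters: if some fraction of each module's memory accesses land in another module's HBM, the package links must carry that traffic. The per-stack bandwidth and traffic fractions below are made-up assumptions, not figures from the paper, but the results land in the same ballpark as the lower end of the quoted range:

```python
# Hypothetical per-module numbers: two HBM stacks of ~256 GB/s each,
# for 512 GB/s of local bandwidth per GPU module.
local_bw_gbs = 2 * 256

def required_link_bw(remote_fraction, modules=4):
    """Aggregate inter-module bandwidth if each of `modules` GPU
    modules sends `remote_fraction` of its memory traffic off-module
    (a crude average-rate model; real traffic is much burstier)."""
    return modules * local_bw_gbs * remote_fraction

for frac in (0.10, 0.25, 0.50):
    print(f"{frac:.0%} remote traffic -> "
          f"{required_link_bw(frac):.0f} GB/s on-package")
```

Even at modest remote-access rates, the on-package fabric has to move hundreds of gigabytes per second, which is consistent with the paper treating link bandwidth as a first-order design parameter.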


If you’ve been following the site over the last couple of months, you’ll note that this is basically the same as what AMD is doing with Threadripper and EPYC. The main difference is that CPU cores are isolated, so sharing data between them is explicit. In fact, when that product was announced, I thought, “Huh, that would be cool for GPUs. I wonder if it’s possible, or if it would just end up being Crossfire / SLI.”

Apparently not? It should be possible?

I should note that I doubt this will be relevant for consumers. The GPU is the most expensive part of a graphics card. While the thought of four GP102-level chips working together sounds great for 4K (which is 4x1080p in resolution) gaming, quadrupling the expensive part sounds like a giant price-tag. That said, the market of GP100 (and the upcoming GV100) would pay five-plus digits for the absolute fastest compute device for deep-learning, scientific research, and so forth.

The only way I could see this working for gamers is if NVIDIA finds the sweet-spot for performance-to-yield (for a given node and time) and they scale their product stack with multiples of that. In that case, it might be cost-advantageous to hit some level of performance, versus trying to do it with a single, giant chip.

This is just my speculation, however. It’ll be interesting to see where this goes, whenever it does.

July 5, 2017 | 09:37 AM - Posted by psuedonymous

The key difference between this and a regular Multi-Chip Module like Threadripper/Epyc is that in an MCM, each die is an 'island', with explicit interconnects handing things off between dies. Basically the same situation as a multi-socket layout, but with everything squished reeeal close together. With the technique Nvidia have proposed, things are closer to if you had snapped the die in half, pulled the halves apart, then re-linked all the hanging traces back together* (similar to what portable-console hackers do at the macro scale with oversized PCBs). As long as the longer path length is taken into account, the device may not even be explicitly aware that it is not monolithic.

*It's not quite that easy, and due to the need for signal amplification on longer traces it's easier to consolidate everything into an explicit bus. Think of it by analogy: compare using a PCIe extender ribbon (a stretched monolithic die, totally transparent), to Thunderbolt (still PCIe but with an explicit translation layer, transparent in almost all cases), to a protocol-translating extender (e.g. Pericom) passing PCIe over USB (not transparent, performance impact).

July 6, 2017 | 02:49 PM - Posted by RealExascale (not verified)

This is NUMA in package with L1.5 cache to reduce communication and help with data locality.
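For intuition on what a cache level between the private L1s and the shared L2 buys in a NUMA-in-package design, here's a toy sketch. The class names, dict-as-memory model, and caching policy are purely illustrative assumptions, not details from the paper; the point is that repeated reads of remote data cross the package links only once:

```python
# Toy model of a per-module "L1.5" that caches data homed on other
# GPU modules, so repeated remote reads don't re-cross the links.

class GPUModule:
    def __init__(self, module_id, memory):
        self.module_id = module_id
        self.memory = memory          # this module's local HBM (dict)
        self.l15 = {}                 # cache for remote-module data
        self.remote_fetches = 0       # crossings of the package links

    def read(self, addr, owner):
        # Local addresses go straight to local HBM.
        if owner.module_id == self.module_id:
            return self.memory[addr]
        # Remote addresses: hit in the L1.5 if possible...
        if addr in self.l15:
            return self.l15[addr]
        # ...otherwise fetch over the inter-module link and cache it.
        self.remote_fetches += 1
        value = owner.memory[addr]
        self.l15[addr] = value
        return value

m0 = GPUModule(0, {0: "a"})
m1 = GPUModule(1, {100: "b"})

# Three reads of the same remote address cost one link crossing.
for _ in range(3):
    m0.read(100, m1)
print(m0.remote_fetches)  # -> 1
```

This is the data-locality argument in miniature: the extra cache level turns most remote traffic into local hits, which is what makes the split transparent to software.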

July 5, 2017 | 10:14 AM - Posted by Clmentoz (not verified)

Looks like the affordable process-node die shrinks are running out and everybody is having to think about going modular on an MCM or on an interposer (Navi). One thing not discussed when comparing an interposer to an MCM, other than cost differences, is that there will be active silicon interposers in the future with more than just passive traces etched into the interposer's silicon substrate. So with silicon interposers there can also be active circuitry, like whole coherent connection fabrics and their controllers, etched into the interposer's silicon substrate.

Both interposer- and MCM-based modular GPU dies, as well as CPU dies, HBM (interposer only, currently), and FPGAs can be combined into complete CPU/GPU systems, with GPUs and other processors for graphics/FP/DSP acceleration workloads.

There is currently a second round of US exascale research funding, and the first round has funded some of the research linked to below. And this government-funded exascale research will produce IP that will be used in consumer products.

AMD's research for its Exascale APU takes interposers to the next level and also stacks the HBM directly on top of the GPU chiplets. (1)


"Design and Analysis of an APU for Exascale Computing"

July 6, 2017 | 02:48 PM - Posted by RealExascale (not verified)

The Japanese are going to beat the US, China and Europe to real exascale and it will be Fujitsu's ARM CPU based system. They are years ahead of anyone else with PrimeHPC FX100.

July 6, 2017 | 04:59 PM - Posted by Clmentoz (not verified)

Maybe so, but that Fujitsu(2) system will use a custom ARM SKU using the new ARM Scalable Vector Extension (SVE) instructions to get there.

"SVE makes the vector length a hardware choice for the CPU designer, allowing it to be anywhere from 128 bits to 2,048 bits in 128 bit increments.” (1)
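Just enumerating the arithmetic in that quote (nothing here beyond what the sentence states):

```python
# SVE lets the CPU designer pick any vector length from 128 to
# 2048 bits, in 128-bit increments.
sve_lengths = list(range(128, 2048 + 1, 128))
print(sve_lengths)        # 128, 256, ..., 2048
print(len(sve_lengths))   # 16 possible hardware vector widths
```

The same binary runs unchanged on any of those widths, which is what makes the vector length a hardware choice rather than a software one.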

And that Fujitsu/Japan project is also delayed until 2022 or later(3). And all that research/IP that the US government is funding is going to find its way down into consumer products, even if the final exascale systems are a few years late in actually being assembled.


"ARM Puts Some Muscle Into Vector Number Crunching"


"Inside Japan’s Future Exascale ARM Supercomputer"


"Fujitsu's billion-dollar ARM supercomputer delayed by up to 2 years"

July 6, 2017 | 05:08 PM - Posted by RealExascale (not verified)

Yes, and Intel has been talking about making lower core count CPUs for their Knights Landing successor Knights Hill as well. I comment on Next Platform often.

July 5, 2017 | 10:40 AM - Posted by Anonymous DEV (not verified)

Actually, AMD was speaking about "scalability" for Navi back in 2015. With Ryzen's CCX and the Threadripper/Epyc MCM, we know that they will do a similar thing for GPUs, and if you take that date into account AMD should have a scalable product next year. As for the distance between dies - it may play a role in speed, but I doubt they will try to squash the dies too much together. They do MCM because big dies are expensive, and squashing dies really close could also be hard and/or expensive. A coherent fabric, even if spread out a little bit, will not dramatically decrease performance, especially when GPUs have a lot of caches and tilers already need a lot of them, so with Vega they have all the parts in place and tested. Now they just need to work really hard on drivers.
As for Nvidia, it also has a nice interconnect and got some know-how from IBM. Power9 NVLink/CAPI, anyone... it's not hard to guess... The only player that goes a different route is Intel, with still-big dies and mesh interconnects (from Larrabee / Xeon Phi, etc)...
Also take note that AMD's Vega is at least 10-11 months later than planned, most likely due to HBM2 low-yield and high-cost problems, and also to huge changes in drivers/perf-analysis tools, etc. Vega is a stop-gap, but good for devs to learn on and then use the new uArch's potential. I'm also thinking that the time between the Vega release and the Navi release will be really short, like 6-9 months. That's why not so many Vega cards will show up on the market...

July 5, 2017 | 05:29 PM - Posted by BeepBeep2 (not verified)

AMD was indeed speaking about "scalability", and Vega is also the first GPU SoC with Infinity Fabric.

In regards to your "coherent fabric even if spread a little bit will not dramatically decrease performance": that is not proven, because we do not know what the internal bandwidth of Vega is, and the IF interconnect on EPYC is "only" a ~40 GB/s link. That is not nearly fast enough for uniform memory access, or even full memory bandwidth, in an MCM configuration. Zen is set up with IF between CCXs, which carries a latency penalty even between CCXs on the same die.
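To put that ~40 GB/s figure in context, here's the arithmetic against Vega's announced HBM2 bandwidth. The ~484 GB/s number is Vega's published peak; the rest is a simple ratio, not a measurement:

```python
# EPYC's on-package Infinity Fabric links are reported at roughly
# 40 GB/s; Vega's HBM2 delivers roughly 484 GB/s peak. If an MCM GPU
# module had to reach a neighbor's memory over such a link, the
# shortfall would be stark.
if_link_gbs = 40
hbm2_gbs = 484

ratio = hbm2_gbs / if_link_gbs
print(f"HBM2 is ~{ratio:.0f}x faster than one IF link")

# Even ganging several links together leaves a large gap:
for links in (1, 2, 4):
    print(f"{links} link(s): {links * if_link_gbs} GB/s "
          f"({links * if_link_gbs / hbm2_gbs:.0%} of HBM2)")
```

That gap is why a GPU MCM would need either a much fatter on-package fabric than EPYC's, or enough caching to keep most traffic off the links.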

IF is fast, but not fast enough to avert a performance penalty, especially in a GPU - latency is a huge issue when you are rendering a frame every few milliseconds.

Navi is scheduled for 7nm, volume production at GloFo is scheduled for 2H18, so it will be at least 12 months before Navi. My guess is Q4 2018 with Zen 2 shortly after.

July 8, 2017 | 04:33 AM - Posted by James

Latency would not be as big of an issue for a GPU as it is for CPUs. GPUs stream huge amounts of data, so bandwidth is much more important, especially if the design leans towards a tile-based renderer. For a multi-die GPU without a silicon interposer, they could probably use almost the same Infinity Fabric design as Ryzen. I have seen some rumored products that combine a GPU with HBM and a 16-core CPU on an MCM. I think the article I read had four IF links to provide huge amounts of bandwidth between the CPU and GPU. Four links might be able to do 80 to 100 GB/s in each direction.

Also, there isn't anything stopping them from using both a silicon interposer and an MCM in a single device. The GPU can use HBM and then be mounted on an MCM with a CPU or more GPUs. They should be able to connect multiple GPUs together using multiple links. If they don't need single hop latency, then they could essentially chain them using 4 links between each GPU (8 total). Even though it is much more complicated packaging, it may still be cheaper than large monolithic die with very low yields. It could also exceed the compute power of a monolithic die anyway. The amount of bandwidth required should scale with the amount of compute, so I could see a silicon interposer with a small GPU and a single stack of HBM. AMD has already stated that HBM will act as a high bandwidth cache, which may allow the GPU to map memory on other GPUs and maybe address system memory in a cache coherent, unified manner also.

All of it may be able to leverage the exact same Infinity Fabric design currently in Ryzen. The topology Ryzen uses is optimized for latency, though, not bandwidth. The Epyc processor is supposed to use a fully connected topology on package, using three links per die to connect to the other three dies. That probably wouldn't be optimal for a GPU, though; it needs much higher bandwidth. If you have a single-stack 8 GB HBM cache on each GPU, that will also reduce any latency penalty. GPUs are mostly reading data, and a true virtual memory system for GPUs will allow each GPU to bring in only the data that it needs. AMD has indicated that Vega's memory architecture will make much more efficient use of memory capacity. A giant, monolithic die is not the way forward. Nvidia's giant GPU seems like they just took an old GPU architecture and made a really big version of it. Big is not innovation by itself.

July 6, 2017 | 12:35 PM - Posted by Clmentoz (not verified)

AMD's Navi will be for GPUs what AMD's Zeppelin modular-die-based Epyc/Threadripper/Ryzen is for CPUs, with Navi being made up of easier-to-fabricate, higher-yielding smaller discrete GPU dies, all wired up on an interposer (also including HBM2 or HBM3) along with other specialized processors as needed (video codec, sound codec, specialized DSPs, and FPGAs if needed for some workloads). Look at all the Ryzen/Threadripper/Epyc scalable SKUs that AMD was able to create by going modular.

The interposer (silicon based) is required to achieve the needed trace density (thousands to tens of thousands of traces) for HBM2/newer HBM memory and whatever direct Navi-die-to-Navi-die interposer traces are required. Navi will probably have some form of scalable Infinity Fabric updated for GPUs on an interposer, or a specialized fabric topology, that can support the massive amounts of bandwidth and cache-coherency traffic needed by the massively parallel shader/processor units that are standard on GPUs.

Navi may also make use of active silicon interposers rather than the current passive designs, with whole coherent inter-die connection fabrics and their associated controller circuitry etched onto the silicon interposer's substrate.

If you look at any GPU from any maker that utilizes a large monolithic die, you can see that even these monolithic designs are already subdivided into independently functioning logical modular GPU units with their own on-die connection fabrics. This current logical arrangement is ready-made for being divided up into smaller discrete dies (along the lines of the monolithic die's independently functioning modular GPU units).

And the current monolithic GPU design's connection fabric can be moved onto a silicon interposer, with the very same independently functioning modular GPU units (made into separate dies) reattached via micro-bumps on the interposer, or via other methods on an MCM.

With their independently functioning modular unit logic, the current monolithic GPU designs from AMD, Nvidia, and others are already 99% of the way to being engineered into separate dies, with their on-die connection fabrics moved onto a silicon interposer's substrate; it's all silicon, after all.

July 5, 2017 | 02:06 PM - Posted by LeeDoo (not verified)

Nvidia is discussing it, AMD will pioneer it first.
Same old story.

July 6, 2017 | 08:40 AM - Posted by psuedonymous

Just like GPU compute, tiled-rasterisation for desktop-class GPUs, machine learning, custom interconnects, transparent AFR multi-GPU, etc? Oh, wait...

July 6, 2017 | 12:06 PM - Posted by Rocky1234 (not verified)

Tile-based rasterization: done by PowerVR in 1996, yep.
Multi-GPU: done by 3dfx first, then bought by Nvidia, yep.
GPU compute: hmmm, Nvidia still kinda sucks at this; AMD
had to show them how it's done, yep.

July 8, 2017 | 03:11 PM - Posted by nomen (not verified)

you forgot:

T&L, G-Sync, Adaptive Sync, Fast Sync (not stolen by AMD yet!), post-processing AA, GPU virtualization, cloud computing, game streaming, GPU recording, Multi-Res Shading. The biggest thing being, of course, deep learning, in which Nvidia is so far ahead of everyone, with AMD being the latest to the party.

July 5, 2017 | 05:16 PM - Posted by Alamo

Yea, the same concept as AMD's Navi arch; all AMD needs to bring this out is the 7nm node. I really hope GloFo doesn't screw their plans again with a crappy node or a huge delay.
Looking at Ryzen, Polaris, and Vega, I think AMD should have paid the ~300 million fine to GloFo and moved their products to TSMC; they would have been able to clock much higher and saved AMD some of the bad press.

July 6, 2017 | 05:51 AM - Posted by Cyric (not verified)

The technology could be used not only as the article describes, connecting 2-4 GP100/GP102-sized dies together. Like Ryzen, connecting four dies one-quarter the size of GP100 for nearly the same performance would mean much cheaper production with higher yields.

July 6, 2017 | 03:21 PM - Posted by RealExascale (not verified)

The technology would not work with GP100 or GP102, or any existing architecture, because it requires an L1.5 cache, an on-chip fabric, rebalancing the L2 cache, and dedicated hardware to make all of that work transparently to the programs running on it.

This will likely not be seen in consumer GPUs for a while, because they aren't really at the limits of lithography. V100 is infinitely more complex than GP102.

July 10, 2017 | 04:36 AM - Posted by Cyric (not verified)

Yes I completely agree. I referenced GPUs in matters of size not actual architecture.

July 6, 2017 | 05:26 PM - Posted by David Hawkins (not verified)

It's really interesting that Nvidia is looking to go this way as well, as this is exactly where I see AMD going with Navi; it just makes a lot of sense in terms of cost and performance.

AMD has already shown how effective it can be with Ryzen, so the same principle should be easily replicated in GPUs: make one small, low-power die with, say, 2000 SPs, and that lets you use that die for your lowest end all the way to the highest end. Very cost effective.
