Manufacturer: NVIDIA

Performance not two-die four.

When designing an integrated circuit, you are attempting to fit as much complexity as possible within your budget of space, power, and so forth. One harsh limitation for GPUs is that, while your workloads could theoretically benefit from more and more processing units, the number of usable chips from a batch shrinks as designs grow, and the reticle limit of a fab’s manufacturing node is basically a brick wall.

What’s one way around it? Split your design across multiple dies!


NVIDIA published a research paper discussing just that. In their diagram, they show two examples. In the first diagram, the GPU is a single, typical die that’s surrounded by four stacks of HBM, like GP100; the second configuration breaks the GPU into five dies, four GPU modules and an I/O controller, with each GPU module attached to a pair of HBM stacks.

NVIDIA ran simulations to determine how this chip would perform, and, in various workloads, they found that it out-performed the largest possible single-chip GPU by about 45.5%. They scaled up the single-chip design until it had the same amount of compute units as the multi-die design, even though this wouldn’t work in the real world because no fab could actual lithograph it. Regardless, that hypothetical, impossible design was only ~10% faster than the actually-possible multi-chip one, showing that the overhead of splitting the design is only around that much, according to their simulation. It was also faster than the multi-card equivalent by 26.8%.

While NVIDIA’s simulations, run on 48 different benchmarks, have accounted for this, I still can’t visualize how this would work in an automated way. I don’t know how the design would automatically account for fetching data that’s associated with other GPU modules, as this would probably be a huge stall. That said, they spent quite a bit of time discussing how much bandwidth is required within the package, and figures of 768 GB/s to 3TB/s were mentioned, so it’s possible that it’s just the same tricks as fetching from global memory. The paper touches on the topic several times, but I didn’t really see anything explicit about what they were doing.


If you’ve been following the site over the last couple of months, you’ll note that this is basically the same as AMD is doing with Threadripper and EPYC. The main difference is that CPU cores are isolated, so sharing data between them is explicit. In fact, when that product was announced, I thought, “Huh, that would be cool for GPUs. I wonder if it’s possible, or if it would just end up being Crossfire / SLI.”

Apparently not? It should be possible?

I should note that I doubt this will be relevant for consumers. The GPU is the most expensive part of a graphics card. While the thought of four GP102-level chips working together sounds great for 4K (which is 4x1080p in resolution) gaming, quadrupling the expensive part sounds like a giant price-tag. That said, the market of GP100 (and the upcoming GV100) would pay five-plus digits for the absolute fastest compute device for deep-learning, scientific research, and so forth.

The only way I could see this working for gamers is if NVIDIA finds the sweet-spot for performance-to-yield (for a given node and time) and they scale their product stack with multiples of that. In that case, it might be cost-advantageous to hit some level of performance, versus trying to do it with a single, giant chip.

This is just my speculation, however. It’ll be interesting to see where this goes, whenever it does.

Ryzen Locking on Certain FMA3 Workloads

Subject: Processors | March 15, 2017 - 05:51 PM |
Tagged: ryzen, Infinity Fabric, hwbot, FMA3, Control Fabric, bug, amd, AM4

Last week a thread was started at the HWBOT forum and discussed a certain workload that resulted in a hard lock every time it was run.  This was tested with a variety of motherboards and Ryzen processors from the 1700 to the 1800X.  In no circumstance at default power and clock settings did the processor not lock from the samples that they have worked on, as well as products that contributors have been able to test themselves.


This is quite reminiscent of the Coppermine based Pentium III 1133 MHz processor from Intel which failed in one specific workload (compiling).  Intel had shipped a limited number of these CPUs at that time, and it was Kyle from HardOCP and Tom from Tom’s Hardware that were the first to show this behavior in a repeatable environment.  Intel stopped shipping these models and had to wait til the Tualatin version of the Pentium III to be released to achieve that speed (and above) and be stable in all workloads.

The interesting thing about this FMA3 finding is that it is seen to not be present in some overclocked Ryzen chips.  To me this indicates that it could be a power delivery issue with the chip.  A particular workload that heavily leans upon the FPU could require more power than the chip’s Control Fabric can deliver, therefore causing a hard lock.  Several tested overclocked chips with much more power being pushed to them seems as though enough power is being applied to the specific area of the chip to allow the operation to be completed successfully.

This particular fact implies to me that AMD does not necessarily have a bug such as what Intel had with the infamous F-Div issue with the original Pentium, or AMD’s issue with the B2 stepping of Phenom.  AMD has a very complex voltage control system that is controlled by the Control Fabric portion of the Infinity Fabric.  With a potential firmware or microcode update this could be a fixable problem.  If this is the case, then AMD would simply increase power being supplied to the FPU/SIMD/SSE portion of the Ryzen cores.  This may come at a cost through lower burst speeds to keep TDP within their stated envelope.

A source at AMD has confirmed this issue and that a fix will be provided via motherboard firmware update.  More than likely this comes in the form of an updated AGESA protocol.

Source: HWBOT Forums