AMD Shows Off Zen 2-Based EPYC "Rome" Server Processor

Subject: Processors | November 7, 2018 - 11:00 PM |
Tagged: Zen 2, rome, PCI-e 4, Infinity Fabric, EPYC, ddr4, amd, 7nm

In addition to AMD's reveal of the 7nm GPUs used in its Radeon Instinct MI60 and MI50 graphics cards (aimed at machine learning and other HPC acceleration), the company teased a few morsels of information about its 7nm CPUs. Specifically, AMD showed attendees of its Next Horizon event information on its 7nm "Rome" EPYC processors based on the new Zen 2 architecture.

AMD EPYC Rome Zen 2.jpg

Tom's Hardware spotted the upcoming EPYC processor at AMD's Next Horizon event.

The codenamed "Rome" EPYC processors will utilize an MCM design like their EPYC and Threadripper predecessors, but the number of CPU dies increases from four to eight (each chiplet containing eight cores split across two CCXs), and a new 14nm I/O die sits in the center of the processor, consolidating memory and I/O channels to even out latency among the cores of the various dies. This new approach allows each chip to directly access up to eight channels of DDR4 memory (up to 4TB) without having to send requests through neighboring dies that own the memory controllers, as was the case with, for example, Threadripper 2. TechPowerUp speculates that the I/O die is also responsible for other I/O duties such as PCIe 4.0 and the PCH communication previously integrated into each die.

"Rome" EPYC processors with up to 64 cores (128 threads) are expected to launch next year, with AMD already sampling processors to its biggest enterprise clients. The new Zen 2-based processors should work with existing Naples and future Milan server platforms. Rome will feature from four to eight 7nm Zen 2 dies connected via Infinity Fabric to a 14nm I/O die.

AMD Lisa Su Holding Rome EPYC Zen 2 CPU.png

AMD CEO Lisa Su holding up a "Rome" EPYC CPU during a press conference earlier this year.

The new 7nm Zen 2 CPU dies are much smaller than the dies of previous generation parts (even 12nm Zen+). AMD has not provided full details on the changes it has made with the new Zen 2 architecture, but it has apparently heavily tweaked front-end operations (branch prediction, pre-fetching) and increased cache sizes, as well as doubling the width of the FPUs to 256-bit. The architectural improvements along with the die shrink should allow AMD to show off some respectable IPC improvements, and I am interested to see the details of how Zen 2 will shake out.
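To illustrate why the wider FPU matters: Zen 1 cracked 256-bit AVX2 operations into two 128-bit micro-ops, while a native 256-bit datapath can execute them in a single pass. Below is a minimal sketch (function name and values are my own, not from AMD) of the kind of loop that benefits:

```c
#include <stddef.h>

/* Classic saxpy loop: y[i] += a * x[i]. Compiled with -O3 -mfma for an
   AVX2 target, compilers typically vectorize this into 256-bit vfmadd
   operations, 8 floats per instruction. Zen 1 splits each such op into
   two 128-bit halves; a native 256-bit FPU would not need to. */
void saxpy(float *y, const float *x, float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

The loop body itself is unchanged between the two architectures; only the number of micro-ops per vector instruction differs, which is where the throughput gain would come from.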

Also read:

Intel unveils Xeon Cascade Lake Advanced Performance Platform

Subject: Processors | November 5, 2018 - 02:00 AM |
Tagged: xeon e-2100, xeon, MCP, Intel, Infinity Fabric, EPYC, cxl-ap, cascade lake, amd, advanced performance

Ahead of the Supercomputing conference next week, Intel has announced a new market segment for Xeons called Cascade Lake Advanced Performance (CXL-AP). This represents a new, higher core count option in the Xeon Scalable family, which currently tops out at 28 cores.

cxl-ap.png

Through the use of a multi-chip package (MCP), Intel will now be able to offer up to 48 cores, with 12 DDR4 memory channels per socket. Cascade Lake AP is being targeted at dual-socket systems, bringing the total core count up to 96.

UPI.jpg

Intel's Ultra Path Interconnect (UPI), introduced in Skylake-SP for multi-socket communication, is used both to connect the dies within the MCP on a single processor and to link the two processors in a 2S configuration.

Given the relative amount of shade that Intel has thrown towards AMD's multi-die design with Epyc, calling it "glued-together," this move to an MCP for a high-end Xeon offering will garner some attention.

When asked about this, Intel says that the issues it previously pointed out aren't inherent to a multi-die design, but rather stem from the quality of the interconnect. By utilizing UPI for the interconnect, Intel claims its MCP design will provide performance consistency not found in other solutions. The company was also quick to point out that this is not its first Xeon design utilizing multiple dies in a package.

Intel provided some performance claims against the current 32-core Epyc 7601: up to 3.4x greater performance in Linpack, and up to 1.3x in STREAM Triad.
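For reference, STREAM Triad is a memory-bandwidth benchmark rather than a compute test, which helps explain why the gap there is much smaller than in the FLOP-heavy Linpack. The kernel is essentially the loop below (a sketch of the well-known kernel, not the official STREAM source):

```c
#include <stddef.h>

/* STREAM Triad kernel: a[i] = b[i] + q * c[i]. Each element moves three
   doubles for only two flops, so sustained DRAM bandwidth, not core
   count, dominates the score. */
void triad(double *a, const double *b, const double *c, double q, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}
```

With 12 memory channels per socket against Epyc's eight, a bandwidth edge for Cascade Lake AP is plausible, which fits the more modest 1.3x Triad claim.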

As usual, whether or not these claims hold up will come down to external testing once people have these new Cascade Lake AP processors in hand, which is set to happen in the first half of 2019.

More details on the entire Cascade Lake family, including Cascade Lake AP, are set to come at next week's Supercomputing conference, so stay tuned for more information as it becomes available!

Source: Intel
Manufacturer: NVIDIA

Performance not two-die four.

When designing an integrated circuit, you are attempting to fit as much complexity as possible within your budget of space, power, and so forth. One harsh limitation for GPUs is that, while your workloads could theoretically benefit from more and more processing units, the number of usable chips from a batch shrinks as designs grow, and the reticle limit of a fab’s manufacturing node is basically a brick wall.

What’s one way around it? Split your design across multiple dies!

nvidia-2017-multidie.png

NVIDIA published a research paper discussing just that. Their diagram shows two configurations. In the first, the GPU is a single, typical die surrounded by four stacks of HBM, like GP100; the second configuration breaks the GPU into five dies, four GPU modules and an I/O controller, with each GPU module attached to a pair of HBM stacks.

NVIDIA ran simulations to determine how this chip would perform and, across various workloads, found that it out-performed the largest possible single-chip GPU by about 45.5%. They also scaled up the single-chip design until it had the same number of compute units as the multi-die design, even though this wouldn’t work in the real world because no fab could actually lithograph it. That hypothetical, impossible design was only ~10% faster than the actually-possible multi-chip one, showing that the overhead of splitting the design is only around that much, according to their simulation. The multi-die design was also faster than the multi-card equivalent by 26.8%.

While NVIDIA’s simulations, run on 48 different benchmarks, have accounted for this, I still can’t visualize how it would work in an automated way. I don’t know how the design would automatically account for fetching data that belongs to other GPU modules, as this would probably be a huge stall. That said, they spent quite a bit of time discussing how much bandwidth is required within the package, and figures of 768 GB/s to 3 TB/s were mentioned, so it’s possible that it’s just the same tricks as fetching from global memory. The paper touches on the topic several times, but I didn’t see anything explicit about what they were doing.

amd-2017-epyc-breakdown.jpg

If you’ve been following the site over the last couple of months, you’ll note that this is basically the same approach AMD is taking with Threadripper and EPYC. The main difference is that CPU cores are isolated, so sharing data between them is explicit. In fact, when that product was announced, I thought, “Huh, that would be cool for GPUs. I wonder if it’s possible, or if it would just end up being CrossFire / SLI.”

Apparently not? It should be possible?

I should note that I doubt this will be relevant for consumers. The GPU is the most expensive part of a graphics card. While the thought of four GP102-level chips working together sounds great for 4K gaming (4K is four times 1080p in resolution), quadrupling the expensive part sounds like a giant price tag. That said, the market for GP100 (and the upcoming GV100) would pay five-plus digits for the absolute fastest compute device for deep learning, scientific research, and so forth.

The only way I could see this working for gamers is if NVIDIA finds the sweet-spot for performance-to-yield (for a given node and time) and they scale their product stack with multiples of that. In that case, it might be cost-advantageous to hit some level of performance, versus trying to do it with a single, giant chip.

This is just my speculation, however. It’ll be interesting to see where this goes, whenever it does.

Ryzen Locking on Certain FMA3 Workloads

Subject: Processors | March 15, 2017 - 05:51 PM |
Tagged: ryzen, Infinity Fabric, hwbot, FMA3, Control Fabric, bug, amd, AM4

Last week a thread was started at the HWBOT forum discussing a certain workload that resulted in a hard lock every time it was run.  This was tested with a variety of motherboards and Ryzen processors from the 1700 to the 1800X.  At default power and clock settings, the processor locked in every case, both on the samples the forum members worked with and on chips that contributors were able to test themselves.

ryzen.jpg

This is quite reminiscent of the Coppermine-based Pentium III 1133 MHz processor from Intel, which failed in one specific workload (compiling).  Intel had shipped a limited number of these CPUs at the time, and it was Kyle from HardOCP and Tom from Tom’s Hardware who were the first to show this behavior in a repeatable environment.  Intel stopped shipping these models and had to wait until the Tualatin version of the Pentium III to reach that speed (and above) while remaining stable in all workloads.

The interesting thing about this FMA3 finding is that it does not appear on some overclocked Ryzen chips.  To me this indicates that it could be a power delivery issue within the chip.  A particular workload that leans heavily on the FPU could require more power than the chip’s Control Fabric can deliver, causing a hard lock.  On several of the overclocked chips tested, with much more power being pushed to them, it seems enough power reaches the affected area of the chip for the operation to complete successfully.

This implies to me that AMD does not necessarily have a hardware bug such as Intel’s infamous FDIV issue with the original Pentium, or AMD’s own TLB issue with the B2 stepping of Phenom.  AMD has a very complex voltage control system managed by the Control Fabric portion of the Infinity Fabric.  With a firmware or microcode update this could be a fixable problem.  If so, AMD would simply increase the power supplied to the FPU/SIMD/SSE portion of the Ryzen cores.  This may come at the cost of lower burst speeds to keep TDP within the stated envelope.

A source at AMD has confirmed the issue and says a fix will be provided via motherboard firmware update, more than likely in the form of an updated AGESA version.

Source: HWBOT Forums