AMD Shows Off Zen 2-Based EPYC "Rome" Server Processor

Subject: Processors | November 7, 2018 - 11:00 PM |
Tagged: Zen 2, rome, PCI-e 4, Infinity Fabric, EPYC, ddr4, amd, 7nm

In addition to AMD's reveal of 7nm GPUs used in its Radeon Instinct MI60 and MI50 graphics cards (aimed at machine learning and other HPC acceleration), the company teased a few morsels of information on its 7nm CPUs. Specifically, AMD teased attendees of its New Horizon event with information on its 7nm "Rome" EPYC processors based on the new Zen 2 architecture.

Tom's Hardware spotted the upcoming Epyc processor at AMD's New Horizon event.

The "Rome" EPYC processors will use an MCM design like their EPYC and Threadripper predecessors, but increase the number of CPU dies from four to eight (with each chiplet containing eight cores with two CCXs) and add a new 14nm I/O die that sits in the center of the processor. The I/O die consolidates the memory and I/O channels to help even out latency among the cores on the various dies. This new approach allows each CPU die to reach up to eight channels of DDR4 memory (up to 4TB) through the I/O die, so it no longer has to send requests to neighboring dies that have the memory attached, as was the case with, for example, Threadripper 2. TechPowerUp speculates that the I/O die is also responsible for other I/O duties such as PCI-E 4.0 and the PCH communication duties previously integrated into each die.

"Rome" EPYC processors with up to 64 cores (128 threads) are expected to launch next year, with AMD already sampling processors to its biggest enterprise clients. The new Zen 2-based processors should work with existing Naples and future Milan server platforms. Rome EPYC parts will feature from four to eight 7nm Zen 2 dies connected via Infinity Fabric to a 14nm I/O die.

AMD CEO Lisa Su holding up a "Rome" EPYC CPU during a press conference earlier this year.

The new 7nm Zen 2 CPU dies are much smaller than the dies of previous generation parts (even 12nm Zen+). AMD has not provided full details on the changes it has made with the new Zen 2 architecture, but it has apparently heavily tweaked the front end operations (branch prediction, pre-fetching), increased cache sizes, and doubled the width of the FPUs to 256-bit. The architectural improvements along with the die shrink should allow AMD to show off some respectable IPC improvements, and I am interested to see the details and how Zen 2 will shake out.

November 8, 2018 | 01:43 AM - Posted by James

“... with each chiplet containing eight cores with two CCXs”

Has that been confirmed? I haven’t seen any information to really tell us whether it is a single 8-core CCX or two 4-core CCXs. A single 8-core CCX seems more likely. I don’t think the cpu chiplets have an infinity fabric switch on die, but I could be wrong.

Will be interesting to see what is going on in that IO die. The on package links do not take up much die area on the Zeppelin die. This thing has 8 on package links instead of the 4 that a Zeppelin die has. They should be at much higher speeds though. The external links (IO and socket to socket) will have pci-e 4.0 speeds.

For off package they will only need to have 8 pci-e 4.0 x16 links (128 lanes total). That is still a lot for a single chip, but it is a big die. This presumably works similar to Epyc 1. When in single socket, all will be pci-e IO. When in a dual socket system, 4 of the links from each socket will be used for socket to socket communication, with 4 left from each to still supply 128 pci-e lanes. So, it must have 8 on package links for chiplets (I believe these are lower clocked 32-bit links on Epyc 1), 8 x16 pci-e 4.0 links, and 8 memory channels. That isn’t quite as massive as I first thought, but it still is a massive amount of IO and a massive switch. I am still wondering if there was sufficient space for some L4 cache. It might be pad limited with that many IO pads taking up space.
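
As a quick sanity check on that lane count, here is a minimal Python sketch. It assumes, as the comment does, eight x16 links per socket, with four per socket reassigned to socket to socket traffic in a dual socket system. The function and constant names are hypothetical, and none of these figures were confirmed by AMD at the time.

```python
# Lane budget sketch for the I/O die, using the comment's assumed figures.
LANES_PER_LINK = 16    # each off-package link is x16
LINKS_PER_SOCKET = 8   # assumed number of off-package x16 links per socket


def pcie_lanes(sockets: int, links_for_socket_to_socket: int) -> int:
    """Total lanes left for PCIe I/O across all sockets in the system."""
    io_links = LINKS_PER_SOCKET - links_for_socket_to_socket
    return sockets * io_links * LANES_PER_LINK


single = pcie_lanes(sockets=1, links_for_socket_to_socket=0)
dual = pcie_lanes(sockets=2, links_for_socket_to_socket=4)
print(single, dual)  # prints: 128 128
```

Under these assumptions both the single socket and the dual socket system end up with the same 128 lanes of PCIe I/O, which is the point the comment is making.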

It will be interesting to see what they do for the consumer parts. Will they use the same chiplets and have a two die low-end part (chiplet plus tiny IO die)? What will they do for Threadripper parts? It is also unclear whether they will make parts with a lower number of chiplets or with a lower number of cores per chiplet. Four, six, and 8 core chiplets could be used to make 32, 48, and 64 core parts. This makes the best use of the hardware, but they may have Epyc parts that had issues with the packaging process such that they have to disable die. A 32-core part with 4 active chiplets would be exceptional still. They may also have salvage IO die where some links are not functional. The original Threadripper sounded like it was a last minute project and they are obviously very focused on enterprise, so who knows what they will do for Threadripper 3000.
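
The core counts mentioned above follow from simple multiplication. The short Python sketch below just enumerates them, assuming eight chiplet positions per package and fully symmetric configurations; the names are hypothetical and the product lineup itself is speculation.

```python
# Core-count sketch: total cores = active chiplets x enabled cores per chiplet.
CHIPLETS_PER_PACKAGE = 8  # assumed from the "Rome" package shown by AMD


def total_cores(active_chiplets: int, cores_per_chiplet: int) -> int:
    return active_chiplets * cores_per_chiplet


# All eight chiplets present, with 4, 6, or 8 cores enabled on each.
full_package = {c: total_cores(CHIPLETS_PER_PACKAGE, c) for c in (4, 6, 8)}

# Alternative: only four active chiplets, each with all 8 cores enabled.
half_populated = total_cores(4, 8)

print(full_package, half_populated)
```

Both routes reach a 32-core part: eight chiplets with four cores enabled each, or four fully enabled chiplets.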

November 8, 2018 | 11:30 AM - Posted by SomeIPBlocksWillStaySomeWillBeMoved (not verified)

Read this Wikichip Fuse article (1). Each Zen2 Die/Chiplet will still have to have a SDF(Scalable Control Fabric) and the SDF(Scalable Data Fabric), which make up the Infinity Fabric. Look at the CCM(Cache Coherent Master): those will still be needed if there are CCX units on the Zen2 Die/Chiplets, and maybe even if not. And then there is the CAKE(Coherent AMD socKet Extender), the IOMS that connects up to the I/O Hub, then the UMC(Unified Memory Controller), etc. And many of these functional blocks from the First Generation Zen/Zeppelin Full SOC die will be moved over to the Large Centralized I/O Die on Zen-2/Epyc-2's new layout.

So the new Zen-2 micro-arch based Die/Chiplets are still going to have an SDF and SCF, which are the Infinity Fabric, and the related CAKEs to connect the Infinity Fabric from a CAKE on the Zen-2 Die/Chiplet to a CAKE on the Large Central I/O Die. The UMC is going to be moved onto the Central I/O Die, and most likely so will all the PCIe, SATA and other I/O HUB related IP that will not be included on the Zen-2 Die/Chiplets for the new Zen-2/Epyc-2 MCM Layout.

It's easy to see what IP may have to remain on the Zen-2 Die/Chiplet and what IP will be moved to the Central I/O die but what is hard to figure out is if the Zen-2 Die/Chiplets will be broken up into 2 CCX units of 4 cores each that share an L3 cache.

The CAKEs and the IFOP(Infinity Fabric On Package) IP will be on both the Zen-2 Die/Chiplet side and the I/O Central Die side so that there can be Infinity Fabric traffic between the respective Zen-2 Die/Chiplet and the Central I/O die. And the IFIS(Infinity Fabric Inter Socket) will probably be moved to the Central I/O Die.

It really all depends on what AMD will be using for any Zen-2 based Consumer Chips, which may retain the same full SOC design as was used on 1st and 2nd generation Ryzen designs.
Or will all of AMD's consumer and server lines make use of the newer layout with Zen-2 Non Fully SOC Die/Chiplets surrounding a central I/O die? Not enough is known at this point in time to answer all the questions that need to be answered.

Then there is the speculation that the Central I/O die looks a little too big and maybe there is an L4 cache on there also to help improve latency on memory requests. And as far as the Infinity Fabric is concerned, with that Big I/O die there is plenty of room to add the extra circuitry needed to fully decouple the Infinity Fabric's clock domain from the Memory Controller's Clock domain if desired.

(1)

"ISSCC 2018: AMD’s Zeppelin; Multi-chip routing and packaging"

https://fuse.wikichip.org/news/1064/isscc-2018-amds-zeppelin-multi-chip-...

November 8, 2018 | 11:33 AM - Posted by SomeIPBlocksWillStaySomeWillBeMoved (not verified)

Edit: SDF(Scalable Control Fabric)

To: SCF(Scalable Control Fabric)

November 8, 2018 | 10:17 PM - Posted by James

It all depends on whether the cpu chiplet is 2 CCX or one CCX. If it is a single 8 core CCX, then there is no need for an infinity fabric switch on the cpu chiplet. There is nothing to switch; it just has the one connection. This might be a good way to go to reduce latency.

For a single Zen die, a local memory access looks like this (everything on die):

Memory controller <—-> Fabric switch <—-> CCX

For “1 hop” it is (<===> is off die connection):

Memory controller <—-> Fabric switch <===> Fabric switch <—-> CCX

You have a max of 1 hop in a single socket Epyc. For Zen 2, they are all connected with a direct link to the IO die. If there isn’t a fabric switch on the cpu chiplet, then all memory accesses for all die look like this:

Memory controller <—-> Fabric switch <===> CCX (on chiplet)

This looks more like a single die local access except one connection is an off die link. This should reduce overall latency in Epyc 2 significantly. Zen 1 has a 256-bit crossbar switch on each die. This is to handle a 128-bit ddr memory controller (2 channel) at full bandwidth. The die to die interconnections are also scaled to handle similar bandwidth (narrower interfaces, but much higher clock). This basically limits any one core to the bandwidth provided by two memory channels. Zen 2 needs more bandwidth with the doubled AVX throughput. Because of this, I would expect the crossbar will be at least doubled in width to 512-bit or double clocked. This will also be necessary to manage bandwidth from pci-e 4.0 link speeds. Going to 512-bit or double clocked means that a single core should be able to stream the full bandwidth of 4 memory channels rather than just 2.
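
The three access paths drawn earlier in this comment can be compared by counting the fabric switches and off die links on each one. The Python sketch below does only that; it assigns no latency numbers, the Zen 2 path assumes there is no fabric switch on the chiplet (which is unconfirmed), and all names are hypothetical.

```python
# Each path is (nodes, links); link types follow the diagrams in the comment.
ON_DIE, OFF_DIE = "on-die", "off-die"

PATHS = {
    "Zen 1, local": (
        ["Memory controller", "Fabric switch", "CCX"],
        [ON_DIE, ON_DIE],
    ),
    "Zen 1, 1 hop": (
        ["Memory controller", "Fabric switch", "Fabric switch", "CCX"],
        [ON_DIE, OFF_DIE, ON_DIE],
    ),
    "Zen 2, no switch on chiplet (assumed)": (
        ["Memory controller", "Fabric switch", "CCX"],
        [ON_DIE, OFF_DIE],
    ),
}


def summarize(nodes, links):
    """Return (fabric switches traversed, off-die links crossed)."""
    return nodes.count("Fabric switch"), links.count(OFF_DIE)


SUMMARY = {name: summarize(*path) for name, path in PATHS.items()}
for name, (switches, off_die) in SUMMARY.items():
    print(f"{name}: {switches} switch(es), {off_die} off-die link(s)")
```

Counting stages is not the same as measuring latency, but it shows why the assumed Zen 2 path resembles a Zen 1 local access with one link moved off die, rather than a Zen 1 "1 hop" access.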

If they do have a fabric switch on the cpu chiplet, then the latency could still be very low due to the much higher clock speeds. Pci-e 4.0 links will clock significantly higher than the previous generation. The latency starts getting very low at such high clocks.

I thought the fabric switch on the IO die would be massive. It has 8 links to the chiplets, 8 x16 pci-e 4.0 links, and 8 ddr4 memory channels. The memory controllers may be grouped into 2 groups of 4 channels though. That would be at least 18 ports, while the one in Zen already had 10 or 11, depending on how the memory controller was grouped. For dual socket, 4 of the 8 x16 links get used for socket to socket communication (it still uses the same socket as Epyc 1), which still leaves 4 x16 links from each socket for IO (8 x16 total). While that is still massive if it is a 512-bit switch, it actually isn’t quite as outlandish as I thought.
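
The "at least 18 ports" figure can be reproduced with a small Python sketch. It assumes one switch port per chiplet link, one per x16 pci-e link, and one per memory controller group; the function name and this one-port-per-link model are illustrative assumptions, not anything AMD has described.

```python
# Port-count sketch for the speculated fabric switch on the IO die.
def switch_ports(chiplet_links: int = 8, pcie_links: int = 8,
                 memory_ports: int = 8) -> int:
    """Ports = chiplet links + x16 pci-e links + memory controller ports."""
    return chiplet_links + pcie_links + memory_ports


grouped = switch_ports(memory_ports=2)    # two groups of four channels
ungrouped = switch_ports(memory_ports=8)  # one port per memory channel
print(grouped, ungrouped)  # prints: 18 24
```

Grouping the memory channels gives the lower bound of 18; a port per channel would push the count to 24.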

November 9, 2018 | 12:00 PM - Posted by SerdesTheSpeedDemon (not verified)

The SERDES links are already many times faster than the PCIe standards, with GF having some SERDES at 14nm that offer 56Gbps speeds(1). So a bit more data can be pushed out on a MCM if needed. And there are others that offer 56Gbps SERDES IP. So the CAKEs can be made to move loads more data around the various chips on the MCM, more than enough to supply the bandwidth for PCIe 4.0. So maybe the switch does not need to be as wide, depending on the available space and layers on that MCM's organic substrate.

Depending on the actual topology maybe there can even be some direct CPU Die/Chiplet to CPU Die/Chiplet links for cache coherency traffic but that will have to wait for the deep dives into Zen-2 Epyc/Rome's actual MCM fabric topology.

An L4 cache on that I/O die would be about the best news as far as an overall ability to hide Latency by having more of the program code residing on L4 if pushed out of the L3 cache level and the levels above(L1, L2).

(1)

"GLOBALFOUNDRIES Demonstrates Industry-Leading 56Gbps Long-Reach SerDes on Advanced 14nm FinFET Process Technology"

https://www.globalfoundries.com/news-events/press-releases/globalfoundri...

November 12, 2018 | 01:45 PM - Posted by SomeNewNewsOnEpyc2s14nmInputOutputDieFabPartner (not verified)

It's also been confirmed by AMD's Mark Papermaster that the I/O die on Epyc/Rome is fabbed at GlobalFoundries(1).

"
IC: Can you confirm where the parts of Rome are manufactured?

MP: Chiplets on TSMC 7nm, the IO die is on GlobalFoundries 14nm. " (1)

So for the GF fabbed 14nm I/O die to make use of, there is plenty of fast SerDes IP already Certified/Vetted on GF's 14nm process that's licensed from Samsung. I'd also expect that GF has tweaked that process for better power usage metrics, in a similar manner to what GF did for its 12nm process. 14nm is more mature and most of the different SerDes IP providers already have their respective SerDes IP vetted on that 14nm process. TSMC's 7nm is still new, so the SerDes IP providers like Synopsys/GF/Others will be in the first stages of workable SerDes IP at 7nm.

So high speed SerDes is switching from analog to digital at 7nm for performance and power usage metrics gains. [See Video in reference 2]

"eSilicon’s David Axelrad discusses the challenges with 56Gbps and 112Gps SerDes, and why the switch from analog to digital is required for performance and low power." (2)

(1)

"
Naples, Rome, Milan, Zen 4: An Interview with AMD CTO, Mark Papermaster

by Ian Cutress on November 12, 2018 9:15 AM EST"

https://www.anandtech.com/show/13578/naples-rome-milan-zen-4-an-intervie...

(2)

"High-Speed SerDes At 7nm"

https://semiengineering.com/high-speed-serdes-at-7nm/

November 8, 2018 | 11:07 AM - Posted by Prodeous@Work (not verified)

As all of the CPUs will be based off the same 7nm 8 core module, plus a slightly tweaked 14nm I/O module, cost wise for AMD this is simply perfect, if the performance is there. But seeing what they did with Zen, I feel confident they will provide.

The quick benchmark against dual xeon 8180 was a nice touch.

Still not sure why they kept the I/O at 14nm. With a 7nm node, or even something in-between, they would have even more space under that heatspreader, allowing them to push even more cores :) like another 2 modules for an 80 core 160 thread beast :) one can hope.

I'm very eager to see how they adapt this to Ryzen and especially Threadripper. As I already have a 1950x, this might be a perfect upgrade.

Either way AMD, thank you very much for being competitive with Zen, Zen+ and soon Zen2. We needed this for so long.

Now do the same with GPUs :)

November 8, 2018 | 12:05 PM - Posted by Anonymously Anonymous (not verified)

I think they want all the fab time for 7nm to be devoted to just the CPUs and GPUs (fab time is most likely very limited), so they just fab the most important stuff at 7nm and everything else gets done at other nodes.
With that said, once 7nm production has ramped up and matured, I tend to agree that the I/O chip will get reduced in size and we may yet see another ramp up in core count.

November 8, 2018 | 02:13 PM - Posted by Drazen (not verified)

Well, 7nm is nice for pure digital, but IO has some analog, which might be a problem at 7nm. E.g. ESD protection, clamping diodes, etc. need to be more powerful and handle higher currents and voltages.
Military, space and other exotics are still 40/100 nm or even bigger.

November 8, 2018 | 10:26 PM - Posted by James

It is unclear whether they will use the same chiplet for the consumer part. They may make another die with the 8 cores and the IO portions on a single die all at 7 nm. They need to make a good mobile part and a single low power 7 nm die would be the way to do that. That cuts out all of the unnecessary stuff that is in Zen 1. A lot of the silicon area in Zen 1 is unused unless it is in a ThreadRipper or Epyc part. They could make a highly optimized single die part now. It would be great to have 8-core plus integrated gpu and/or an HBM gpu for mobile. They could make something similar to the part Intel has with an integrated AMD gpu.
