Intel Xeon Scalable Processor Launch - New Architecture, New Platform for Data Center

Author:
Subject: Processors
Manufacturer: Intel

A massive lineup

The number and significance of the product and platform launches occurring today with the Intel Xeon Scalable family are staggering. Intel is launching more than 50 processors and 7 chipsets under the Xeon Scalable product brand, targeting data centers and enterprise customers in a wide range of markets and segments. From SMB users to “Super 7” data center clients, the new lineup of Xeon parts is likely to have an option targeting them.

All of this comes at an important point in time, with AMD fielding its new EPYC family of processors and platforms and becoming competitive in the space for the first time in nearly a decade. That decade of clear dominance in the data center has been good to Intel, giving it the ability to bring in profits and high margins without the direct fear of a strong competitor. Intel did not spend those 10 years flat-footed though; instead it has been developing complementary technologies including new Ethernet controllers, ASICs, Omni-Path, FPGAs, solid state storage tech and much more.

Our story today will give you an overview of the new processors and the changes that Intel’s latest Xeon architecture offers to business customers. The Skylake-SP core has some significant upgrades over the Broadwell design before it, but in other aspects the processors and platforms will be quite similar. What changes can you expect with the new Xeon family?

Per-core performance has been improved with the updated Skylake-SP microarchitecture and a new cache memory hierarchy that we previewed with the Skylake-X consumer release last month. The memory and PCIe interfaces have been upgraded with more channels and more lanes, giving the platform more flexibility for expansion. Socket-level performance also goes up with higher core counts available and the improved UPI interface that makes socket-to-socket communication more efficient. AVX-512 doubles the peak FLOPS/clock on Skylake over Broadwell, which is beneficial for HPC and analytics workloads. Intel QuickAssist improves cryptography and compression performance to allow for faster connectivity implementation. Security and agility get an upgrade as well with Boot Guard, RunSure, and VMD for better NVMe storage management. While on the surface this is a simple upgrade, a lot gets improved under the hood.

We already had a good look at the new mesh architecture used for the inter-core component communication. This transition away from the ring bus that was in use since Nehalem gives Skylake-SP a couple of unique traits: slightly longer latencies but with more consistency and room for expansion to higher core counts.
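
As a rough illustration of the "room for expansion" point, the sketch below compares average hop counts on a single bidirectional ring and on a 2D mesh. It is a toy topology model only: the single 28-stop ring and the 4x7 grid are hypothetical layouts chosen for illustration, not Intel's actual die floorplans, and hop count is not the same thing as measured latency.

```python
from itertools import product

def ring_avg_hops(n):
    """Average shortest-path hops between distinct stops on a bidirectional ring."""
    total = sum(min(d, n - d) for d in range(1, n))
    return total / (n - 1)

def mesh_avg_hops(rows, cols):
    """Average Manhattan-distance hops between distinct nodes on a rows x cols mesh."""
    nodes = list(product(range(rows), range(cols)))
    total = 0
    pairs = 0
    for (r1, c1), (r2, c2) in product(nodes, nodes):
        if (r1, c1) != (r2, c2):
            total += abs(r1 - r2) + abs(c1 - c2)
            pairs += 1
    return total / pairs

# Hypothetical 28-node layouts, for illustration only.
print(f"ring of 28: {ring_avg_hops(28):.2f} hops on average")
print(f"4x7 mesh:   {mesh_avg_hops(4, 7):.2f} hops on average")
```

In this model the ring's average hop count grows linearly with the number of stops, while the mesh's grows with the grid dimensions, which is one way to see why a mesh scales more gracefully as core counts rise.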

Intel has changed the naming scheme with the Xeon Scalable release, moving away from “E5/E7” and “v4” to a Platinum, Gold, Silver, Bronze nomenclature. The product differentiation remains much the same, with the Platinum processors offering the highest feature support, including 8-socket configurations, the highest core counts, the highest memory speeds, connectivity options and more. To be clear: there are a lot of new processors, and trying to create an easy-to-read table of features and clocks is nearly impossible. The highlights of the different families are:

  • Xeon Platinum (81xx)
    • Up to 28 cores
    • Up to 8 sockets
    • Up to 3 UPI links
    • 6-channel DDR4-2666
    • Up to 1.5TB of memory
    • 48 lanes of PCIe 3.0
    • AVX-512 with 2 FMA per core
  • Xeon Gold (61xx)
    • Up to 22 cores
    • Up to 4 sockets
    • Up to 3 UPI links
    • 6-channel DDR4-2666
    • AVX-512 with 2 FMA per core
  • Xeon Gold (51xx)
    • Up to 14 cores
    • Up to 2 sockets
    • 2 UPI links
    • 6-channel DDR4-2400
    • AVX-512 with 1 FMA per core
  • Xeon Silver (41xx)
    • Up to 12 cores
    • Up to 2 sockets
    • 2 UPI links
    • 6-channel DDR4-2400
    • AVX-512 with 1 FMA per core
  • Xeon Bronze (31xx)
    • Up to 8 cores
    • Up to 2 sockets
    • 2 UPI links
    • No Turbo Boost
    • 6-channel DDR4-2133
    • AVX-512 with 1 FMA per core

That’s…a lot. And it only gets worse when you start to look at the entire SKU lineup with clocks, Turbo Speeds, cache size differences, etc. It’s easy to see why the simplicity argument that AMD made with EPYC is so attractive to an overwhelmed IT department.

Sub-categories are marked with a suffix. T indicates a 10-year life cycle (thermal specific), F indicates units that integrate the Omni-Path fabric on package, and M models can address 1.5TB of system memory. The diagram above shows the scope of the Xeon Scalable launch in a single slide. This release offers buyers flexibility, but at the expense of configuration complexity.
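
To make the naming scheme a little more concrete, here is a minimal sketch of a decoder that maps a model number to its family and suffix meaning. The tier and suffix tables come straight from the lists above; the function name `decode_model` and the sample model numbers are illustrative choices, and the pattern assumes every part follows the four-digit "x1xx" form with at most one suffix letter.

```python
import re

# Family by leading digit, as listed in the family highlights above.
TIERS = {"8": "Platinum", "6": "Gold", "5": "Gold", "4": "Silver", "3": "Bronze"}

# Suffix meanings as described in the text.
SUFFIXES = {
    "T": "10-year life cycle (thermal specific)",
    "F": "integrated Omni-Path fabric on package",
    "M": "can address 1.5TB of system memory",
}

def decode_model(model):
    """Split a model number such as '8180M' into family, series and suffix note."""
    match = re.fullmatch(r"([34568])1\d{2}([TFM]?)", model.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised Xeon Scalable model number: {model!r}")
    lead, suffix = match.groups()
    return {
        "family": TIERS[lead],
        "series": f"{lead}1xx",
        "suffix": SUFFIXES.get(suffix, "none"),
    }

# Sample inputs, used here only to exercise the decoder.
for sample in ("8180M", "6154", "4116T"):
    print(sample, decode_model(sample))
```

A lookup like this only captures the family-level differences; clocks, cache sizes and Turbo behavior still vary SKU by SKU.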

The Intel Xeon Scalable Processor Feature Overview

Though the underlying architecture of the Skylake-SP release we see today shares a lot with the Skylake-X family launched for the HEDT market last month, there are quite a few differences in the platform.

Features like 8-socket support, Omni-Path support, four 10GigE connections built into the chipset, VMD for storage, and QuickAssist Technology stand out.

Xeon Scalable processors will work in 2S, 4S and 8S configurations, with the third UPI link (UPI being the replacement for QPI) optional on the dual and quad configurations. The added UPI link lowers latency between cores by requiring one fewer hop between sockets, while increasing bandwidth for improved scalability.

This slide shows the feature differences of the previous generation Xeon and the new Xeon Scalable family. We have core count increases from 22 to 28 cores, additional PCIe lane availability, 50% more memory channels running at a higher frequency but also a higher TDP range, going all the way up to 205 watts. With the recent discussion swirling around the heat of the Skylake-X processors, you have to wonder how Intel will be addressing this concern with high core count parts for the data center.

Through a combination of architectural tweaks, including an improved branch predictor, higher throughput instruction decoder, better scheduling and larger buffers, Intel estimates that integer performance should see around a 10% improvement at the same clock speed when compared to Broadwell-EP.

AVX-512 is a big part of the improvement in floating point performance with Skylake-SP, offering 512-bit wide vector acceleration and double the peak FLOPS throughput, single and double precision, of the prior generation. Software needs to be coded directly for this new instruction set of course, and enterprise software can often lag due to compatibility concerns, so the benefit of AVX-512 will be judged over a longer period.
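
The "double the peak FLOPS" figure follows from simple lane arithmetic, sketched below. The sketch assumes that Broadwell executes 256-bit AVX2 fused multiply-adds on two units per core and that each FMA counts as two floating point operations per lane; the helper name `peak_flops_per_clock` is made up for this example.

```python
def peak_flops_per_clock(vector_bits, element_bits, fma_units):
    """Peak floating point operations per clock for one core.

    Each FMA unit processes vector_bits / element_bits lanes per clock,
    and one fused multiply-add counts as two operations per lane.
    """
    lanes = vector_bits // element_bits
    return lanes * 2 * fma_units

# Double precision (64-bit elements).
broadwell_dp = peak_flops_per_clock(256, 64, fma_units=2)   # AVX2, assumed 2 FMA units
skylake_dp_2fma = peak_flops_per_clock(512, 64, fma_units=2)  # AVX-512, 2 FMA per core
skylake_dp_1fma = peak_flops_per_clock(512, 64, fma_units=1)  # AVX-512, 1 FMA per core

print(broadwell_dp, skylake_dp_2fma, skylake_dp_1fma)
```

Under these assumptions the doubling applies to the parts with 2 FMA units per core; the families listed earlier with a single AVX-512 FMA unit come out level with Broadwell on peak per-clock throughput.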

Interestingly, the physical layout of the Skylake core has had additions bolted onto it for the Skylake-SP release, with the diagram above being close to scale based on our conversations with Intel. The extra 768KB of L2 cache sits outside the main core, which does add a clock or two of latency.

We already know that AVX code forces a CPU core to use more power, but with Skylake-SP Intel has implemented the ability for each core to dynamically adjust its Turbo state based on the code being executed on it.

Along with the improvements in scalability that the mesh architecture offers Skylake-SP, Intel has distributed the cache and home agents across each of the core nodes. This gives the processor the ability to have more memory requests in flight simultaneously compared to the previous generation that had only a few QPI agents.

I already talked about the cache redesign with Skylake-X and SP and even detailed some of the performance differences that it accounts for. This new design has a smaller L3 cache but a larger L2 for better thread-memory locality, while moving to a non-inclusive design. The benefits for the data center include improved virtualization performance with a larger private L2 cache per core and a reduction in uncore activity due to fewer external cache memory requests.

This graph shows the relative change in cache misses from Broadwell-EP to Skylake-SP in various workloads. Intel is upfront that in some cases, the L3 cache misses are in fact worse than Broadwell, but when the L2 cache hits improve, they can improve DRASTICALLY. Look at a workload like POV-Ray that has essentially been reduced to zero cache misses thanks to the larger capacity of the L2.

The memory system on Skylake-SP gets a significant upgrade this generation, moving from a quad-channel to a 6-channel memory controller, with many processor SKUs offering DDR4-2666 support as well. This equates to a 60% improvement in total memory bandwidth availability per socket compared to Broadwell-EP. By utilizing the mesh architecture and breaking the memory controller into two 3-channel segments at opposite sides of the die, Intel could lower the distance and number of hops (on average) between any single core and memory.
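
For a sense of where a figure like this comes from, the sketch below computes theoretical peak bandwidth per socket from channel count and transfer rate. It assumes 64-bit (8-byte) DDR4 channels and, as a comparison point, a quad-channel Broadwell-EP running DDR4-2400; that Broadwell memory speed is an assumption, not something stated above.

```python
BYTES_PER_TRANSFER = 8  # one 64-bit DDR4 channel moves 8 bytes per transfer

def peak_bandwidth_gbs(channels, megatransfers_per_s):
    """Theoretical peak memory bandwidth per socket in GB/s."""
    return channels * megatransfers_per_s * BYTES_PER_TRANSFER / 1000

skylake_sp = peak_bandwidth_gbs(6, 2666)    # 6-channel DDR4-2666
broadwell_ep = peak_bandwidth_gbs(4, 2400)  # assumed 4-channel DDR4-2400
uplift = skylake_sp / broadwell_ep - 1

print(f"Skylake-SP:   {skylake_sp:.1f} GB/s")
print(f"Broadwell-EP: {broadwell_ep:.1f} GB/s")
print(f"Uplift:       {uplift:.0%}")
```

With these inputs the theoretical uplift lands a little above the 60% quoted in the text; the extra channels alone account for 50%, and the rest depends on which memory speeds are compared, so the two figures likely rest on different assumptions.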

Replacing QPI is the UPI, Ultra Path Interconnect. With a combination of improved messaging efficiency and higher clock rates, the UPI links can go as high as 10.4 GT/s, up from 9.6 GT/s on QPI. (Though some of the new CPUs will still run at 9.6 GT/s.)

Intel VMD, Volume Management Device, is a new capability for Skylake-SP that enables a way to aggregate multiple NVMe SSDs into a single volume while enabling other features like RAID at the same time. This occurs outside of the OS, handling all enumeration and events/errors inside the domain. I am working with Allyn to do a deeper dive on what this means for enterprise and data center storage so hopefully we will have more details soon.

Optimized Turbo profiles on Skylake-SP allow the CPU to run at higher frequencies than the previous generation by moving away from the simplistic “one clock bin per active core” algorithm. This basically means that the CPU will be running at higher frequencies more often when partially loaded, but it doesn’t change the fundamental behavior under a full load.

This is also the first time that Speed Shift is introduced at the data center level, bringing faster clock speed changes to the server markets. While in the consumer space this feature is meant to improve the responsiveness of a notebook or PC, for the server market it allows for better control over the energy/performance metrics.

By using software to moderate the aggressiveness of the clock speed increases, OEMs can offer slower or faster clock and power draw ramps depending on the workload and its requirements.

Available only on the Platinum and the 61xx Gold series of processors, Intel offers a handful of Xeon parts with integrated Omni-Path fabric on package, utilizing a dedicated 16 lanes of PCIe 3.0 for connectivity. This is a physical connector on the processor itself, and will require specific platform development, though a board can be designed to use both types of processors. 
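
To put the dedicated 16 lanes in perspective, here is a small sketch of the usable bandwidth of a PCIe 3.0 link. It uses the standard PCIe 3.0 parameters of 8 GT/s per lane with 128b/130b encoding and ignores protocol overhead above the encoding layer; the helper name is hypothetical.

```python
def pcie3_bandwidth_gbs(lanes):
    """Approximate PCIe 3.0 bandwidth per direction in GB/s.

    8 GT/s per lane, 128b/130b line encoding, 8 bits per byte.
    Packet and protocol overhead are not modelled.
    """
    gt_per_s = 8.0
    encoding_efficiency = 128 / 130
    return lanes * gt_per_s * encoding_efficiency / 8

print(f"x16 link: {pcie3_bandwidth_gbs(16):.2f} GB/s per direction")
print(f"48 lanes: {pcie3_bandwidth_gbs(48):.2f} GB/s per direction")
```

The x16 figure is the ceiling for the on-package fabric link, and the 48-lane figure is the same calculation applied to the general-purpose lanes listed for the platform.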


July 11, 2017 | 05:50 PM - Posted by Hishnash (not verified)

that is one complex lineup.

I know it's targeted at data centers, which probably employ a few people full time to understand this, but still: since AMD is able to provide a single lineup with all CPUs having the same RAM and PCIe, it's a shame Intel can't offer the same and make these a little less confusing.

July 11, 2017 | 07:15 PM - Posted by Xebec

Interesting IPC chart.. they're saying -

IPC is up 70% from Merom (Core 2) to today.. (remove the multiplier at the beginning as they're comparing Merom to Yonah, then remultiply everything), and 35% from Sandy Bridge to today.

I'm guessing the 3.8x per socket is more tied with the power efficiency improvements over top of IPC - performance/watt over the decade..

July 11, 2017 | 08:18 PM - Posted by Clmentoz (not verified)

It looks like Anandtech had some server SKUs to look at for a few weeks. It's too bad that Anandtech did not ask AMD any questions about AMD's Infinity Fabric on Epyc and how it relates to Infinity Fabric based direct-attached GPU acceleration, from the EPYC CPU to any Vega/Vega micro-arch GPU accelerators (that support the Infinity Fabric) for FP/AI workloads.

AMD did cover this as a possible Epyc to Vega "NVLink" style coherent interface option for GPU accelerated compute workloads, with fully coherent Infinity Fabric communication from Epyc to Vega, in their presentation a few months back with those seismic benchmarks on some Epyc engineering samples. But as of yet there have been no demonstrations from AMD, and maybe for SIGGRAPH there will be more information forthcoming.

From Anandtech:

"First of all, we have to emphasize that we were only able to spend about a week on the AMD server, and about two weeks on the Intel system. With the complexity of both server hardware and especially server software, that is very little time. There is still a lot to test and tune, but the general picture is clear." [see: closing thoughts (1)]

(1)

"Sizing Up Servers: Intel's Skylake-SP Xeon versus AMD's EPYC 7000 - The Server CPU Battle of the Decade?"

http://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd-epyc-7000-cp...

July 13, 2017 | 04:00 PM - Posted by Anonymouse (not verified)

Tons of info about both fabrics and chips is at ServeTheHome.

Beating both (in performance) is the POWER9, which has been completely sold out (to Google and Facebook). Next year there'll be more CPUs available for us poor folks.

July 11, 2017 | 08:56 PM - Posted by RealExascale (not verified)

No Optane support for those prices? Come on! What's all that about OLTP? Is anyone really going to choose x86 over SPARC or a real mainframe for OLTP?

The only nice thing about this release is AVX-512 and the potential for the highest end CPUs to scale to 8 sockets gluelessly. That's not new or unique though. E7s have done that for years.

SPARC or ARM would be better for scale up anyway. ARM and Epyc both have 48-bit physical addressing and Intel is still using 46-bit. 46-bit PA on a $13,000 CPU!!!

Epyc also has 2TB per socket vs 1.5, 128 PCIe lanes vs 48, and fully encrypted memory (so does Ryzen Pro). Intel has nothing to compete with that at all. Epyc is a way better value.

AMD's fully encrypted memory really destroys these new Xeons. Intel better implement it for Optane DIMM supporting platforms or people won't need cryogenics for cold boot attacks anymore!

July 12, 2017 | 06:58 AM - Posted by psuedonymous

"Despite all the options provided, some areas seem left out to me. Intel’s highest core count processor targeted at the “up to 2-socket” market comes in at $1000 but only has 12-cores and a 2.1 GHz clock speed. Similarly, the highest “up to 4-socket” processor is the 6154 with 18-cores and a 3.0 GHz clock speed, priced at $3500. Neither of these seems to be particularly well positioned to take on the AMD EPYC processors launched last month that will offer 32-cores/64-threads and clock speeds up to 3.2 GHz for around $4200."

That's because unlike the previous E5/E7 system, you can grab one of those crazy 28-core chips and stick it in a single-socket board if you wanted. Single-socket-across-the-range is a big change from the 1/2/many socket splits previously implemented.

It's also likely why Intel has not announced any LGA2066 Xeons: there probably aren't any. The presence of 'small' Xeon Gold dies (that you'd normally expect to see sharing the small prosumer socket) on LGA3647 almost confirms this.

July 12, 2017 | 11:27 AM - Posted by Ryan Shrout

Interesting thought. My thinking had been that Intel was making LGA2066 Xeons for the iMac Pro offerings.

July 13, 2017 | 08:47 AM - Posted by Hishnash (not verified)

Apple might be getting slightly custom chips: a version of https://ark.intel.com/products/124943/Intel-Xeon-Gold-6144-Processor-24_... with Turbo Boost v3 enabled (to give the 4.5GHz boost Apple has listed on the website). Given that the i9 parts are basically these chips with ECC disabled, it's possible Intel could create some.

Apple has in the past been given custom Xeons for their Mac Pro (delidded ones, so that Apple could put good thermal contact directly onto the heatsink). Given the chatter about heat issues with the i9 line, Apple may require this again this time round.

July 12, 2017 | 06:02 PM - Posted by Toby Broom (not verified)

Traditionally there were the 16xx Xeons that were the counterparts of the i7s, so you would imagine that these would be created?

It will be interesting to see how HP develops the Z460, Z660 & Z860 workstations in comparison to the Apple xPros.

July 13, 2017 | 08:48 AM - Posted by Hishnash (not verified)

Unless HP can get the same deal as Apple and get custom https://ark.intel.com/products/124943/Intel-Xeon-Gold-6144-Processor-24_... parts with Turbo Boost enabled, the iMac Pro will deliver more power for lower core count tasks where the turbo boost plays a part.
