Server and Workstation Upgrades
Intel is launching the Xeon E5-2600 v3 with up to 18 Haswell cores.
Today, on the eve of the Intel Developer Forum, the company is taking the wraps off its new server- and workstation-class high-performance processors, the Xeon E5-2600 v3. Known previously by the code name Haswell-EP, the release marks the entry of Intel's latest microarchitecture into multi-socket infrastructure. Though we don't have hardware on hand to offer you in-house benchmarks quite yet, the details Intel shared with me last month in Oregon are simply stunning.
Starting with the E5-2600 v3 processor overview, there are more changes in this product transition than we saw in the move from Sandy Bridge-EP to Ivy Bridge-EP. First and foremost, the v3 Xeons will be available in core counts as high as 18, with HyperThreading allowing for 36 accessible threads in a single CPU socket. A new socket, LGA2011-v3 or R3, allows the Xeon platforms to run a quad-channel DDR4 memory system, very similar to the upgrade we saw with the Haswell-E Core i7-5960X processor we reviewed just last week.
The move to a Haswell-based microarchitecture also means that the Xeon line of processors is getting AVX 2.0, known also as Haswell New Instructions, allowing for 2x the FLOPS per clock per core. It also introduces some interesting changes to Turbo Mode and power delivery we'll discuss in a bit.
Maybe the most interesting architectural change in the Haswell-EP design is per-core P-states, allowing each of the up to 18 cores on a single Xeon processor to run at independent voltages and clocks. This is something that the consumer variants of Haswell do not currently support; every core is tied to the same P-state. It turns out that when you have up to 18 cores on a single die, this ability is crucial to supporting maximum performance across a wide array of compute workloads and to maintaining power efficiency. This is also the first processor to allow independent uncore frequency scaling, giving Intel the ability to improve performance with available headroom even if the CPU cores aren't the bottleneck.
QPI speeds get a slight upgrade on the platform as well, increasing available bandwidth between sockets in multi-processor systems. TDPs are raised as well, but they are within 10-15 watts of the previous generation, so it's likely that not much redevelopment will be required by vendors to support the new Xeon family.
I won't spend too much time here; there are 22 different SKUs of the Xeon E5-2600 v3 being launched today, ranging from a quad-core 3.0 GHz part to the 18-core E5-2699 v3 with a clock speed of 2.3 GHz. For low power environments there is a 55 watt processor option with 8 cores running at 1.8 GHz. Expect pricing to vary dramatically throughout the line as well.
Intel has built three chips for Haswell-EP with varying core count options. All are fabricated on Intel's 22nm tri-gate transistor technology: one chip addresses 4-8 core processors, another addresses 6-12 core configurations and a third addresses 14-18 core configurations. Intel has left itself some overlap on the 6 and 8 core processors to bin and sort accordingly. These chips are BIG:
- High Core Count
  - 5.56 billion transistors
  - 661 mm2 die
- Medium Core Count
  - 3.83 billion transistors
  - 483 mm2 die
- Low Core Count
  - 2.6 billion transistors
  - 354 mm2 die
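As a rough sanity check on those figures, here is a minimal sketch that divides transistor count by die area for each of the three dies. Only the numbers come from the list above; the names `DIES` and `density_millions_per_mm2` are made up for illustration, and the result is a simple average that says nothing about how the area is split between cores, cache and uncore.

```python
# Transistor counts and die areas exactly as quoted in the list above.
# DIES and density_millions_per_mm2 are illustrative names, not Intel's.
DIES = {
    "High Core Count": (5.56e9, 661),
    "Medium Core Count": (3.83e9, 483),
    "Low Core Count": (2.6e9, 354),
}


def density_millions_per_mm2(transistors, area_mm2):
    """Average density in millions of transistors per square millimeter."""
    return transistors / area_mm2 / 1e6


if __name__ == "__main__":
    for name, (transistors, area_mm2) in DIES.items():
        density = density_millions_per_mm2(transistors, area_mm2)
        print(f"{name}: {density:.2f} M transistors per mm2")
```

By this simple average the three dies land between roughly 7.3 and 8.4 million transistors per mm2, with the larger dies coming out slightly denser.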
This table offers a high level overview of all the major changes found in the v3 revision of the Xeon E5-2600 and the real-world benefits of the technologies. For example, the on-die bus has been updated to include two fully buffered rings, a necessary addition to support the extreme core counts launching today. The QPI interface frequency increase improves multi-socket coherence performance and Last Level Cache (LLC) changes reduce latency and increase bandwidth.
A comparison of the Xeon E5-2600 v2 and v3 internal architectures demonstrates the necessity of the buffered switches on the two ring buses. IVB-EP stretched to 12 cores, but the move to 18 cores requires some updated communication protocols. It is also interesting to note that many of the products will feature "unbalanced" dies, where there are more cores on one ring bus than on the other. Intel assured us that these differences are very, very minimal and should in no way affect per-thread performance.
Crypto algorithms see a sizeable performance gain with the jump to AVX 2.0, even compared to SNB and IVB.
But the AVX performance does come at a cost: because power draw increases when the cores are heavily loaded with AVX instructions, clock speeds are going to be lower. These processors will now have a rated core base and turbo speed but also an AVX base frequency and an AVX Turbo frequency.
Resulting frequencies will depend on the utilization levels of the AVX code. For this example slide, with the 18-core E5-2699 v3, the AVX base clock of 1.9 GHz will extend up to 2.6 GHz for "most" AVX workloads. If you are running an application with heavy AVX code inclusion you might be limited to 2.2 GHz or lower. Obviously the efficiency improvements that you get with AVX code will more than make up for the clock speed differences.
Along with the new series of processors comes some new platform technology as well. The C612 chipset has nearly identical specs to the X99 chipset launched with the consumer Haswell-E platform this month. That includes 10 SATA 6G ports, 6 USB 3.0 ports, 8 USB 2.0 ports and up to 8 lanes of PCIe 2.0.
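To put rough numbers on that trade-off, here is a back-of-the-envelope sketch using only figures quoted in this article for the E5-2699 v3. It assumes, optimistically, that AVX 2.0 code gets the full 2x FLOPS per clock that Intel quotes, and it compares against the same chip running non-AVX code at its rated 2.3 GHz. The function and constant names are made up for illustration, and real workloads will land below this ideal.

```python
# Clock speeds quoted in the article for the 18-core E5-2699 v3.
RATED_BASE_GHZ = 2.3
AVX_CLOCKS_GHZ = {
    "AVX base": 1.9,
    "heavy AVX code": 2.2,
    "most AVX workloads": 2.6,
}

# Assumption: AVX 2.0 code achieves the full 2x FLOPS per clock Intel quotes.
FLOPS_PER_CLOCK_GAIN = 2.0


def relative_throughput(avx_clock_ghz, base_clock_ghz=RATED_BASE_GHZ,
                        gain=FLOPS_PER_CLOCK_GAIN):
    """Throughput of AVX code at the AVX clock relative to non-AVX code
    running at the rated base clock (1.0 means no change)."""
    return gain * avx_clock_ghz / base_clock_ghz


def break_even_gain(avx_clock_ghz, base_clock_ghz=RATED_BASE_GHZ):
    """Per-clock gain AVX code needs just to cancel out the lower clock."""
    return base_clock_ghz / avx_clock_ghz


if __name__ == "__main__":
    for label, clock in AVX_CLOCKS_GHZ.items():
        print(f"{label} ({clock} GHz): "
              f"{relative_throughput(clock):.2f}x throughput, "
              f"break-even gain {break_even_gain(clock):.2f}x")
```

Even at the 1.9 GHz AVX base clock, the ideal case works out to about 1.65x the throughput, and AVX code only needs around a 1.21x per-clock gain to break even against the lower frequency.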
This chipset has support for two socket systems but still connects to the primary processor through DMI, which is a bit of a bandwidth limiting factor.
For a workstation or server builder, in the 2S market, Haswell-EP offers an unmatched combination of performance and features. With 40 lanes of PCI Express 3.0 from EACH processor, there is plenty of room for accelerator cards (GPUs, Xeon Phi) to be included and of course you can support Intel's latest Fortville network controllers with support for 40 GbE connectivity.
For small-scale server or workstation buyers that are looking for optimal levels of performance for tasks like video editing or rendering, the combination of a high core count Xeon E5-2600 v3 processor and the C612 chipset should be a screamer. Internally here at PC Perspective, building a system with 36 processing cores and 72 processing threads (dual E5-2699 v3 CPUs) is dream-worthy, likely cutting work times for some tasks to a fraction of what they are now. The only real hiccup would be that current Windows operating systems can only address blocks of threads up to 64, meaning 8 threads would be underutilized in that build.
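The thread math behind that dream build can be sketched as follows. This only mirrors the arithmetic in the paragraph above and takes the 64-thread figure at face value; it does not model how any particular Windows release actually assigns logical processors to groups, and all names are made up for illustration.

```python
# Dual E5-2699 v3 build described in the article.
SOCKETS = 2
CORES_PER_SOCKET = 18
THREADS_PER_CORE = 2  # HyperThreading

# Assumption taken from the article: threads are addressed in blocks of up to 64.
THREAD_BLOCK_LIMIT = 64


def logical_processors(sockets, cores_per_socket, threads_per_core):
    """Total hardware threads visible to the operating system."""
    return sockets * cores_per_socket * threads_per_core


def threads_beyond_limit(total_threads, limit=THREAD_BLOCK_LIMIT):
    """Threads that do not fit inside a single block of `limit` threads."""
    return max(0, total_threads - limit)


if __name__ == "__main__":
    total = logical_processors(SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE)
    print(f"{total} threads, {threads_beyond_limit(total)} beyond one block")
```

A single 18-core socket, at 36 threads, stays well inside one block; it is only the dual-socket configuration that spills over.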
I am hoping to get my hands on some of this hardware after IDF this week to really put it to the test. I realize that much of the target audience for processors like the Xeon E5-2600 v3 is beyond the scope of what we usually cover (HPC, comms servers, etc.), but the performance metrics to be gathered would be impressive. It's hard to even remember when it started, but Intel's dominance in the high performance server market continues for yet another generation.
36 processing cores and 72 processing threads, so that’s only 2 threads per core, and it takes 2 separate 18-core Xeons, while Power8 and Sparc appear to be going higher with SMT (8 threads per core). I wonder if the AVX2 bandwidth handling has something to do with it. The Windows builds for these are probably semi-custom and can handle the extra cores/threads, and Linux goes much higher but is still limited by the amount of memory overhead on a per-OS resources basis. A Power8 (8 threads per core) with 12 cores is 96 threads, and a Sparc M7 (8 threads per core) with 32 cores is 256 threads. I’m sure Google will have these Xeons and be benching them against Power8 on different server workloads, and the enterprise server websites will be running independent benchmarks shortly. Too bad Anand is no longer at AnandTech; his evaluations will be missed. OS licensing and other cost considerations are going to limit Sparc, but Xeon and Power8 are about to do battle, and Google’s next moves are going to mean a lot going forward. What are the chances of getting motherboards to test both systems out: Xeon, or a non-made-for-Google (1) Power8 (Tyan motherboard*)?
*The Tyan reference board is called the SP010, and the ATX board measures 12 inches by 9.6 inches.
(1) Google makes its own non-standard motherboards for its Power8s.
What, no 18-core workstation parts? Come on, Intel, let's see 18 cores at 3+ GHz, damn the TDP.
To go up to higher thread counts per core, you need to sacrifice single-thread performance. This is because a lot of the resources are shared between threads, so more threads mean fewer resources for each thread. There are still applications which perform better with HyperThreading off. AMD’s version of multi-threading actually shares less hardware between threads, which is probably why they do much better with multi-threaded workloads. They have almost separate integer processing cores with shared FP units. Intel shares a lot of the integer processing core also, but they made it wider to compensate.
Going to a higher thread count per core may not actually be that worthwhile going forward, at least not for consumer applications. I tried to calculate the size of a Broadwell core (14 nm) based on one of the earlier photos and a total die size of 82 mm2; I came up with around 12 mm2 for an entire core, including L2 and 2 MB of L3 cache. This could be inaccurate but should be close. For applications that can use more threads, it will probably make more sense to just use more physical cores, since they are so small. AMD may be going that route with their ARM-based server chips: just throw more cores at the problem if single-thread performance isn’t a priority.
I am more curious about how using the new vector instructions compares to just running on a GPU. It seems like anything parallel enough to actually make good use of them could run faster on a GPU (more memory bandwidth and more hardware, but possibly slower-clocked hardware).
And this single-thread performance penalty, compared to the non-x86 Sparc and Power8 server SKUs with more SMT, is a tradeoff. AMD has to get its custom (non-ARM Holdings reference design) wide order server cores that can run the ARMv8 ISA to really compete in the coming ARM-based server market. AMD will need that new x86 microarchitecture rework done ASAP also, but server workloads will vary from simple web page serving, which can be handled by ARM ISA-based designs, to the heavy analytics that are done on the x86/Power8/Sparc systems. In systems like the Power8 and others, dynamic SMT allows the number of threads to be set anywhere from 1 to the max (8 in this case for Power8), so why does Intel not do so? The ability to scale the thread resources should not prevent more SMT. The main reason has more to do with RISC and CISC, and the amount of on-die resources needed to duplicate extra instruction fetch units, FP and integer units, etc.; more transistors are needed for CISC versus RISC designs (Sparc, Power8). Intel appears to be stressing CMP (core multiprocessing) more than SMT, along with AVX2 resources that take up more die space. Intel’s offerings are spread out across more SKUs than the other non-x86 and competing x86 designs. I’m seeing a lot of benchmarking of these Intel server parts on traditional gaming/tech websites, with benchmarks that are mostly non-server related, so the smaller core count Xeon SKUs are going to be for the low-end workstation market. It would be nice to see some benchmarking of the SKUs not intended for high-end enterprise servers, built around the v3 systems.
And what about the TSX instruction erratum on these initial products?