Podcast #458 - Intel Xeons, Thunderbolt 3 GPU chassis, Affordable 10GbE, and more!

Subject: General Tech | July 13, 2017 - 11:40 AM |
Tagged: xeon, x299, video, thunderbolt 3, sapphire, RX470, rift, radeon, podcast, nand, Intel, HDK2, gigabyte, external gpu, asus, 10GbE

PC Perspective Podcast #458 - 07/13/17

Join us for the Intel Xeon launch, external Thunderbolt 3 GPUs, 10Gb Ethernet, and more!

You can subscribe to us through iTunes and you can still access it directly through the RSS page HERE.

The URL for the podcast is: http://pcper.com/podcast - Share with your friends!

Hosts: Ryan Shrout, Jeremy Hellstrom, Josh Walrath, Allyn Malventano

Peanut Gallery: Ken Addison, Alex Lustenberg

Program length: 1:38:08
 
Podcast topics of discussion:
  1. Week in Review:
  2. News items of interest:
  3. Hardware/Software Picks of the Week
    1. Ryan: ASUS XG-C100C lol
    2. Jeremy: Um, well I keep meaning to play Deserts of Kharak
  4. Closing/outro

Subscribe to the PC Perspective YouTube Channel for more videos, reviews and podcasts!!

Subject: Processors
Manufacturer: Intel

A massive lineup

The number and significance of the product and platform launches occurring today under the Intel Xeon Scalable family is staggering. Intel is launching more than 50 processors and 7 chipsets under the Xeon Scalable brand, targeting data centers and enterprise customers across a wide range of markets and segments. From SMB users to “Super 7” data center clients, the new Xeon lineup is likely to have an option aimed at each of them.

All of this comes at an important point in time, with AMD fielding its new EPYC family of processors and platforms and, for the first time in nearly a decade, becoming competitive in the space. That decade of clear dominance in the data center has been good to Intel, giving it the ability to bring in profits and high margins without the direct fear of a strong competitor. Intel did not spend those 10 years flat-footed, though; instead it has been developing complementary technologies including new Ethernet controllers, ASICs, Omni-Path, FPGAs, solid state storage tech, and much more.

cpus.jpg

Our story today will give you an overview of the new processors and the changes that Intel’s latest Xeon architecture offers to business customers. The Skylake-SP core has some significant upgrades over the Broadwell design before it, but in other aspects the processors and platforms will be quite similar. What changes can you expect with the new Xeon family?

01-11 copy.jpg

Per-core performance has been improved with the updated Skylake-SP microarchitecture and a new cache memory hierarchy that we previewed with the Skylake-X consumer release last month. The memory and PCIe interfaces have been upgraded with more channels and more lanes, giving the platform more flexibility for expansion. Socket-level performance also goes up thanks to higher available core counts and the improved UPI interface, which makes socket-to-socket communication more efficient. AVX-512 doubles the peak FLOPS/clock on Skylake over Broadwell, a benefit for HPC and analytics workloads. Intel QuickAssist improves cryptography and compression performance for faster connectivity implementations. Security and agility get an upgrade as well with Boot Guard, RunSure, and VMD for better NVMe storage management. On the surface this looks like a simple upgrade, but a lot has improved under the hood.

01-12 copy.jpg

We already had a good look at the new mesh architecture used for the inter-core component communication. This transition away from the ring bus that was in use since Nehalem gives Skylake-SP a couple of unique traits: slightly longer latencies but with more consistency and room for expansion to higher core counts.

01-18 copy.jpg

Intel has changed its naming scheme with the Xeon Scalable release, moving away from “E5/E7” and “v4” to a Platinum, Gold, Silver, Bronze nomenclature. Product differentiation remains much the same, with the Platinum processors offering the highest feature support, including 8-socket capability, the highest core counts, the highest memory speeds, connectivity options, and more. To be clear: there are a lot of new processors, and building an easy-to-read table of features and clocks is nearly impossible. The highlights of the different families are:

  • Xeon Platinum (81xx)
    • Up to 28 cores
    • Up to 8 sockets
    • Up to 3 UPI links
    • 6-channel DDR4-2666
    • Up to 1.5TB of memory
    • 48 lanes of PCIe 3.0
    • AVX-512 with 2 FMA per core
  • Xeon Gold (61xx)
    • Up to 22 cores
    • Up to 4 sockets
    • Up to 3 UPI links
    • 6-channel DDR4-2666
    • AVX-512 with 2 FMA per core
  • Xeon Gold (51xx)
    • Up to 14 cores
    • Up to 2 sockets
    • 2 UPI links
    • 6-channel DDR4-2400
    • AVX-512 with 1 FMA per core
  • Xeon Silver (41xx)
    • Up to 12 cores
    • Up to 2 sockets
    • 2 UPI links
    • 6-channel DDR4-2400
    • AVX-512 with 1 FMA per core
  • Xeon Bronze (31xx)
    • Up to 8 cores
    • Up to 2 sockets
    • 2 UPI links
    • No Turbo Boost
    • 6-channel DDR4-2133
    • AVX-512 with 1 FMA per core
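To make the tiers easier to compare at a glance, here is one way an overwhelmed IT department might encode the highlights above for quick lookup. This is a hypothetical sketch; the numbers come straight from the bullets, and the names and structure are ours, not any official Intel data format:

```python
# Feature highlights per Xeon Scalable family, transcribed from the list above.
FAMILIES = {
    "Platinum 81xx": {"max_cores": 28, "max_sockets": 8, "upi_links": 3,
                      "ddr4": 2666, "fma_per_core": 2},
    "Gold 61xx":     {"max_cores": 22, "max_sockets": 4, "upi_links": 3,
                      "ddr4": 2666, "fma_per_core": 2},
    "Gold 51xx":     {"max_cores": 14, "max_sockets": 2, "upi_links": 2,
                      "ddr4": 2400, "fma_per_core": 1},
    "Silver 41xx":   {"max_cores": 12, "max_sockets": 2, "upi_links": 2,
                      "ddr4": 2400, "fma_per_core": 1},
    "Bronze 31xx":   {"max_cores": 8,  "max_sockets": 2, "upi_links": 2,
                      "ddr4": 2133, "fma_per_core": 1},
}

def families_supporting(min_sockets):
    """Return the families that scale to at least the given socket count."""
    return [name for name, f in FAMILIES.items()
            if f["max_sockets"] >= min_sockets]

print(families_supporting(4))  # -> ['Platinum 81xx', 'Gold 61xx']
```

Even a toy table like this makes the pattern visible: only the top two tiers scale past two sockets, and DDR4 speed and FMA units drop in lockstep as you move down.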

That’s…a lot. And it only gets worse when you start to look at the entire SKU lineup with clocks, Turbo Speeds, cache size differences, etc. It’s easy to see why the simplicity argument that AMD made with EPYC is so attractive to an overwhelmed IT department.

01-20 copy.jpg

Two sub-categories exist with the T and F suffixes: T indicates a 10-year life cycle (thermal-specific), while F marks units that integrate the Omni-Path fabric on-package. M models can address 1.5TB of system memory. The diagram above, which you should click to see a larger view, shows the scope of the Xeon Scalable launch in a single slide. This release offers buyers flexibility, but at the expense of configuration complexity.

Continue reading about the new Intel Xeon Scalable Skylake-SP platform!

Microcode Bug Affects Intel Skylake and Kaby Lake CPUs

Subject: Processors | June 26, 2017 - 08:53 AM |
Tagged: xeon, Skylake, processor, pentium, microcode, kaby lake, Intel, errata, cpu, Core, 7th generation, 6th generation

A microcode bug affecting Intel Skylake and Kaby Lake processors with Hyper-Threading has been discovered by Debian developers (who describe it as "broken hyper-threading"), a month after Intel detailed the issue in errata updates back in May. The bug can cause the system to behave 'unpredictably' in certain situations.

Intel CPUs.jpg

"Under complex micro-architectural conditions, short loops of less than 64 instructions that use AH, BH, CH or DH registers as well as their corresponding wider register (eg RAX, EAX or AX for AH) may cause unpredictable system behaviour. This can only happen when both logical processors on the same physical processor are active."

Until motherboard vendors begin to address the bug with BIOS updates, the only way to prevent the possibility of this microcode error is to disable Hyper-Threading. From the report at The Register (source):

"The Debian advisory says affected users need to disable hyper-threading 'immediately' in their BIOS or UEFI settings, because the processors can 'dangerously misbehave when hyper-threading is enabled.' Symptoms can include 'application and system misbehaviour, data corruption, and data loss'."

The affected models are 6th and 7th-generation Intel processors with Hyper-Threading, which include Core CPUs as well as some Pentiums, and Xeon v5 and v6 processors.

Source: The Register

Intel Skylake-X and Skylake-SP Utilize Mesh Architecture for Intra-Chip Communication

Subject: Processors | June 15, 2017 - 04:00 PM |
Tagged: xeon scalable, xeon, skylake-x, skylake-sp, skylake-ep, ring, mesh, Intel

Though we are just days away from the release of Intel’s Core i9 family based on Skylake-X, and a bit further away from the Xeon Scalable Processor launch using the same fundamental architecture, Intel is sharing a bit of information on how the insides of this processor tick. Literally. One of the most significant changes to the new processor design comes in the form of a new mesh interconnect architecture that handles the communications between the on-chip logical areas.

Since the days of Nehalem-EX, Intel has utilized a ring-bus architecture in its processor designs. The ring bus operates in a bi-directional, sequential fashion, cycling through various stops. At each stop, control logic determines whether data is to be collected or deposited at that module. These ring bus stops are located at the memory controllers, CPU cores/caches, the PCI Express interface, the LLC slices, and so on. The ring bus was fairly simple and easily expandable by adding more stops to the ring itself.

xeon-processor-5.jpg

However, over several generations the ring bus has become quite large and unwieldy. Compare the ring bus from Nehalem above to the one for last year’s Xeon E5 v4 platform.

intel-xeon-e5-v4-block-diagram-hcc.jpg

The spike in core counts and other modules caused a ballooning of the ring that eventually turned into multiple rings, complicating the design. As you increase the stops on the ring bus, you also increase the physical latency of messaging and data transfer, which Intel compensated for by increasing the bandwidth and clock speed of the interface. The expense of that is power and efficiency.

For an on-die interconnect to remain relevant, it needs to scale flexibly in bandwidth, reduce latency, and remain energy efficient. With 28-core Xeon processors imminent, and new I/O capabilities coming along with them, the time for the ring bus in this space is over.

Starting with the HEDT and Xeon products released this year, Intel will be using a new on-chip design called a mesh that Intel promises will offer higher bandwidth, lower latency, and improved power efficiency. As the name implies, the mesh architecture is one in which each node relays messages through the network between source and destination. Though I cannot share many of the details on performance characteristics just yet, Intel did share the following diagram.

intelmesh.png

As Intel indicates in its blog on the mesh announcements, this generic diagram “shows a representation of the mesh architecture where cores, on-chip cache banks, memory controllers, and I/O controllers are organized in rows and columns, with wires and switches connecting them at each intersection to allow for turns. By providing a more direct path than the prior ring architectures and many more pathways to eliminate bottlenecks, the mesh can operate at a lower frequency and voltage and can still deliver very high bandwidth and low latency. This results in improved performance and greater energy efficiency similar to a well-designed highway system that lets traffic flow at the optimal speed without congestion.”

The bi-directional mesh design allows a many-core processor to offer lower node-to-node latency than the ring architecture could provide, and by adjusting the width of the interface, Intel can control bandwidth (and, by extension, frequency). Intel tells us that this can offer lower average latency without increasing power. Though it wasn’t specifically mentioned in the blog, the assumption is that, because nothing is free, the more granular mesh network comes at a slight die-size cost.
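The latency argument is easy to see with a back-of-the-envelope hop count. The toy model below compares the average shortest-path distance on a single bidirectional ring of 28 stops against the same 28 stops arranged as a 4x7 grid. This is purely an illustration of the scaling, not a model of Intel's actual stop layout or routing:

```python
def ring_avg_hops(n):
    """Average shortest-path hops on a bidirectional ring of n stops."""
    # From any stop, the shortest route to a stop d positions away is
    # min(d, n - d) hops, going whichever direction is shorter.
    return sum(min(d, n - d) for d in range(n)) / n

def mesh_avg_hops(rows, cols):
    """Average Manhattan distance between two stops on a rows x cols mesh."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    total = sum(abs(r1 - r2) + abs(c1 - c2)
                for (r1, c1) in nodes for (r2, c2) in nodes)
    return total / (len(nodes) ** 2)

ring = ring_avg_hops(28)   # 28 stops on one big ring
mesh = mesh_avg_hops(4, 7) # the same 28 stops as a 4x7 grid
print(f"ring: {ring:.2f} avg hops, mesh: {mesh:.2f} avg hops")
```

Even in this crude model the grid roughly halves the average hop count at 28 nodes, and the gap widens as core counts grow, which is exactly the scaling pressure that pushed Intel off the ring.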

Using a mesh architecture offers a couple of capabilities and also requires a few changes to the cache design. By dividing up the IO interfaces (think multiple PCI Express banks, or memory channels), Intel can provide better average access times to each core by intelligently spacing the location of those modules. Intel will also be breaking up the LLC into different segments which will share a “stop” on the network with a processor core. Rather than the previous design of the ring bus where the entirety of the LLC was accessed through a single stop, the LLC will perform as a divided system. However, Intel assures us that performance variability is not a concern:

Negligible latency differences in accessing different cache banks allows software to treat the distributed cache banks as one large unified last level cache. As a result, application developers do not have to worry about variable latency in accessing different cache banks, nor do they need to optimize or recompile code to get a significant performance boost out of their applications.

There is a lot to dissect when it comes to this new mesh architecture for Xeon Scalable and Core i9 processors, including its overall effect on the LLC cache performance and how it might affect system memory or PCI Express performance. In theory, the integration of a mesh network-style interface could drastically improve the average latency in all cases and increase maximum memory bandwidth by giving more cores access to the memory bus sooner. But, it is also possible this increases maximum latency in some fringe cases.

Further testing awaits for us to find out!

Source: Intel

AMD Compares 1x 32-Core EPYC to 2x 12-Core Xeon E5s

Subject: Processors | May 17, 2017 - 04:05 AM |
Tagged: amd, EPYC, 32 core, 64 thread, Intel, Broadwell-E, xeon

AMD has formally announced their EPYC CPUs. While Sebastian covered the product specifications, AMD has also released performance claims against a pair of Intel’s Broadwell-E Xeons. Intel’s E5-2650 v4 processors have an MSRP of around $1170 USD each, but we don’t know how that price will compare to AMD’s offering. At first glance, pitting thirty-two cores against two twelve-core chips seems a bit unfair, although it could end up being a very fair comparison if the prices align.

amd-2017-epyc-ubuntucompile.jpg

Image Credit: Patrick Moorhead

Patrick Moorhead, who was at the event, tweeted out photos of a benchmark in which Ubuntu was compiled with GCC. EPYC completed the job in just 33.7s while the dual Broadwell-E setup took 37.2s — about 9% less time, or a roughly 10% speedup (37.2 / 33.7 ≈ 1.10). While that edge, again, stems from having a third more cores, its value depends on how much AMD is going to charge you for them versus Intel’s current pricing structure.

amd-2017-epyc-threads.jpg

Image Credit: Patrick Moorhead

This one chip also has 128 PCIe lanes, rather than Intel’s 80 total lanes spread across two chips.

ioSafe Launches 5-Bay Xeon-Based 'Server 5' Fireproof NAS

Subject: Storage | March 8, 2017 - 09:58 PM |
Tagged: xeon, raid, NAS, iosafe, fireproof

ioSafe, makers of excellent fireproof external storage devices and NAS units, has introduced what they call the 'Server 5':

Server5-front2-.jpg

The Server 5 is a completely different twist for an ioSafe NAS. While previous units have essentially been a fireproof drive cage surrounding Synology NAS hardware, the Server 5 is a full-blown server: a quad-core, Hyper-Threaded Xeon D-1520 or D-1521, 16GB of DDR4, and an Areca ARC-1225-8i hardware RAID controller (though only 5 ports are connected to the fireproof drive cage). ioSafe supports the Server 5 with Windows Server 2012 R2, or you can throw your preferred flavor of Linux on there. The 8-thread CPU and 16GB of RAM mean that you can have plenty of other services running straight off of this unit. It's not a particularly speedy CPU, but keep in mind that the Areca RAID card offloads all parity calculations from the host.

Server5-rear.jpg

Overall the Server 5 looks nearly identical to the ioSafe 1515+, but with an extra inch or two of height added to the bottom to accommodate the upgraded hardware. The Server 5 should prove to be a good way to keep local enterprise / business data protected and available immediately after a disaster. While only the hard drives will be protected in a fire, they can be popped out of the charred housing and shifted to a backup Server 5 or just migrated to another Areca-driven NAS system. For those wondering what a typical post-fire ioSafe looks like, here ya go:

1515+.jpg

Note how clean the cage and drives are (and yes, they all still work)!

Press blast appears after the break.

Source: ioSafe

Lenovo Announces New ThinkPad P51s, P51, and P71 Mobile Workstations

Subject: Systems, Mobile | February 6, 2017 - 03:37 PM |
Tagged: xeon, Thinkpad, quadro, P71, P51s, P51, nvidia, notebook, mobile workstation, Lenovo, kaby lake, core i7

Lenovo has announced a trio of new ThinkPad mobile workstations, featuring updated Intel 7th-generation Core (Kaby Lake) processors and NVIDIA Quadro graphics, and among these is the thinnest and lightest ThinkPad mobile workstation to date in the P51s.

P51s.jpg

"Engineered to deliver breakthrough levels of performance, reliability and long battery life, the ThinkPad P51s features a new chassis, designed to meet customer demands for a powerful but portable machine. Developed with engineers and professional designers in mind, this mobile workstation features Intel’s 7th generation Core i7 processors and the latest NVIDIA Quadro dedicated workstation graphics, as well as a 4K UHD IPS display with optional IR camera."

Lenovo says that the ThinkPad P51s is more than a half pound lighter than the previous generation (P50s), stating that "the P51s is the lightest and thinnest mobile workstation ever developed by ThinkPad." It measures 14.4 x 9.95 x 0.79 inches, with weight starting at 4.3 lbs.

Specs for the P51s include:

  • Up to a 7th Generation Intel Core i7 Processor
  • NVIDIA Quadro M520M Graphics
  • Choice of standard or touchscreen FHD (1920 x 1080) IPS, or 4K UHD (3840 x 2160) IPS display
  • Up to 32 GB DDR4 2133 RAM (2x SODIMM slots)
  • Storage options including up to 1 TB (5400 rpm) HDD and 1 TB NVMe PCIe SSDs
  • USB-C with Intel Thunderbolt 3
  • 802.11ac and LTE-A wireless connectivity

Lenovo also announced the ThinkPad P51, which is slightly larger than the P51s, but brings the option of Intel Xeon E3-v6 processors (in addition to Kaby Lake Core i7 CPUs), Quadro M2200M graphics, faster 2400 MHz memory up to 64 GB (4x SODIMM slots), and up to a 4K IPS display with X-Rite Pantone color calibration.

Thinkpad_P51.jpg

Finally there is the new VR-ready P71 mobile workstation, which offers up to an NVIDIA Quadro P5000M GPU along with Oculus and HTC VR certification.

"Lenovo is also bringing virtual reality to life with the new ThinkPad P71. One of the most talked about technologies today, VR has the ability to bring a new visual perspective and immersive experience to our customers’ workflow. In our new P71, the NVIDIA Pascal-based Quadro GPUs offer a stunning level of performance never before seen in a mobile workstation, and it comes equipped with full Oculus and HTC certifications, along with NVIDIA’s VR-ready certification."

Thinkpad_P71.jpg

Pricing and availability are as follows:

  • ThinkPad P51s, starting at $1049, March
  • ThinkPad P51, starting at $1399, April
  • ThinkPad P71, starting at $1849, April
Source: Lenovo

Intel Launches Xeon E7 v4 Processors

Subject: Processors | June 7, 2016 - 09:39 AM |
Tagged: xeon e7 v4, xeon e7, xeon, Intel, broadwell-ex, Broadwell

Yesterday, Intel launched eleven SKUs of Xeon processors that are based on Broadwell-EX. While I don't follow this product segment too closely, it's a bit surprising that Intel launched them so close to consumer-level Broadwell-E. Maybe I shouldn't be surprised, though.

intel-logo-cpu.jpg

These processors scale from four cores up to twenty-four, with Hyper-Threading, and are available with cache sizes from 20MB up to 60MB. In Intel's Xeon naming scheme, the digit immediately after the "E7-" in the product name denotes the number of CPUs that can be installed in a multi-socket system: the E7-8XXX line can be run in an eight-socket motherboard, while the E7-4XXX models are limited to four sockets per system. TDPs range between 115W and 165W, which is pretty high, but to be expected for a giant chip that runs at a fairly high frequency.
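The socket-count digit is simple enough to extract mechanically. A quick sketch (the function name and the example SKU strings are ours for illustration; only the E7-digit convention comes from Intel's scheme):

```python
def max_sockets(model):
    """Return the maximum socket count encoded in a Xeon E7 model name.

    Per Intel's scheme, the digit immediately after 'E7-' gives the number
    of CPUs supported in one system (e.g. 'E7-8890 v4' -> 8 sockets).
    """
    _, sep, rest = model.partition("E7-")
    if not sep or not rest or not rest[0].isdigit():
        raise ValueError(f"not an E7 model name: {model!r}")
    return int(rest[0])

print(max_sockets("E7-8890 v4"))  # -> 8
print(max_sockets("E7-4820 v4"))  # -> 4
```

The same leading-digit convention showed up across Intel's E5/E7 era (E5-2XXX for two sockets, and so on), which is part of why the later Platinum/Gold rename was such a shift.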

Intel Xeon E7 v4 launched on June 6th with list prices ranging from $1223 to $7174 per CPU.

Source: Intel

Intel to Ship FPGA-Accelerated Xeons in Early 2016

Subject: Processors | November 20, 2015 - 06:21 PM |
Tagged: xeon, Intel, FPGA

UPDATE (Nov 26th, 3:30pm ET): A few readers have mentioned that FPGAs take much less than hours to reprogram. I even received an email last night claiming FPGAs can be reprogrammed in "well under a second." This differs from the sources I read when researching their OpenCL capabilities (for potential evolutions of projects) back in ~2013. That said, multiple sources, including one who claims personal experience with FPGAs, say the hours figure is not the case. Also, I've never used an FPGA myself -- again, I was just researching them to see where some GPU-based projects could go.

Designing integrated circuits, as I've said a few times, is basically a game. You have a blank canvas that you can etch complexity into. The amount of “complexity” depends on your fabrication process, how big your chip is, the intended power, and so forth. Performance depends on how you use that complexity to compute actual tasks. If you know something special about your workload, you can optimize your circuit to do more with less. CPUs are designed to do basically anything, while GPUs assume similar tasks can be run together. If you will only ever run a single program, you can even bake some or all of it directly into hardware, called an “application-specific integrated circuit” (ASIC), which is often used for video decoding, rasterizing geometry, and so forth.

intel-2015-fpga-old-atom.png

This is an old Atom back when Intel was partnered with Altera for custom chips.

FPGAs are circuits that can be configured for a specific application, but can also be reprogrammed later. Changing tasks requires a significant amount of time (sometimes hours), but it is easier than replacing an ASIC, which involves removing it from your system, throwing it in the trash, and printing a new one. FPGAs are not quite as efficient as a dedicated ASIC, but they're about as close as you can get without translating the actual source code directly into a circuit.

Intel, after purchasing FPGA manufacturer Altera, will integrate its technology into Xeons in Q1 2016. This will be useful for offloading specific tasks that dominate a server's total workload. According to PC World, they will be integrated as a two-chip package in which both the CPU and FPGA can access the same cache. I'm not sure what form of heterogeneous memory architecture Intel is using, but this would be a great example of a part that could benefit from in-place acceleration. You could imagine a simple function baked into the FPGA to, say, process large videos in very specific ways without expensive copies.

Again, this is not a consumer product, and may never be. Reprogramming an FPGA can take hours, and I can't think of many situations where consumers would trade hours of reconfiguration time for higher performance on a specific task. Then again, it only takes one person to think of a great application for it to take off.

Source: PCWorld

Rumor: Only Xeon-based Skylake CPUs Getting AVX-512

Subject: Processors | May 27, 2015 - 09:45 PM |
Tagged: xeon, Skylake, Intel, Cannonlake, avx-512

AVX-512 is an instruction set that expands the CPU's vector registers from 256-bit to 512-bit. It comes with a core specification, AVX-512 Foundation, and several extensions that can be added where they make sense. For instance, AVX-512 Exponential and Reciprocal Instructions (ERI) helps solve transcendental problems, which occur in geometry and are useful for GPU-style architectures. As such, it appears in Knights Landing but not anywhere else.

intel-2015-instruction-set-support.png

Image Credit: Bits and Chips

Today's rumor is that Skylake, the successor to Broadwell, will not include any AVX-512 support in its consumer parts. According to the lineup, Xeons based on Skylake will support AVX-512 Foundation, Conflict Detection Instructions, Vector Length Extensions, Byte and Word Instructions, and Doubleword and Quadword Instructions. Fused Multiply-Add for 52-bit Integers (IFMA) and Vector Byte Manipulation Instructions (VBMI) will not arrive until Cannonlake shrinks everything down to 10nm.

The main advantage of larger registers is speed. When you can fit 512 bits of data in a register and operate on all of it at once, you are able to do several linked calculations together. AVX-512 can operate on sixteen 32-bit values at the same time, which is obviously sixteen times the compute throughput of doing just one at a time... if all sixteen undergo the same operation. This is especially useful for games, media, and other vector-based workloads (such as scientific computing).
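A rough way to picture the lane model is to process data in fixed 16-wide chunks, where every lane in a chunk gets the same operation. The sketch below only mimics the semantics in plain Python; real AVX-512 hardware issues one instruction per 16-lane chunk rather than looping over elements:

```python
def simd_add(a, b, lanes=16):
    """Add two equal-length lists chunk by chunk, mimicking how a 512-bit
    register holds sixteen 32-bit lanes that all run the same operation."""
    assert len(a) == len(b)
    out = []
    for i in range(0, len(a), lanes):
        # Conceptually one "instruction": every lane in the chunk does an add.
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
    return out

a = [float(i) for i in range(32)]
b = [1.0] * 32
result = simd_add(a, b)  # 32 elements covered by two 16-lane "instructions"
print(result[:4])
```

This also shows the catch mentioned above: the win only materializes when all sixteen lanes genuinely want the same operation, which is why vectorization pays off for media and scientific loops but not for branchy scalar code.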

This also makes me question whether the entire Cannonlake product stack will support AVX-512. While vectorization is a cheap way to gain performance on suitable workloads, it does take up a large number of transistors (wider registers and datapaths, extra instructions, etc.). Hopefully Intel will be able to afford the cost with the next die shrink.