Subject: Processors | May 27, 2015 - 09:45 PM | Scott Michaud
Tagged: xeon, Skylake, Intel, Cannonlake, avx-512
AVX-512 is an instruction set extension that widens the CPU's vector registers from 256 bits to 512 bits. It comes with a core specification, AVX-512 Foundation, and several extensions that can be added where it makes sense. For instance, the AVX-512 Exponential and Reciprocal Instructions (ERI) provide fast approximations of exponentials and reciprocals, the kind of transcendental math that occurs in geometry and suits GPU-style architectures. As such, they appear in Knights Landing but nowhere else.
Image Credit: Bits and Chips
Today's rumor is that Skylake, the successor to Broadwell, will not include any AVX-512 support in its consumer parts. According to the lineup, Xeons based on Skylake will support AVX-512 Foundation, Conflict Detection Instructions, Vector Length Extensions, Byte and Word Instructions, and Doubleword and Quadword Instructions. Fused Multiply and Add for 52-bit Integers and Vector Byte Manipulation Instructions will not arrive until Cannonlake shrinks everything down to 10nm.
The main advantage of larger registers is speed. When you can fit 512 bits of data in a single register and operate upon it at once, you are able to do several linked calculations together. AVX-512 can operate on sixteen 32-bit values at the same time, which is up to sixteen times the compute throughput of doing just one at a time... if all sixteen undergo the same operation. This is especially useful for games, media, and other vector-based workloads (like scientific computing).
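To put numbers on the "sixteen at a time" claim, here is a minimal sketch of the lane arithmetic in plain C. The function names are my own (hypothetical, not from Intel), and it only counts operations; it says nothing about the real-world speedup, which depends on the workload.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements ("lanes") that fit in one vector register. */
size_t lanes_per_register(size_t register_bits, size_t element_bits)
{
    assert(element_bits > 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}

/* Vector operations needed to cover `count` elements,
 * rounding up so a partly filled last vector still counts as one. */
size_t vector_ops_needed(size_t count, size_t lanes)
{
    assert(lanes > 0);
    return (count + lanes - 1) / lanes;
}
```

With these, `lanes_per_register(512, 32)` gives 16 and `lanes_per_register(256, 32)` gives 8, so 1000 single-precision values need 63 operations at 512 bits versus 125 at 256 bits.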
This also makes me question whether the entire Cannonlake product stack will support AVX-512. While vectorization is a cheap way to get performance for suitable workloads, it does take up a large number of transistors (wider registers, extra instructions, etc.). Hopefully Intel will be able to afford the cost with the next die shrink.
Subject: General Tech, Graphics Cards, Processors | July 19, 2014 - 03:05 AM | Scott Michaud
Tagged: Xeon Phi, xeon, Intel, avx-512, avx
It is difficult to know what is actually new information in this Intel blog post, but it is interesting nonetheless. Its topic is the AVX-512 extension to x86, designed for Xeon and Xeon Phi processors and co-processors. Basically, last year, Intel announced "Foundation", the minimum support level for AVX-512, as well as Conflict Detection, Exponential and Reciprocal, and Prefetch, which are optional. That earlier blog post was very much focused on Xeon Phi, but it acknowledged that the instructions will make their way to standard, CPU-like Xeons at around the same time.
This year's blog post brings in a bit more information, especially for common Xeons. While all AVX-512-supporting processors (and co-processors) will support "AVX-512 Foundation", the instruction set extensions are a bit more scattered.
| Extension | Xeon | Xeon Phi processors | Xeon Phi co-processors |
|---|---|---|---|
| Conflict Detection Instructions | Yes | Yes | Yes |
| Exponential and Reciprocal Instructions | No | Yes | Yes |
| Byte and Word Instructions | Yes | No | No |
| Doubleword and Quadword Instructions | Yes | No | No |
| Vector Length Extensions | Yes | No | No |
Source: Intel AVX-512 Blog Post (and my understanding thereof).
So why do we care? Simply put: speed. Vectorization, the purpose of AVX-512, has similar benefits to multiple cores. It is not as flexible as having multiple, unique, independent cores, but it is easier to implement (and works just fine with having multiple cores, too). For example, imagine that you have to multiply two colors together. The direct way to do it is to multiply red with red, green with green, blue with blue, and alpha with alpha. AMD's 3DNow! packed two single-precision values into each register and, later, Intel's SSE packed four, which is enough to multiply two four-component vectors with one instruction. This reduces four similar instructions into a single operation on wider registers.
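The color example can be sketched in portable C. This is an illustration, not how any particular engine does it: `rgba_multiply` is a name I made up, and whether the loop turns into a single packed multiply is up to the compiler and its target flags.

```c
#include <assert.h>
#include <stddef.h>

#define COMPONENTS 4 /* red, green, blue, alpha */

/* Component-wise product of two colors: the same multiply on each lane.
 * A vectorizing compiler may turn this loop into one packed multiply;
 * that is an optimization it is allowed to make, not a guarantee. */
void rgba_multiply(const float a[COMPONENTS],
                   const float b[COMPONENTS],
                   float out[COMPONENTS])
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < COMPONENTS; ++i) {
        out[i] = a[i] * b[i];
    }
}
```

Written out by hand, that loop is four separate multiplies; with four-wide registers, all four components fit in one operation.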
Smart compilers (and programmers, although that is becoming less common as compilers are pretty good, especially when they are not fighting developers) are able to pack seemingly unrelated data together, too, if the data undergoes the same operations. AVX-512 allows for sixteen 32-bit pieces of data to be worked on at the same time. If your pixel only has four single-precision RGBA values, but you are looping through 2 million pixels, you can process four pixels at a time (16 components).
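Here is a rough sketch of that four-pixels-at-a-time idea in portable C. It assumes an interleaved RGBA float buffer and a hypothetical `tint_pixels` function. The inner 16-iteration loop stands in for what a single 512-bit multiply would do; no AVX-512 intrinsics are used, so whether it actually gets vectorized depends on the compiler.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16      /* sixteen 32-bit floats = 512 bits */
#define COMPONENTS 4  /* R, G, B, A */

/* Multiply every pixel of an interleaved RGBA buffer by `tint`,
 * working through 16 floats (four whole pixels) per step. */
void tint_pixels(float *pixels, size_t pixel_count,
                 const float tint[COMPONENTS])
{
    assert(pixels != NULL && tint != NULL);

    /* Repeat the tint four times so it lines up with four packed pixels. */
    float wide_tint[LANES];
    for (size_t lane = 0; lane < LANES; ++lane) {
        wide_tint[lane] = tint[lane % COMPONENTS];
    }

    size_t total = pixel_count * COMPONENTS;
    size_t i = 0;

    /* Main loop: each pass covers one 16-lane multiply's worth of data. */
    for (; i + LANES <= total; i += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane) {
            pixels[i + lane] *= wide_tint[lane];
        }
    }

    /* Tail: up to three leftover pixels, handled one component at a time. */
    for (; i < total; ++i) {
        pixels[i] *= tint[i % COMPONENTS];
    }
}
```

For 2 million pixels, the main loop runs 500,000 times instead of the 8 million individual multiplies a purely scalar version would issue.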
For the record, I basically just described "SIMD" (single instruction, multiple data) as a whole.
This theory is part of how GPUs became so powerful at certain tasks. They are capable of pushing a lot of data because they can exploit similarities. If your task is full of similar problems, they can just churn through tonnes of data. CPUs have been doing these tricks, too, just without compromising what they do well.