It comes after 8, but before 10

We finally have some more information about the graphics architecture found in Skylake.

As the week of Intel’s Developer Forum (IDF) begins, you can expect to see a lot of information about Intel’s 6th Generation Core architecture, codenamed Skylake, finally revealed. When I posted my review of the Core i7-6700K, the first product based on that architecture to be released in any capacity, I was surprised that Intel was willing to ship product without the normal amount of background information for media and developers. Rather than give us the details and then ship product, which has happened for essentially every consumer product release I have been a part of, Intel did the reverse: ship a consumer-friendly CPU and then promise to tell us how it all works later in the month at IDF.

Today I came across a document posted on Intel’s website that dives into very specific detail on the new Gen9 graphics and compute architecture of Skylake. Details on the Core architecture changes are not present; instead we are given details on how the traditional GPU portion of the SoC has changed. To be clear: I haven’t had any formal briefing from Intel on this topic or on anything surrounding the architecture of Skylake or the new Gen9 graphics system, but I wanted to share the details we found available. I am sure we’ll learn more this week as IDF progresses, so I will update this story where necessary.

What Intel calls Processor Graphics is what we used to call simply integrated graphics for the longest time. The purpose and role of processor graphics have changed drastically over the years, and it is now responsible not only for 3D graphics rendering but also for the compute, media and display capabilities of the Intel Skylake SoC (when discrete add-in graphics is not used). The architecture document used to source this story focuses on Gen9 graphics, the compute architecture utilized in the latest Skylake CPUs. The Intel HD Graphics 530 on the Core i7-6700K / Core i5-6600K is the first product released and announced using Gen9 graphics and is also the first to adopt Intel’s new 3-digit naming scheme.

This die shot of the Core i7-6700K shows the increased size and prominence of the Gen9 graphics in the overall SoC design. Containing four traditional x86 CPU cores and 1 “slice” implementation of Gen9 graphics (with three visible sub-slices we’ll describe below), this is not likely to be the highest performing iteration of the latest Intel HD Graphics technology.

Like the Intel processors before it, the Skylake design utilizes a ring bus architecture to connect the different components of the SoC. This bi-directional interconnect has a 32-byte wide data bus and connects to multiple “agents” on the CPU. Each individual CPU core is considered its own agent while the Gen9 compute architecture is considered one complete agent. The system agent bundles the DRAM memory controller, the display controller, PCI Express and other I/O interfaces that communicate with the rest of the PC. Any off-chip memory requests and transactions occur through this bus while on-chip data transfers tend to be handled differently.

The Skylake-based SoCs can include additional caches like a shared LLC (last level cache) or the eDRAM (embedded DRAM) that made Intel’s Iris graphics so interesting last generation. If an LLC is included on the processor it will connect through the same ring bus interface and each on-die core is allocated a slice of the cache with access granted to the graphics agent too. The distributed LLC reduces apparent latency to external DRAM and increases effective bandwidth.

An optional eDRAM component, ranging from 64 MB to 128 MB, could be included with the Gen9 graphics as well. This memory is not on-die but is on-package, operating at its own clock rate as high as 1.6 GHz. It has separate buses for reads and writes, each capable of 32 bytes per cycle. The eDRAM's role expands in Gen9 compared with Gen8 implementations: it now acts as a memory-side cache between the LLC mentioned above and DRAM, again lowering apparent latency and raising effective memory bandwidth. This cache can be shared between the GPU and CPU cores on the chip. A new application for the eDRAM to improve power efficiency is dubbed “low-latency display surface refresh”, which gives enabled systems the ability to repeat frame buffer output without accessing system memory at all.
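To put those per-cycle numbers in more familiar terms, here is a quick back-of-the-envelope sketch in Python. It assumes the eDRAM runs at its full 1.6 GHz peak clock and moves the full 32 bytes on every cycle, which real workloads will not sustain, so treat the result as an upper bound. The function and constant names are my own, not Intel's.

```python
# Idealized eDRAM bandwidth: peak clock, full-width transfer every cycle.
# This is an upper bound under stated assumptions, not a measurement.

def peak_bandwidth_gb_s(bytes_per_cycle: int, clock_mhz: int) -> float:
    """Return peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    # bytes/cycle * 10**6 cycles/s = MB/s; divide by 1000 for GB/s.
    return bytes_per_cycle * clock_mhz / 1000

EDRAM_BYTES_PER_CYCLE = 32  # per direction, per the Intel document
EDRAM_CLOCK_MHZ = 1600      # "as high as 1.6 GHz"

# Reads and writes use separate buses, so each direction gets its own figure.
read_bw = peak_bandwidth_gb_s(EDRAM_BYTES_PER_CYCLE, EDRAM_CLOCK_MHZ)
write_bw = peak_bandwidth_gb_s(EDRAM_BYTES_PER_CYCLE, EDRAM_CLOCK_MHZ)

print(f"eDRAM read:  {read_bw:.1f} GB/s")
print(f"eDRAM write: {write_bw:.1f} GB/s")
```

Under those assumptions that works out to 51.2 GB/s in each direction.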

In most ways, the Gen9 graphics and compute architecture is very similar to that of Gen8. There are some refinements in the memory hierarchy, compute capabilities and product configuration options, however, that will give products based on Skylake a significant performance advantage (as our first Core i7-6700K review showcased).

From the memory angle, there are several changes in Gen9 graphics to take note of. Write performance on coherent SVM (shared virtual memory) is increased thanks to new policies using the last level cache (LLC), while L3 cache capacity has been bumped up to 768 KB per “slice”. Request queue sizes for L3 and LLC have been increased to enhance latency hiding in many cases, and texture samplers now support the NV12 YUV format, which improves surface sharing capability for compute APIs. The most dramatic change is the move of the eDRAM memory controller to the system agent to allow for display refresh and to act as a memory-side cache for the entire chip.

Oh, and let’s not forget that many implementations of Gen9 graphics on Skylake will take advantage of DDR4 for primary system memory rather than DDR3.

Compute capability improvements are a bit more obscure and include things like thread-level preemption for compute applications, round-robin thread scheduling and native 32-bit atomic operation support.

Product flexibility is likely where consumers will see the biggest advantage, as it means more specific GPU iterations for each product segment in the market. Gen9 graphics can be designed in 1-3 slice configurations supporting as many as 72 EUs (execution units), and GPU implementations can be modified on a sub-slice basis to help with binning and yields. Additional power gating and clock domains can enable drastically improved power efficiency for media playback.

The Execution Unit (EU) itself remains very similar to the Gen8 design we saw on Intel’s previous generation of processors.

Each EU can support 7 threads, each with 128 general purpose registers. The EUs support SMT and IMT multi-threading and utilize a pair of SIMD FPUs for both floating point and integer computing, making each EU capable of 16 32-bit floating point operations every cycle.
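For anyone wondering where the 16 comes from, here is a small Python sketch of the arithmetic. The breakdown is my own reading of Intel's document rather than anything stated above: I am assuming each FPU works on four 32-bit lanes at once and that a multiply-add is counted as two operations.

```python
# Per-EU peak throughput, broken into its assumed factors.
FPUS_PER_EU = 2         # "a pair of SIMD FPUs"
SIMD_LANES_PER_FPU = 4  # assumption: four 32-bit lanes per FPU
OPS_PER_MAD = 2         # assumption: multiply-add counts as two operations

flops_per_eu_per_cycle = FPUS_PER_EU * SIMD_LANES_PER_FPU * OPS_PER_MAD
print(f"{flops_per_eu_per_cycle} 32-bit FLOP per EU per cycle")
```

If either assumption is off, the factors change, but the product has to land on the 16 operations per cycle quoted above.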

In the Gen9 graphics architecture these EUs are organized in subslices, which can be architected with any number of EUs. However, documentation shows that for most purposes Intel has settled on 8 EUs per subslice as the best balance of hardware and efficiency. Each subslice contains its own local thread dispatcher and its own instruction caches. Sampler units (normally called texture units on other GPUs) and a data port (load/store) are included on each subslice as well. Compared to the Haswell-based Gen7.5 graphics architecture, both Gen8 and Gen9 lowered per-subslice EU counts from 10 to 8, improving local bandwidth internally on the structure.

The texture/sampler units in each subslice are read-only memory fetch units that sample texture and image surfaces. Each has dedicated L1 and L2 caches and supports common compression and decompression formats while doing the work for texture filtering, including anisotropic filtering.

A combination of subslices is called, as you might expect, a slice. For most of the Gen9 based products there will be three subslices included in each slice for a total of 24 EUs. Just like we see in other discrete GPU solutions, you will likely see some instances of Intel disabling EUs in order to improve yields, power consumption or just to create market differentiation.
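The EU math for the different configurations can be sketched in a few lines of Python. The helper name and the `disabled_eus` parameter are hypothetical, added only to illustrate the idea of fused-off units; they do not describe any announced product.

```python
# EU counts for Gen9 configurations, using the figures from the article.
EUS_PER_SUBSLICE = 8
SUBSLICES_PER_SLICE = 3
MAX_SLICES = 3

def total_eus(slices: int, disabled_eus: int = 0) -> int:
    """EUs available in a design with `slices` slices.

    `disabled_eus` is a hypothetical knob modeling units turned off
    for yield, power or market segmentation.
    """
    if not 1 <= slices <= MAX_SLICES:
        raise ValueError("Gen9 is described as scaling from 1 to 3 slices")
    full = slices * SUBSLICES_PER_SLICE * EUS_PER_SUBSLICE
    if not 0 <= disabled_eus <= full:
        raise ValueError("cannot disable more EUs than exist")
    return full - disabled_eus

for slices in range(1, MAX_SLICES + 1):
    print(f"{slices} slice(s): {total_eus(slices)} EUs")
```

That gives 24, 48 and 72 EUs for the one-, two- and three-slice designs, matching the 72 EU ceiling mentioned earlier.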

Additional logic is included in each slice for thread dispatch, along with L3 cache and fixed function logic for atomic operations. For the Gen9 graphics architecture that L3 size has increased to 768 KB. Application context will determine how much of this L3 cache is allocated to data cache, system buffers for pipeline handling and simple shared local memory. For example, 3D workloads tend to need more system buffers for fixed function pipelines, while for compute applications the need for data cache is heavier.

Each subslice has its own connection to the L3 cache for both sampler/texture and data traffic. Each connection delivers 64 bytes per cycle, or 192 bytes per cycle in aggregate across the three subslices. Filling the cache after a miss, however, only happens at 64 bytes per cycle.
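A quick Python sketch makes the imbalance between reading from the L3 and refilling it easier to see. The numbers come straight from the paragraph above; the variable names are mine, and the conclusion about hit rates is my inference rather than something Intel states.

```python
# Per-cycle L3 bandwidth inside one Gen9 slice.
BYTES_PER_SUBSLICE_PER_CYCLE = 64  # each subslice's L3 connection
SUBSLICES_PER_SLICE = 3
L3_FILL_BYTES_PER_CYCLE = 64       # refill rate on a cache miss

aggregate_read = BYTES_PER_SUBSLICE_PER_CYCLE * SUBSLICES_PER_SLICE
read_to_fill_ratio = aggregate_read / L3_FILL_BYTES_PER_CYCLE

print(f"Aggregate L3 read: {aggregate_read} bytes/cycle")
print(f"Read-to-fill ratio: {read_to_fill_ratio:.0f}:1")
```

In other words, the three subslices can drain the L3 three times faster than it can be refilled, so keeping the working set resident in that 768 KB matters a lot for sustained throughput.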

A total product architecture is a combination of one or more slices with additional front end logic for command submission, fixed function hardware for 3D and media pipelines as well as a dedicated graphics technology interface (GTI) for accessing the rest of the components on the SoC.

The above diagram shows the Intel Gen9 graphics implementation on the Core i7-6700K, utilizing 24 EUs. The command streamer takes submitted work from driver stacks and properly routes commands to the appropriate units. As compute workloads arrive the thread dispatcher is responsible for thread distribution on EUs based on load.

Unlike the rest of the processor that needs to communicate through the ring bus, the Gen9 graphics architecture is able to talk with the LLC, eDRAM and even primary system memory through the GTI. It also allows the GPU and CPU to communicate efficiently and gives the CPU cores access to the global memory atomics. The GTI is also responsible for interfacing the GPU clock and other SoC clock domains to implement power management.

Possible larger configurations for Gen9 graphics on Skylake

The bus between the GTI and the LLC is adjustable based on performance needs and power concerns. As in Gen8 designs, a processor can utilize 64-byte-per-cycle reads and writes for the best performance, or a product design could implement a 32-byte-per-cycle write limit to improve power efficiency at a cost of performance.

New to Gen9 graphics is full support for global memory coherency between the CPU cores and the GPU. Along with that change comes support for the new Intel VT (Virtualization Technology) Directed I/O updates that include support for shared memory between the CPU cores and graphics architecture in a virtualized environment. This opens up some interesting opportunities for new platform designs going forward.

These new mechanisms to maintain memory coherency enable better support for emerging APIs like OpenCL and DirectCompute. The net result is that pointer-heavy applications and structures can be shared directly between the CPU cores and GPU infrastructure without the performance-crippling and power-hungry copying of data. This has long been the promise of heterogeneous computing paradigms, and Intel’s Gen9 graphics and compute architecture gets us closer to that as a universal standard.

This diagram gives us a great view of the entire memory hierarchy on a Skylake + Gen9 SoC with per-clock bandwidth data labeling each connection pathway.

As of today, with only a single configuration of the Gen9 graphics actually announced, the data in the table below is woefully incomplete. But it does give us the ability to measure and estimate performance capability of the current Intel HD Graphics 530 and future designs with more EUs.

At 384 FLOP/cycle and a peak clock rate of 1.15 GHz according to Intel’s website, that gives the Intel HD Graphics 530 a peak throughput of 441.6 single precision GFLOPS. That is great for an Intel integrated GPU but only puts it between the performance available to a GeForce GT 720 and GT 730 when compared to discrete components, and at a little more than half the peak performance of AMD’s Kaveri APUs (845 GFLOPS).
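Here is that calculation as a short Python sketch, so you can plug in other clocks or EU counts. It assumes every EU issues its full 16 operations on every cycle at the peak clock, which is a theoretical ceiling and not a benchmark result. The constant names are my own.

```python
# Theoretical peak single precision throughput of the HD Graphics 530.
EUS = 24
FLOPS_PER_EU_PER_CYCLE = 16
PEAK_CLOCK_MHZ = 1150  # 1.15 GHz peak clock
KAVERI_GFLOPS = 845    # AMD figure quoted in the article

flops_per_cycle = EUS * FLOPS_PER_EU_PER_CYCLE
# FLOP/cycle * 10**6 cycles/s = MFLOPS; divide by 1000 for GFLOPS.
peak_gflops = flops_per_cycle * PEAK_CLOCK_MHZ / 1000
share_of_kaveri = peak_gflops / KAVERI_GFLOPS

print(f"{flops_per_cycle} FLOP/cycle -> {peak_gflops:.1f} GFLOPS")
print(f"{share_of_kaveri:.0%} of the quoted Kaveri peak")
```

That reproduces the 384 FLOP/cycle and 441.6 GFLOPS figures, and puts the HD Graphics 530 at roughly 52% of the Kaveri number.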

However, if you extrapolate from the 2- and 3-slice configurations of Gen9 that we assume will happen at some point, banking on the same frequency, Intel could have future parts running into the 880 GFLOPS and 1.32 TFLOPS territory. Add in the benefits of the eDRAM on future Iris designs using Skylake and you will have some very competitive integrated solutions from Intel this year.
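The extrapolation is easy to reproduce. The sketch below is my own and leans on the same assumptions I made above: 24 EUs per slice, 16 FLOP per EU per cycle, and the HD Graphics 530's 1.15 GHz peak clock carried over unchanged to larger parts that Intel has not announced.

```python
# Extrapolated peak throughput for hypothetical multi-slice Gen9 parts.

def peak_gflops(slices: int, clock_mhz: int = 1150,
                eus_per_slice: int = 24, flops_per_eu: int = 16) -> float:
    """Theoretical peak GFLOPS; the default clock is an assumption."""
    return slices * eus_per_slice * flops_per_eu * clock_mhz / 1000

for slices in (1, 2, 3):
    print(f"{slices} slice(s): {peak_gflops(slices):.1f} GFLOPS")
```

At the same clock, the two- and three-slice designs come out at about 883 GFLOPS and 1.32 TFLOPS. Passing a lower `clock_mhz` shows how quickly those numbers shrink if larger parts ship at lower frequencies.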

Our overall picture of Intel’s Gen9 graphics and compute architecture, and Skylake in general, just got a little clearer with the release of this information today before IDF. I still need to know what other SKUs and configurations are going to be built based on these designs and if Intel will indeed build a processor with a 72 EU implementation – it would be an interesting move for a company that hasn’t put the same emphasis on processor graphics that AMD has, traditionally.

Expect much more on Skylake and the future of Intel’s 6th Generation Core products this week from IDF!