AMD Vega GPU Architecture Preview: Redesigned Memory Architecture
Primitive Shader, Tile-based Rasterization
A New Geometry Primitive Shader
With the Vega GPU architecture, AMD is aiming to reinvent the geometry pipeline. One of the fundamental problems with modern GPU rasterization is the need to filter through polygons that will never be seen in order to shade only the pixels necessary to render the output for the display. AMD gives an example from a 4K scene in the latest Deus Ex that starts with 220M polygons of data, of which only 2M are viewable. That roughly 100x reduction is a significant part of the GPU development side of things: culling, order independence and other features.
The new programmable geometry pipeline on Vega will offer up to 2x the peak throughput per clock compared to previous generations by utilizing a new “primitive shader.” This new shader combines the functions of the vertex and geometry shaders and, as AMD told it to me, “with the right knowledge” you can discard game-based primitives at an incredible rate. This right knowledge, though, is the crucial component – it is something that has to be coded for directly and isn’t something that AMD or Vega will be able to do behind the scenes.
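To make the idea of discarding primitives early more concrete, here is a minimal CPU-side sketch in Python. It is not AMD's primitive shader interface, which has not been published; the function names (`signed_area`, `is_visible`, `cull`) and the two tests used (back-face and off-screen rejection on already-projected triangles) are assumptions chosen purely for illustration.

```python
# Illustrative sketch only: a CPU-side model of early primitive discard.
# This is NOT AMD's primitive shader API; every name here is hypothetical.
from typing import List, Tuple

Vec2 = Tuple[float, float]
Triangle = Tuple[Vec2, Vec2, Vec2]  # vertices already projected, x and y in [-1, 1]


def signed_area(tri: Triangle) -> float:
    """Twice the signed area; positive means counter-clockwise winding."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)


def is_visible(tri: Triangle) -> bool:
    """Keep a triangle only if it faces the viewer and may touch the viewport."""
    if signed_area(tri) <= 0.0:  # back-facing or zero-area
        return False
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    # Bounding box entirely outside the [-1, 1] viewport: discard.
    if max(xs) < -1.0 or min(xs) > 1.0 or max(ys) < -1.0 or min(ys) > 1.0:
        return False
    return True


def cull(triangles: List[Triangle]) -> List[Triangle]:
    """Return only the triangles worth sending on to rasterization."""
    return [t for t in triangles if is_visible(t)]


scene = [
    ((0.0, 0.0), (0.5, 0.0), (0.0, 0.5)),  # front-facing, on screen
    ((0.0, 0.0), (0.0, 0.5), (0.5, 0.0)),  # back-facing (clockwise)
    ((2.0, 2.0), (3.0, 2.0), (2.0, 3.0)),  # entirely off screen
    ((0.0, 0.0), (0.5, 0.5), (1.0, 1.0)),  # degenerate (zero area)
]
survivors = cull(scene)
print(f"{len(survivors)} of {len(scene)} triangles survive culling")
```

In this toy scene only one of the four triangles survives. The point is simply that cheap per-primitive tests, run before any pixel work, can remove most of the input, which is the kind of "right knowledge" a developer would have to supply.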
Developers could implement this primitive shader type by simply wrapping current vertex shader code, which the Vega 10 driver packages would recognize in order to speed up throughput (to that 2x rate). Another way this could be utilized is with extensions to current APIs (Vulkan seems like an obvious choice), and the hope is that this kind of shader will be adopted and implemented officially by upcoming API revisions, including the next DirectX. AMD views the primitive shader as the natural progression of the geometry engine and the end of standard vertex and geometry shaders. In the end, that will be the complication with this new feature (as well as others) – its benefit to consumers and game developers will be dependent on the integration and adoption rates from developers themselves. We have seen in the past that AMD can struggle with pushing its own standardized features on the industry (but in some cases has had success, à la FreeSync).
A Next-Generation Compute Unit (NCU)
Vega will introduce a revision and update to the compute unit that AMD has been evolving over the past years under the GCN name. We really don’t know much except that single precision operations per clock per compute unit will be the same at 128. What is new is that Vega will support 16-bit (half precision) and 8-bit ops in a packed math form, essentially doubling the operation throughput at 16-bit to 256. While previous architectures also supported 16-bit math, Vega will take optimization and throughput to a higher level, making the NCU the “most flexible mixed precision compute unit” in the industry.
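As a rough illustration of what "packed math" means, the Python sketch below stores two half-precision values in a single 32-bit word and adds them lane by lane. The helper names (`pack_half2`, `unpack_half2`, `add_half2`) are hypothetical, and this models only the data layout and the 128 to 256 arithmetic quoted above, not Vega's actual instructions or rounding behavior.

```python
# Illustrative sketch only: two FP16 lanes sharing one 32-bit register.
# Helper names are hypothetical; hardware rounding is not modeled.
import struct
from typing import Tuple


def pack_half2(lo: float, hi: float) -> int:
    """Pack two FP16 values into one 32-bit word (lo in bits 0-15)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack_half2(word: int) -> Tuple[float, float]:
    """Split a 32-bit word back into its two FP16 lanes."""
    return struct.unpack("<2e", struct.pack("<I", word))


def add_half2(a: int, b: int) -> int:
    """One 'packed' add: both lanes are summed in a single operation."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    return pack_half2(a_lo + b_lo, a_hi + b_hi)


# Per-clock, per-CU figures quoted in the article.
FP32_OPS_PER_CLOCK = 128
FP16_LANES_PER_REGISTER = 2
fp16_ops_per_clock = FP32_OPS_PER_CLOCK * FP16_LANES_PER_REGISTER

x = pack_half2(1.0, 2.0)
y = pack_half2(0.5, 0.25)
print(unpack_half2(add_half2(x, y)))  # lane-wise sum of the two pairs
print(fp16_ops_per_clock)
```

Because each 32-bit slot carries two 16-bit values, the same number of issued operations touches twice as many operands, which is where the doubling to 256 comes from.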
Without more information on NCU quantities or clock speeds we cannot make any judgements on estimated GPU performance or how different the NCU will be from the current CU being used in Polaris.
Next-Generation Pixel Engine – A Tile-based Approach
Before the subheading above gets your knickers in a twist, let me be sure to point out that anything that happens in the new Vega Draw Stream Binning Rasterizer (DSBR) is simply one option that the GPU will have available to it. The goal of the DSBR is to save power and improve performance in some instances. The goal of every rasterizer is to cull pixels that are not visible in the scene so you are only shading pixels required at display time. Vega’s new DSBR will be able to do this in a tile-based manner, a rasterization technique that has traditionally been limited to mobile SoCs like Qualcomm’s Adreno but that even NVIDIA’s Maxwell architecture implemented for desktop users.
The Draw Stream Binning Rasterizer uses cache-aware information to capture batches of primitives in a way that has two positive effects. First, you will very often find multiple hits in the same proximity and second, this creates a new way to determine which pixels to shade. This reduces access to memory and to the off-chip caches to save power. It increases effective bandwidth on lower cost parts and can reduce power on even the highest performing GPU implementations.
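The general idea of binning primitives into screen tiles can be sketched in a few lines of Python. AMD has not detailed how the DSBR forms its batches, so the following is a generic, conservative bounding-box binner rather than Vega's algorithm; `bin_triangles`, the 64x64 screen and the 32-pixel tile size are all assumptions made for the example.

```python
# Illustrative sketch only: generic bounding-box tile binning.
# This is NOT the DSBR algorithm; names and sizes are hypothetical.
from collections import defaultdict
from typing import Dict, List, Tuple

Vec2 = Tuple[float, float]
Triangle = Tuple[Vec2, Vec2, Vec2]  # screen-space pixel coordinates
TileId = Tuple[int, int]            # (tile column, tile row)


def bin_triangles(triangles: List[Triangle], width: int, height: int,
                  tile_size: int) -> Dict[TileId, List[int]]:
    """Map each screen tile to the indices of triangles that may cover it."""
    bins: Dict[TileId, List[int]] = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Skip triangles whose bounding box misses the screen entirely.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        # Clamp the bounding box to the screen, then convert to tile indices.
        tx0 = int(max(min(xs), 0)) // tile_size
        tx1 = int(min(max(xs), width - 1)) // tile_size
        ty0 = int(max(min(ys), 0)) // tile_size
        ty1 = int(min(max(ys), height - 1)) // tile_size
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


triangles = [
    ((2.0, 2.0), (20.0, 4.0), (5.0, 25.0)),       # inside the top-left tile
    ((40.0, 40.0), (60.0, 42.0), (45.0, 60.0)),   # inside the bottom-right tile
    ((10.0, 10.0), (50.0, 12.0), (12.0, 20.0)),   # spans the two top tiles
]
bins = bin_triangles(triangles, width=64, height=64, tile_size=32)
for tile, members in sorted(bins.items()):
    print(tile, members)
```

Once primitives are grouped this way, all the work for one tile touches the same small region of the frame, so it can stay in on-chip storage instead of repeatedly going out to memory. A bounding-box test may list a triangle in a tile it does not actually cover; a real rasterizer would refine that later.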
Render Back-Ends Gain Access to L2
The final bit of information we know about the upcoming Vega architecture revolves around the render back-ends, or ROPs. In a legacy architecture like Polaris, memory accesses for pixel and texture data were non-coherent. This means you couldn't render to a texture and read it again without re-accessing memory. This render-then-read pattern is common in current games and in particular for VR systems like Oculus that output a final image to a texture that is then modified again by the Oculus runtime. Now the ROPs will be able to access the L2 cache, improving performance for that VR implementation as well as for any game engine that uses deferred shading.
The beginning of the end…
…of the beginning. Or something to that effect. Today marks the start of official data dumps about Vega 10 and its associated products. If Polaris is our guide, I would expect to see subsequent information releases by AMD to maintain a balanced level of excitement, curiosity and braggadocious behavior.
By far the most interesting information released today is the move to include external memory options for the GPU other than simply the on-board memory, now referred to as the High Bandwidth Cache (HBC). The potential for this kind of memory system is substantial, though I would wager the impact on enthusiast gaming will be minimal out of the gate and for a couple of generations. For professional and enterprise use cases though, having access to a cohesive memory system that includes HBM2, flash, system and network storage could create a massive disruption in the development cycle.
It sounds cliché, but it continues to look like it’s going to be true: 2017 is shaping up to be a substantially transformative year for computing and gaming, and AMD is definitely going to have its say on the matter.