The GM204 Architecture

NVIDIA’s new GM204 GPUs are finally revealed.

James Clerk Maxwell's equations are the foundation of our society's knowledge about optics and electrical circuits. It is a fitting tribute from NVIDIA to use Maxwell as the code name for a GPU architecture, and NVIDIA hopes that the features, performance, and efficiency it has built into the GM204 GPU would have impressed Maxwell himself. Without giving away the surprise conclusion here in the lead, I can tell you that I have never seen a GPU perform as well as we have seen this week, all while changing the power efficiency discussion just as dramatically.

To be fair though, this isn't our first experience with the Maxwell architecture. With the release of the GeForce GTX 750 Ti and its GM107 GPU, NVIDIA put the industry on watch and left us all to ponder whether they could possibly bring such a design to the high-end, enthusiast-class market. The GTX 750 Ti brought a significantly lower power design to a market that desperately needed it, and we were even able to showcase that with some off-the-shelf PC upgrades, without the need for any kind of external power.

That was GM107 though; today's release is the GM204, indicating that not only are we seeing the larger cousin of the GTX 750 Ti but we also have at least some moderate GPU architecture and feature changes from the first run of Maxwell. The GeForce GTX 980 and GTX 970 are going to be taking on the best of the best products from the GeForce lineup as well as the AMD Radeon family of cards, with aggressive pricing and performance levels to match. And, for those who understand the technology at a fundamental level, you will likely be surprised by how little power it requires to achieve these goals. Toss in support for things like a new AA method, Dynamic Super Resolution, and even improved SLI performance and you can see why doing it all on the same process technology is impressive.

The NVIDIA Maxwell GM204 Architecture

The NVIDIA Maxwell GM204 graphics processor was built from the ground up with an emphasis on power efficiency. As was stated many times during the technical sessions we attended last week, the architecture team learned quite a bit while developing the Kepler-based Tegra K1 SoC, and much of that filtered its way into the larger, much more powerful product you see today. This product is fast and efficient, and it was all done on the same TSMC 28nm process technology used on the Kepler GTX 680 and even AMD's Radeon R9 series of products.

The fundamental structure of GM204 is set up like the GM107 product shipped as the GTX 750 Ti. There is an array of GPCs (Graphics Processing Clusters), each composed of multiple SMs (Streaming Multiprocessors, also called SMMs for this Maxwell derivative), plus external memory controllers. The GM204 chip (the full implementation of which is found on the GTX 980) consists of 4 GPCs, 16 SMMs and four 64-bit memory controllers.

Each SMM features 128 CUDA cores, or stream processors, bringing the total for this product to 2048. That is a significant drop from the 2880 CUDA cores found in the GTX 780 Ti (full GK110 chip) but as you'll soon find out, thanks to the higher clock speeds and performance efficiency changes, the GTX 980 matches or beats the GTX 780 Ti in every test we have run.

The SMM also features an improved Polymorph engine for geometry processing and 8 texture units. All of the above combines for a total of 128 texture units, a 256-bit memory bus, 64 ROPs (raster operators), and 2MB of L2 cache. 
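The chip-wide totals quoted here follow directly from the per-unit counts. As a quick sanity check, here is a minimal Python sketch; the dictionary layout and the `chip_totals` helper are illustrative names of my own, not anything from NVIDIA's tooling.

```python
# Per-unit figures quoted in the text for the full GM204 (GTX 980).
GM204 = {
    "gpcs": 4,
    "smms": 16,
    "cores_per_smm": 128,
    "texture_units_per_smm": 8,
    "memory_controllers": 4,
    "bits_per_controller": 64,
}


def chip_totals(cfg):
    """Derive chip-wide totals from per-unit counts (hypothetical helper)."""
    return {
        "smms_per_gpc": cfg["smms"] // cfg["gpcs"],
        "cuda_cores": cfg["smms"] * cfg["cores_per_smm"],
        "texture_units": cfg["smms"] * cfg["texture_units_per_smm"],
        "bus_width_bits": cfg["memory_controllers"] * cfg["bits_per_controller"],
    }


totals = chip_totals(GM204)
for name, value in totals.items():
    print(f"{name}: {value}")
```

The ROP count and L2 size are not derived here because the text gives them only as chip totals, not per-unit figures.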

The GeForce GTX 680, the GK104-based graphics card that was launched in March of 2012, provides some interesting comparisons. First, the obvious: the GTX 980 will have 33% more processor cores, higher clock speeds, and thus a much higher peak compute rate. Texture unit count remains the same but the doubling of ROP units gives the Maxwell GPU better performance in high resolution anti-aliasing. Memory bandwidth is increased by a modest amount, but NVIDIA has made further enhancements in that area with GM204 as well.

Look at those bottom four statistics though: GM204 has 1.66 billion more transistors and a 35% larger die yet is able to quote a TDP that is 30 watts lower than GK104 using the same 28nm process. If you take performance into consideration though, the GTX 980 should be going up against the GK110-based GTX 780 Ti, a GPU that has a 250 watt TDP and a 7.1 billion transistor count (and a die size of 551 mm^2). As we dive into the benchmarks on the following pages, you will gain an understanding of why this is so interesting.

The SMMs of Maxwell are a fundamental change when compared to Kepler. Rather than a single block of 192 shaders, the SMM is divided into four distinct blocks that each have a separate instruction buffer, scheduler, and 32 dedicated, non-shared CUDA cores. NVIDIA states that this simplifies the design and scheduling logic required for Maxwell, saving on area and power. Pairs of these blocks are grouped together and share four texture filtering units and a texture cache. Shared memory is a separate pool that is shared amongst all four processing blocks of the SMM.

With these changes, the SMM can offer 90% of the compute performance of the Kepler SM but with a smaller die area, which allows NVIDIA to integrate more of them per die. GK104 had 8 SMs (1536 CUDA cores) while GM204 has 16 SMMs (2048 CUDA cores), giving it twice the SM count.
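One way to read the 90% figure: an SMM has only 128 of the Kepler SM's 192 cores, so delivering 90% of the throughput implies that each core is doing noticeably more work. The back-of-the-envelope sketch below uses only numbers from the text; the variable names are mine, and the aggregate figure assumes equal clocks and ignores every other bottleneck, so treat it as a rough reading and not a benchmark prediction.

```python
KEPLER_CORES_PER_SM = 192
MAXWELL_CORES_PER_SMM = 128
SMM_RELATIVE_PERF = 0.90  # SMM throughput relative to one Kepler SM, per the text

# Fraction of a Kepler SM's cores that one SMM contains.
core_ratio = MAXWELL_CORES_PER_SMM / KEPLER_CORES_PER_SM

# Implied per-core throughput gain if 2/3 of the cores do 90% of the work.
per_core_gain = SMM_RELATIVE_PERF / core_ratio

# SM counts quoted for the two chips.
gk104_sms, gm204_smms = 8, 16

# Rough aggregate throughput of GM204 relative to GK104.
# Assumption: equal clocks and perfect scaling across SMs.
aggregate = (gm204_smms / gk104_sms) * SMM_RELATIVE_PERF

print(f"cores per SMM vs. Kepler SM: {core_ratio:.2f}x")
print(f"implied per-core gain:       {per_core_gain:.2f}x")
print(f"aggregate at equal clocks:   {aggregate:.2f}x")
```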

NVIDIA's Jonah Alben indicated to us that the 192-core SMs used in Kepler seemed great at the time but that they introduced quite a few inefficiencies due to a non-power-of-2 core count. It was more difficult for the scheduling hardware to keep the cores full and busy processing data, and the move to 128-core SMMs helps address this. Speaking of scheduling, there were efficiency changes there as well. Instruction scheduling now occurs much earlier in the pipeline, preventing re-scheduling in many instances, helping to keep power down and performance high.

Other than the dramatic changes to the SMM, the 2 MB L2 cache that NVIDIA has implemented on Maxwell is another substantial change. Considering that Kepler (in GK104) had a 512 KB L2 cache, we are seeing a 4x increase in available capacity, which should reduce the demand on the integrated memory controllers of GM204 dramatically.

Texture fill rate between the GK104 and GM204 increases by 12% (thanks to clock speeds, not texture unit counts) though pixel fill rate more than doubles, going from 32.2 Gpixels/s to 72 Gpixels/s on GTX 980.
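These fill-rate figures can be cross-checked against the unit counts given earlier. The sketch below assumes the usual peak-rate model of one pixel per ROP per clock and one texel per texture unit per clock (my assumption, not something stated in the text), backs out the clock each pixel fill rate implies, and then checks that the resulting texture fill rate gain lands near the quoted 12%. The 32-ROP figure for GK104 is inferred from the "doubling of ROP units" mentioned above.

```python
# Unit counts: 64 ROPs on GM204; GK104 inferred as half that.
ROPS = {"GK104": 32, "GM204": 64}
# Pixel fill rates quoted in the text, in Gpixels/s.
PIXEL_FILL_GPIX = {"GK104": 32.2, "GM204": 72.0}
# Texture unit count is the same on both chips.
TEXTURE_UNITS = 128


def implied_clock_ghz(chip):
    """Clock implied by the pixel fill rate, assuming 1 pixel/ROP/clock."""
    return PIXEL_FILL_GPIX[chip] / ROPS[chip]


clocks = {chip: implied_clock_ghz(chip) for chip in ROPS}

# Peak texture fill rate in Gtexels/s, assuming 1 texel/unit/clock.
texture_fill = {chip: TEXTURE_UNITS * clk for chip, clk in clocks.items()}

texture_gain = texture_fill["GM204"] / texture_fill["GK104"] - 1
pixel_gain = PIXEL_FILL_GPIX["GM204"] / PIXEL_FILL_GPIX["GK104"]

for chip in ROPS:
    print(f"{chip}: {clocks[chip]:.3f} GHz implied, "
          f"{texture_fill[chip]:.1f} Gtexels/s")
print(f"texture fill gain: {texture_gain:.1%}")
print(f"pixel fill gain:   {pixel_gain:.2f}x")
```

Under those assumptions the texture gain comes out just under 12% and the pixel gain a little over 2.2x, consistent with the clock speed alone explaining the texture figure.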

A 256-bit memory bus might seem like a downgrade for a flagship card as the GTX 780 Ti featured a 384-bit offering (though the GTX 680 featured a 256-bit controller as well). But there are several changes NVIDIA has made to improve memory performance with that smaller bus. First, the clock speed of the memory is now 7.0 GHz. Second, the GM204 cache is larger and more efficient, reducing the number of memory requests that have to be made into DRAM.

Another change is the implementation of a third-generation delta color compression algorithm that attempts to lower the bandwidth required for any single operation. The compression happens both when data is written out to memory and when it is read again for the application, attempting to get as high as 8:1 compression on blocks of matching color value (4×2 pixel regions). The delta color compression compares neighboring color blocks and attempts to minimize the number of bits stored by looking at color differences. Obviously, if the data is very random and cannot be compressed at all, then it will just be written to the memory in a 1:1 mode.
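To make the idea concrete, here is a toy delta encoder in Python. It is emphatically not NVIDIA's hardware algorithm, whose details are not given here; it only illustrates the three outcomes described above for a single 4×2 block: a uniform block that collapses to one stored color (8:1), a block with small color differences stored as an anchor plus short deltas, and a random block that falls back to 1:1. The RGBA8 pixel format, the 4-bit delta width, and all function names are assumptions for illustration, and the bit counts ignore any mode-tag overhead.

```python
BLOCK_PIXELS = 8      # one 4x2 pixel region
CHANNELS = 4          # assumed RGBA
BITS_PER_PIXEL = 32   # assumed 8 bits per channel
DELTA_BITS = 4        # assumed signed delta width per channel


def compress_block(pixels):
    """Return (mode, anchor, rest) for a list of 8 RGBA tuples."""
    if len(pixels) != BLOCK_PIXELS:
        raise ValueError("expected exactly 8 pixels")
    anchor = pixels[0]
    # Per-channel difference of every other pixel from the anchor pixel.
    deltas = [tuple(c - a for c, a in zip(p, anchor)) for p in pixels[1:]]
    if all(c == 0 for d in deltas for c in d):
        return ("uniform", anchor, [])
    lo, hi = -(1 << (DELTA_BITS - 1)), (1 << (DELTA_BITS - 1)) - 1
    if all(lo <= c <= hi for d in deltas for c in d):
        return ("delta", anchor, deltas)
    # Incompressible data is stored as-is (1:1 mode).
    return ("raw", anchor, list(pixels[1:]))


def decompress_block(block):
    """Rebuild the original 8 pixels from a compressed block."""
    mode, anchor, rest = block
    if mode == "uniform":
        return [anchor] * BLOCK_PIXELS
    if mode == "delta":
        return [anchor] + [tuple(a + c for a, c in zip(anchor, d)) for d in rest]
    return [anchor] + list(rest)


def compression_ratio(block):
    """Uncompressed bits divided by stored bits (mode tag not counted)."""
    mode = block[0]
    if mode == "uniform":
        stored = BITS_PER_PIXEL
    elif mode == "delta":
        stored = BITS_PER_PIXEL + (BLOCK_PIXELS - 1) * CHANNELS * DELTA_BITS
    else:
        stored = BLOCK_PIXELS * BITS_PER_PIXEL
    return BLOCK_PIXELS * BITS_PER_PIXEL / stored


flat = [(10, 20, 30, 255)] * 8
gradient = [(10 + i, 20, 30, 255) for i in range(8)]
noisy = [(i * 37 % 256, i * 91 % 256, i * 53 % 256, 255) for i in range(8)]

for name, px in [("flat", flat), ("gradient", gradient), ("noisy", noisy)]:
    block = compress_block(px)
    assert decompress_block(block) == px  # lossless round trip
    print(f"{name}: mode={block[0]}, ratio={compression_ratio(block):.2f}:1")
```

The flat block reaches the 8:1 best case, the gentle gradient lands somewhere in between, and the noisy block gains nothing, which is why the average savings depend heavily on the content being rendered.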

NVIDIA claims that Maxwell requires 25% less memory bandwidth on a per-frame basis when you combine its improved caching and compression techniques. As a result, even though the raw GB/s values of GM204 are only marginally higher than those of GK104, the effective memory bandwidth of the new GTX 980/970 cards will appear much better to developers and in games.
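To put rough numbers on that claim, the sketch below computes peak bandwidth from the 256-bit bus and the 7.0 GHz memory speed, treating that speed as an effective data rate of 7.0 Gbps per pin (my reading of the figure). It then asks how much raw bandwidth a design without the savings would need to do the same work if each frame really needs 25% less traffic. The function name and the "equivalent bandwidth" framing are my own, not NVIDIA's.

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s: bytes per transfer times transfer rate."""
    return bus_width_bits / 8 * data_rate_gbps


BUS_WIDTH_BITS = 256   # four 64-bit controllers
DATA_RATE_GBPS = 7.0   # assumed effective per-pin rate for "7.0 GHz" memory
TRAFFIC_SAVINGS = 0.25 # NVIDIA's claimed per-frame reduction

raw = peak_bandwidth_gbs(BUS_WIDTH_BITS, DATA_RATE_GBPS)

# If each frame moves only 75% as much data, the same bus behaves like a
# wider or faster one would without the caching and compression savings.
equivalent = raw / (1 - TRAFFIC_SAVINGS)

print(f"raw peak bandwidth:   {raw:.1f} GB/s")
print(f"equivalent bandwidth: {equivalent:.1f} GB/s")
```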

GM204 also includes changes not present in GM107 that improve the performance of certain features NVIDIA is bringing to market. Those help build the foundation for VXGI global illumination and more.
