A few secrets about GTX 970

NVIDIA has come forward with even more data explaining the memory issue affecting the GeForce GTX 970 cards.

UPDATE 1/28/15 @ 10:25am ET: NVIDIA has posted in its official GeForce.com forums that they are working on a driver update to help alleviate memory performance issues in the GTX 970 and that they will "help out" those users looking to get a refund or exchange.

Yes, that last 0.5GB of memory on your GeForce GTX 970 does run slower than the first 3.5GB. More interesting than that fact is the reason why it does, and why the result is better than you might have otherwise expected. Last night we got a chance to talk with NVIDIA’s Senior VP of GPU Engineering, Jonah Alben, about this specific concern and got a detailed explanation of why gamers are seeing what they are seeing, along with new disclosures on the architecture of the GM204 version of Maxwell.

NVIDIA's Jonah Alben, SVP of GPU Engineering

For those looking for a little background, you should read over my story from this weekend that looks at NVIDIA's first response to the claims that the GeForce GTX 970 cards currently selling were only properly utilizing 3.5GB of the 4GB frame buffer. While it definitely helped answer some questions, it raised plenty more, which is why we requested a talk with Alben, even on a Sunday.

Let’s start with a new diagram drawn by Alben specifically for this discussion.

GTX 970 Memory System

Believe it or not, every question raised in any forum about the GTX 970 memory issue can be explained by this diagram. Along the top you will see 13 enabled SMMs, each with 128 CUDA cores for a total of 1664 as expected. (Three grayed out SMMs represent those disabled from a full GM204 / GTX 980.) The most important part here is the memory system though, connected to the SMMs through a crossbar interface. That interface has 8 total ports to connect to collections of L2 cache and memory controllers, all of which are utilized in a GTX 980. With a GTX 970 though, only 7 of those ports are enabled; the disabled port takes one of the combined L2 cache / ROP units with it. However, the associated 32-bit memory controller segment remains.

You should take two things away from that simple description. First, despite initial reviews and information from NVIDIA, the GTX 970 actually has fewer ROPs and less L2 cache than the GTX 980. NVIDIA says this was an error in the reviewer’s guide and a misunderstanding between the engineering team and the technical PR team on how the architecture itself functioned. That means the GTX 970 has 56 ROPs and 1792 KB of L2 cache compared to 64 ROPs and 2048 KB of L2 cache for the GTX 980. Before people complain about the ROP count difference as a performance bottleneck, keep in mind that the 13 SMMs in the GTX 970 can only output 52 pixels/clock and the seven segments of 8 ROPs each (56 total) can handle 56 pixels/clock. The SMMs are the bottleneck, not the ROPs.
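
To put numbers on that bottleneck claim, here is the arithmetic as a short Python sketch. The 4 pixels/clock per SMM figure is implied by NVIDIA's quoted 52 pixels/clock total rather than stated outright, so treat it as an assumption.

```python
# Pixel throughput sanity check for the GTX 970, using the figures quoted above.
SMM_COUNT = 13
PIXELS_PER_SMM_PER_CLOCK = 4     # assumption: 52 / 13 from NVIDIA's quoted total
ROP_COUNT = 56                   # 7 partitions x 8 ROPs, one pixel per clock each

smm_limit = SMM_COUNT * PIXELS_PER_SMM_PER_CLOCK   # 52 pixels/clock
rop_limit = ROP_COUNT                              # 56 pixels/clock

print(f"SMM front-end limit: {smm_limit} pixels/clock")
print(f"ROP back-end limit:  {rop_limit} pixels/clock")
print("Bottleneck:", "SMMs" if smm_limit < rop_limit else "ROPs")
```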

                    GeForce GTX 980    GeForce GTX 970 (Corrected)
GPU Code name       GM204              GM204
GPU Cores           2048               1664
Rated Base Clock    1126 MHz           1050 MHz
Texture Units       128                104
ROP Units           64                 56
L2 Cache            2048 KB            1792 KB
Memory              4GB                4GB
Memory Clock        7000 MHz           7000 MHz
Memory Interface    256-bit            256-bit
Memory Bandwidth    224 GB/s           224 GB/s*
TDP                 165 watts          145 watts
Peak Compute        4.61 TFLOPS        3.49 TFLOPS
MSRP                $549               $329

*To those wondering how peak bandwidth would remain at 224 GB/s despite the division of memory controllers on the GTX 970, Alben stated that it can reach that speed only when memory is being accessed in both pools.
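
For those who want to see the math behind that asterisk, a quick back-of-the-envelope check using standard GDDR5 numbers (7 Gbps per pin across each 32-bit controller); this is my arithmetic, not a figure NVIDIA provided directly:

```python
# Rough bandwidth math for the GTX 970's split memory system (GDDR5 at 7 Gbps per pin).
GBPS_PER_PIN = 7                    # from the 7000 MHz effective memory clock
BITS_PER_CONTROLLER = 32

per_controller = BITS_PER_CONTROLLER * GBPS_PER_PIN / 8   # 28 GB/s per 32-bit segment
fast_pool = 7 * per_controller      # 3.5GB pool: 196 GB/s on its own
slow_pool = 1 * per_controller      # 0.5GB pool:  28 GB/s on its own
combined = fast_pool + slow_pool    # 224 GB/s, only when both pools are accessed at once

print(per_controller, fast_pool, slow_pool, combined)
```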

Second, it turns out the disabled SMMs have nothing to do with the performance issues experienced or the memory system complications.

Full GM204 Block Diagram

In a GTX 980, each block of L2 / ROPs directly communicates through a 32-bit portion of the GM204 memory interface and then to a 512MB section of on-board memory. When designing the GTX 970, NVIDIA used a new capability of Maxwell to implement the system in an improved fashion that would not have been possible with Kepler or previous architectures. Maxwell’s configurability allowed NVIDIA to disable a portion of the L2 cache and ROP units while using a “buddy interface” to continue to light up and use all of the memory controller segments. Now, the SMMs use a single L2 interface to communicate with both banks of DRAM (on the far right), which does create a new concern.

A quick note about the GTX 980 here: it uses a 1KB memory access stride to walk across the memory bus from left to right, able to hit all 4GB in this capacity. But the GTX 970 and its altered design has to do things differently. If you walked across the memory interface in the exact same way, over the same 4GB capacity, the 7th crossbar port would tend to always get twice as many requests as the other ports (because it has two memories attached). In the short term that could be OK due to queuing in the memory path. But in the long term, if the 7th port is fully busy and is getting twice as many requests as the other ports, then the other six must be only half busy to match the 2:1 ratio. So the overall bandwidth would be roughly half of peak. This would cause dramatic underutilization and would prevent optimal performance and efficiency for the GPU.
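
To make Alben's reasoning concrete, here is a tiny sketch of that imbalance; the request-distribution model is my own simplification of his explanation, not something NVIDIA published:

```python
# Naive 1KB-stride interleave over all 4GB on the GTX 970's 7 crossbar ports:
# six ports front one DRAM each, the 7th fronts two, so it sees twice the requests.
drams_per_port = [1, 1, 1, 1, 1, 1, 2]

# Scale traffic so the busiest port (the 7th) is fully occupied...
busiest = max(drams_per_port)
utilization = [d / busiest for d in drams_per_port]   # six ports at 50%, one at 100%

# ...then the crossbar as a whole runs at the average utilization of its ports.
fraction_of_peak = sum(utilization) / len(utilization)
print(f"Effective bandwidth: {fraction_of_peak:.0%} of peak")   # ~57%, "roughly half"
```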

To avert this, NVIDIA divided the memory into two pools, a 3.5GB pool which maps to seven of the DRAMs and a 0.5GB pool which maps to the eighth DRAM. The larger, primary pool is given priority and is then accessed in the expected 1-2-3-4-5-6-7-1-2-3-4-5-6-7 pattern, with equal request rates on each crossbar port, so bandwidth is balanced and can be maximized. And since the vast majority of gaming situations occur well under the 3.5GB memory size, this determination makes perfect sense. It is those instances where memory above 3.5GB needs to be accessed where things get more interesting.
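
A minimal sketch of how that address-to-port mapping could look. NVIDIA only described the 1-through-7 striping pattern; the 1KB stride below is borrowed from the GTX 980 description and the exact mapping is an assumption on my part.

```python
# Simplified two-pool address mapping for the GTX 970 (stride size assumed).
KB = 1024
GB = 1024 ** 3
FAST_POOL_BYTES = int(3.5 * GB)   # striped across crossbar ports 0 through 6
STRIDE = 1 * KB                   # assumed; NVIDIA quotes 1KB for the GTX 980

def port_for_address(addr):
    """Map a byte address to a crossbar port under the two-pool scheme."""
    if addr < FAST_POOL_BYTES:
        return (addr // STRIDE) % 7   # even 1-2-3-4-5-6-7 striping, balanced bandwidth
    return 6                          # the upper 0.5GB lives entirely behind the 7th port

print([port_for_address(i * STRIDE) for i in range(9)])   # 0..6, then wraps back to 0, 1
print(port_for_address(int(3.75 * GB)))                   # 6: slow pool traffic
```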

Let's be blunt here: access to the 0.5GB of memory, on its own and in a vacuum, would occur at 1/7th of the speed of the 3.5GB pool of memory. If you look at the Nai benchmarks floating around, this is what you are seeing.

Check the result on the left: 22.35 GB/s is almost exactly 1/7th of 150 GB/s
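
For reference, the arithmetic behind that caption (my own quick check, not NVIDIA's figure):

```python
# The 0.5GB pool sits behind 1 of the 7 crossbar ports, so on its own it sees ~1/7th of peak.
print(150 / 7)   # ~21.4 GB/s, in line with the ~22.35 GB/s the Nai benchmark measures for that pool
```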

But the net result for gaming scenarios is much less dramatic than that, so why is that the case? It comes down to the way that memory is allocated by the operating system for applications and games. As memory is requested by a game, the operating system will allocate portions for it depending on many factors, including the exact data space that the game asked for, what the OS has available, and what its allocation heuristics deem appropriate at the time. Not all memory is accessed in the same way, even for PC games.

UPDATE 1/27/15 @ 5:36pm ET: I wanted to clarify a point on the GTX 970's ability to access both the 3.5GB and 0.5GB pools of data at the same time. Despite some other outlets reporting that the GPU cannot do that, Alben confirmed to me that because the L2 has multiple request busses, the 7th L2 can indeed access both memories that are attached to it at the same time.

If a game has allocated 3GB of graphics memory it might be using only 500MB on a regular basis, with much of the rest only there for periodic, on-demand use. Things like compressed textures that are not as time sensitive as other material require much less bandwidth and can be moved around to other memory locations with less performance penalty. Not all allocated graphics memory is the same, and inevitably there are large sections of this storage that are reserved but rarely used at any given point in time.

All gaming systems today already have multiple pools of graphics memory – what exists on the GPU and what the system memory has to offer via the PCI Express bus. With the GTX 970 and its 3.5GB/0.5GB division, the OS now has three pools of memory to access and to utilize. Yes, the 0.5GB of memory in the second pool on the GTX 970 cards is slower than the 3.5GB of memory but it is at least 4x as fast as the memory speed available through PCI Express and system memory. The goal for NVIDIA then is that the operating system would utilize the 3.5GB of memory capacity first, then access the 0.5GB and then finally move to the system memory if necessary.
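
A hedged sketch of that three-tier priority order follows. The pool sizes are real; the speed notes are just the ratios quoted in this article, and the actual placement policy belongs to the OS and driver, not to toy code like this.

```python
# Three tiers of graphics memory on a GTX 970, in the order NVIDIA wants them filled.
TIERS = [
    ("3.5GB fast pool", 3.5 * 1024, "full speed"),
    ("0.5GB slow pool", 0.5 * 1024, "slower, but still at least 4x PCIe"),
    ("system RAM over PCI Express", float("inf"), "slowest"),
]

def allocate(request_mb, used_mb=None):
    """Greedy fill: prefer the fast pool, spill to the slow pool, then to system RAM."""
    used_mb = used_mb or [0.0, 0.0, 0.0]
    placement = []
    for i, (name, capacity_mb, _notes) in enumerate(TIERS):
        take = min(request_mb, capacity_mb - used_mb[i])
        if take > 0:
            used_mb[i] += take
            placement.append((name, take))
            request_mb -= take
        if request_mb <= 0:
            break
    return placement

print(allocate(3800))   # 3584MB lands in the fast pool, the remaining 216MB spills to the slow pool
```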

The question then is, what is the real-world performance penalty of the GTX 970’s dual memory pool configuration? Though Alben didn’t have a specific number he wanted to discuss, he encouraged us to continue doing our own testing to find cases where games request less than 3.5GB of memory and then between 3.5GB and 4.0GB. By comparing the results on the GTX 980 and the GTX 970 in these specific scenarios you should be able to gauge the impact that the slower pool of memory has on the total memory configuration and gaming experience. The problem and risk is that this performance difference essentially depends on the heuristics of the OS and its ability to balance the pools effectively, putting data that needs to be used less frequently or in a less latency-dependent fashion in the 0.5GB portion.
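
For anyone attempting that comparison, the arithmetic is straightforward: measure both cards below and above the 3.5GB boundary and see whether the GTX 970 loses more ground than the GTX 980 does. The frame rates in this sketch are made-up placeholders purely to show the calculation.

```python
# Compare how much each card slows down when a game crosses the 3.5GB boundary.
def scaling_penalty(fps_under_3_5gb, fps_over_3_5gb):
    """Fractional performance drop when moving above the 3.5GB boundary."""
    return 1.0 - (fps_over_3_5gb / fps_under_3_5gb)

gtx980_drop = scaling_penalty(fps_under_3_5gb=60.0, fps_over_3_5gb=45.0)   # hypothetical numbers
gtx970_drop = scaling_penalty(fps_under_3_5gb=52.0, fps_over_3_5gb=37.0)   # hypothetical numbers

# If the 970 drops noticeably more than the 980 under the same settings change,
# the extra gap hints at the dual-pool penalty rather than ordinary GPU scaling.
print(f"GTX 980: {gtx980_drop:.1%}  GTX 970: {gtx970_drop:.1%}  extra: {gtx970_drop - gtx980_drop:.1%}")
```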

NVIDIA’s performance labs continue to work away at finding examples of this occurring and the consensus seems to be something in the 4-6% range. A GTX 970 without this memory pool division would run 4-6% faster than the GTX 970s selling today in high memory utilization scenarios. Obviously this is something we can’t accurately test though – we don’t have the ability to run a GTX 970 without a disabled L2/ROP cluster like NVIDIA can. All we can do is compare the difference in performance between a reference GTX 980 and a reference GTX 970 and measure the differences as best we can, and that is our goal for this week.

Accessing that 500MB of memory on its own is slower. Accessing that 500MB as part of the 4GB total slows things down by 4-6%, at least according to NVIDIA. So now the difficult question: did NVIDIA lie to us?

At the very least, the company did not fully disclose the missing L2 and ROP partition on the GTX 970, even if it was due to miscommunication internally. The question “should the GTX 970 be called a 3.5GB card?” is more of a philosophical debate. There is 4GB of physical memory on the card and you can definitely access all 4GB of it when the game and operating system determine it is necessary. But 1/8th of that memory can only be accessed in a slower manner than the other 7/8ths, even if that 1/8th is 4x faster than system memory over PCI Express. NVIDIA claims that the architecture is working exactly as intended and that with competent OS heuristics the performance difference should be negligible in real-world gaming scenarios.

The configurability of the Maxwell architecture allowed NVIDIA to make this choice. Had the GeForce GTX 970 been built on the Kepler architecture, the company would have had to disable the entire L2/MC block on the right hand side, resulting in a 192-bit memory bus and a 3GB frame buffer. GM204 allows NVIDIA to expand that to a 256-bit 3.5GB/0.5GB memory configuration and offers performance advantages, obviously.

As an alternative to calling this a 4GB card, NVIDIA might have branded it as 3.5GB with the addition of 500MB of “cache” or “buffer” – something that designates its different implementation and slower performance, but also its advantage over not having that memory at all.

Let’s be clear – the performance of the GTX 970 is what the performance is. This information is incredibly interesting and warrants some debate, but at the end of the day, my recommendations for the GTX 970 really won’t change at all. It still offers incredible performance for your dollar and is able to run at 4K in my experience and testing. Yes, there might in fact be specific instances where performance drops are more severe because of this memory hierarchy design, but I don’t think it changes the outlook for the card as a whole.

Some other trailing notes. There should be no difference in performance or memory configuration results from one implementation of the GTX 970 to another. If your GTX 970 exhibits an issue (or does not), then your friend's card, and their friend's, should behave the same way. The details about the memory issue also show us that a pending GeForce GTX 960 Ti, if it exists, will not necessarily have this complication. Imagine a GM204 GPU with a 192-bit memory bus, 3GB of GDDR5 and fewer enabled SMMs and you likely have a product you’ll see in 2015. (Interestingly, you have basically just described the GTX 970M mobile variant.)

This is not the first time that NVIDIA has used interesting memory techniques to adjust the performance characteristics of a card. The GTX 550 Ti and the GTX 660 Ti both used unbalanced memory configurations, allowing a GPU with a 192-bit memory bus to access 2GB. This also required some specific balancing on NVIDIA's side to make sure that the 64-bit portion of that GPU's memory controller with double the memory of the other two didn't weigh memory throughput down in the 1.5 GB to 2.0 GB range. NVIDIA succeeded there, and the GTX 660 Ti was one of the company's most successful products of the generation.
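
For the curious, the GTX 660 Ti arithmetic works out like this; the per-controller split is the commonly reported configuration and is stated here as an assumption rather than something Alben confirmed:

```python
# GTX 660 Ti: a 192-bit bus is three 64-bit controllers, but the card carries 2GB total.
# Commonly reported split (assumption): two controllers with 512MB each, one with 1GB.
controllers_mb = [512, 512, 1024]

balanced_mb = 3 * min(controllers_mb)             # 1536MB interleaved across all three
leftover_mb = sum(controllers_mb) - balanced_mb   # 512MB sitting behind a single controller

print(balanced_mb, leftover_mb)   # 1536 and 512, matching the 1.5GB / 0.5GB split discussed below
```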

UPDATE 1/27/15 @ 8:50pm ET: I also got some more clarification on the relationship between the GTX 660 Ti and the GTX 970 memory implementation. As it turns out, the GTX 660 Ti with the unbalanced memory system (one memory controller having access to more DRAM) also reported separate pools of memory, one at 1.5GB and one at 0.5GB, to the operating system. The difference is that the performance penalty between them was not nearly as severe as the delta we are seeing here on the GTX 970. Still, Alben claims that the software tricks the company learned then were directly applicable to the GTX 970 and thus the integration should be improved over what we saw those years ago.

It would be interesting to see whether future architectures that implement this kind of design use the driver to better handle the heuristics of memory allocation. Surely NVIDIA’s driver knows better than Windows which assets can be placed in the slower pool of memory without affecting gaming performance. I would imagine that this configurable architecture design will continue into the future and it’s possible it could be improved enough to allow NVIDIA to expand the pool sizes, improving efficiency even more without affecting performance.

For users that are attempting to measure the impact of this issue, be aware that in some cases the software you are using to report in-use graphics memory could be wrong. Some applications are only aware of the first "pool" of memory and may only ever show up to 3.5GB in use for a game. Other applications, including MSI Afterburner as an example, do properly report total memory usage of up to 4GB. Because of the unique allocation of memory in the system, the OS, the driver and the monitoring application may not always be on the same page. Many users, like bootski over at NeoGAF, have done a good job of compiling examples where the memory issue occurs, so look around for the right tools to use to test your own GTX 970. (Side note: we are going to try to do some of our own testing this afternoon.)
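
One easy cross-check, if you want a second opinion alongside an overlay tool, is to ask the driver itself. A minimal sketch using the pynvml bindings, assuming they are installed and a supported NVIDIA driver is present:

```python
# Query the driver's own view of GPU memory usage via NVML (pynvml bindings).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)    # first GPU in the system
info = pynvml.nvmlDeviceGetMemoryInfo(handle)

print(f"total: {info.total / 1024**2:.0f} MB")
print(f"used:  {info.used / 1024**2:.0f} MB")    # compare this against what your overlay reports
print(f"free:  {info.free / 1024**2:.0f} MB")

pynvml.nvmlShutdown()
```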

NVIDIA has come clean; all that remains is the response from consumers to take hold. For those of you that read this and remain affronted by NVIDIA calling the GeForce GTX 970 a 4GB card without equivocation: I get it. But I also respectfully disagree. Should NVIDIA have been more upfront about the changes this GPU brought compared to the GTX 980? Absolutely and emphatically. But does this change the stance or position of the GTX 970 in the world of discrete PC graphics? I don’t think it does.

Leave me your thoughts in the comments below.