NVIDIA Fermi Next Generation GPU Architecture Overview

Manufacturer: NVIDIA

GT300? Yes please

Today we are able to reveal some of the more interesting features of NVIDIA’s next generation GPU architecture, known internally as “Fermi”.  NVIDIA refers to Fermi as the “most significant leap forward in GPU architecture since the original G80” and, after reading through the documentation, it is hard to argue against their case.  The GT200 architecture that powers today's GTX 285 and GTX 295 was a big improvement over G80, though it was fundamentally based on the same design principles.

The massive 3.0 billion transistor, 512 SP Fermi core.  I think I can see my house from here.

NVIDIA Fermi takes GPU computing another step forward, and that is clearly the primary goal of the new architecture.  We will see that NVIDIA has focused on double precision floating point performance, memory features such as ECC support and caches, and context switching between GPU applications, all aimed directly at its CUDA architecture and at what NVIDIA believes is the future of parallel computing.

The Fermi Architecture

At a high level, the new Fermi architecture was designed to map directly to NVIDIA’s interpretation of CUDA computing going forward.  In this program execution model there are threads, thread blocks, and grids of thread blocks, which differ from one another in how they access memory and how they execute kernels.

A thread block is a group of threads that can cooperate with each other and communicate via the per-block shared memory.  Each SM can keep as many as 1536 threads resident at once (a single block is capped at 1024 threads), and each thread has its own private memory, program counter, registers, etc.  A grid is then an array of thread blocks that all run the same kernel; the blocks read their inputs from and write their results to global memory, and those results are only guaranteed to be visible to dependent kernels after kernel-wide global synchronization.
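To make the thread / block / grid relationship a little more concrete, here is a minimal sketch in plain Python rather than real CUDA.  It runs sequentially, so it only illustrates how a thread finds its element from its block and thread indices; the `launch` helper, the `shared` dictionary and the kernel signature are all our own illustrative assumptions, not NVIDIA APIs.

```python
# Toy, sequential model of the thread / thread block / grid hierarchy.
# Every name here is hypothetical and exists only for illustration.

def launch(kernel, grid_dim, block_dim, *args):
    """Call `kernel` once per thread, one block after another."""
    for block_idx in range(grid_dim):
        # Each block gets a fresh scratch area standing in for per-block
        # shared memory; threads of other blocks never see it.
        shared = {}
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, shared, *args)

def scale_kernel(block_idx, block_dim, thread_idx, shared, data, out, factor):
    # Global index: which element of the whole grid this thread owns.
    i = block_idx * block_dim + thread_idx
    if i < len(data):  # the last block may have threads with nothing to do
        out[i] = data[i] * factor

data = list(range(10))
out = [0] * len(data)
block_dim = 4
# Ceiling division: enough blocks to cover every element.
grid_dim = (len(data) + block_dim - 1) // block_dim

launch(scale_kernel, grid_dim, block_dim, data, out, 2)
print(out)
```

With ten elements and four threads per block the grid needs three blocks, and the bounds check keeps the two spare threads in the last block from writing past the end of the data.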

These software levels map onto NVIDIA hardware as the GPU, the streaming multiprocessors (SMs) and the CUDA cores.  The GPU itself operates on the grids of thread blocks, each SM executes one or more thread blocks, and the individual CUDA cores (as NVIDIA is calling them now) execute the threads.  The SMs execute threads in groups of 32 called “warps”, which helps to improve the efficiency of the GPU.
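As a rough illustration of how a block breaks down into warps, the sketch below assumes threads are assigned to warps in consecutive index order, 32 at a time.  The warp size comes from the text above; the function names are ours.

```python
WARP_SIZE = 32  # threads per warp, as stated in the article

def warps_per_block(threads_per_block):
    """Number of warps needed to cover a block (ceiling division)."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE

def warp_and_lane(thread_idx):
    """Split a thread index into (warp number, position within the warp).

    Assumes consecutive thread indices are packed into the same warp.
    """
    return divmod(thread_idx, WARP_SIZE)

print(warps_per_block(256))  # a 256-thread block fills 8 warps exactly
print(warps_per_block(100))  # 100 threads still occupy 4 warps
print(warp_and_lane(70))     # thread 70 sits in warp 2, lane 6
```

Note that a block size that is not a multiple of 32 leaves part of its last warp idle, which is one reason block sizes are usually picked as multiples of the warp size.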

The first implementation of this architecture, which we are tentatively calling GT300, will have some impressive raw specifications.  The GPU is made up of 3.0 billion transistors and features 512 CUDA processing cores organized into 16 streaming multiprocessors of 32 cores each.  The memory architecture is built around a new GDDR5 implementation and has six 64-bit channels for a total memory bus width of 384 bits.  The memory system can technically support up to 6GB of memory as well, something that is key for HPC applications.
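The headline numbers above are simple products, and a quick check shows they are consistent with each other.  The variable names are ours; the values are the ones quoted in this article.

```python
# Figures quoted in the article for the first Fermi GPU.
sm_count = 16            # streaming multiprocessors
cores_per_sm = 32        # CUDA cores in each SM
memory_channels = 6      # GDDR5 channels
channel_width_bits = 64  # width of each channel

total_cores = sm_count * cores_per_sm
bus_width_bits = memory_channels * channel_width_bits

print(total_cores)     # total CUDA cores on the chip
print(bus_width_bits)  # total memory bus width in bits
```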

Each SM includes 32 CUDA processing cores (4x the previous GT200 design) but also introduces other new features to help improve performance.  Each core includes a fully pipelined integer unit and floating point unit, and the floating point unit implements the newer IEEE 754-2008 standard, another important move for GPU computing.  The new Evergreen core from AMD also implements this standard, which adds support for the fused multiply-add instruction.
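Why does fused multiply-add matter?  A fused operation computes a*b + c with a single rounding at the end, while a separate multiply and add rounds twice and can lose the low bits of the product.  The sketch below shows the difference in double precision on the CPU; it emulates the fused result with exact rational arithmetic from Python's standard `fractions` module, so it is an illustration of the rounding behavior only and says nothing about how the hardware implements it.

```python
from fractions import Fraction

def fused_multiply_add(a, b, c):
    """Emulate FMA: compute a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

def multiply_then_add(a, b, c):
    """Unfused version: the product is rounded before the add happens."""
    product = a * b      # first rounding
    return product + c   # second rounding

# Inputs chosen so the exact product, 1 - 2**-60, does not fit in a double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(multiply_then_add(a, b, c))   # product rounds to 1.0, so this is 0.0
print(fused_multiply_add(a, b, c))  # keeps the tiny remainder, -2**-60
```

The unfused path reports zero while the fused path preserves the small but real remainder, which is the kind of accuracy gain that matters in long chains of multiply-adds.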

Also included in each SM are 16 load and store units and 4 special function units to handle calculations like sine and cosine.

NVIDIA is claiming that the double precision performance of the Fermi architecture will be greatly improved over the existing GT200 design.

With NVIDIA claiming double precision performance 4.25x faster than GT200, that puts the GT300 at about 330 GFLOPS of double precision performance (based on the 78 GFLOPS the GT200 rests at).  (UPDATE: During Jen-Hsun's keynote at the NVIDIA GPU Tech Conference, NVIDIA stated the peak DP performance increase was "8x".  If that's the case, GT300 could reach as high as 624 GFLOPS.  We will find out the final answer soon.)  While that is definitely an impressive improvement, AMD’s new Evergreen family reaches a theoretical peak of 544 GFLOPS of double precision performance, so we need to keep an eye on these numbers as we see actual hardware from NVIDIA hit the streets.
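For anyone who wants to check the math, the projections above are just the GT200 baseline multiplied by each claimed speedup.  The sketch below uses only the figures quoted in this article; the break-even line is our own back-of-the-envelope addition, and none of these are measured results.

```python
GT200_DP_GFLOPS = 78.0       # GT200 double precision peak, per the article
EVERGREEN_DP_GFLOPS = 544.0  # AMD Evergreen theoretical peak, per the article

def projected_dp_gflops(speedup):
    """Projected Fermi DP peak for a claimed speedup over GT200."""
    return GT200_DP_GFLOPS * speedup

for label, speedup in (("4.25x", 4.25), ("8x", 8.0)):
    gflops = projected_dp_gflops(speedup)
    verdict = "ahead of" if gflops > EVERGREEN_DP_GFLOPS else "behind"
    print(f"{label}: {gflops:.1f} GFLOPS, {verdict} Evergreen")

# Speedup over GT200 at which Fermi would match Evergreen's quoted peak.
break_even = EVERGREEN_DP_GFLOPS / GT200_DP_GFLOPS
print(f"break-even speedup: {break_even:.2f}x")
```

In other words, the 4.25x claim would leave Fermi well behind Evergreen's theoretical peak, while the 8x claim would put it ahead, with the crossover at roughly a 7x improvement over GT200.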