NVIDIA GT200: Moving Away from Just a GPU
GPU Computing and the GTX 280
While the GTX 280 is a wonderful card for games, NVIDIA is doing its best to sell its products in markets that are not quite so niche as high end PC gaming. As we all know, each GPU has a tremendous ability to do math. Theoretically this math throughput can address other markets, especially that of High Performance Computing. To that end NVIDIA has made quite a few modifications to its architecture to make the GTX 280 a much more flexible and adept part when it comes to high performance stream computing.
While not exactly a Dr. Jekyll and Mr. Hyde, the GTX 280 undergoes a fairly radical transformation into a stream processing behemoth.
The biggest change from the G8x architecture to the new GT200 is the jump from 128 SPs to 240 SPs. While not exactly a doubling, these stream processors have been upgraded and optimized so that they offer up to 1.5X the performance of the previous stream processors in certain situations. NVIDIA has also reorganized these units somewhat compared to the previous generation. NVIDIA’s white paper describes the units as follows:
“Special function units (SFUs) in the SMs compute transcendental math, attribute interpolation (interpreting pixel attributes from a primitive’s vertex attributes), and perform floating-point MUL instructions. The individual streaming processing cores of GeForce GTX 200 GPUs can now perform near full-speed dual-issue of multiply-add operations (MADs) and MULs (3 flops/SP) by using the SP’s MAD unit to perform a MUL and ADD per clock, and using the SFU to perform another MUL in the same clock. Optimized and directed tests can measure around 93-94% efficiency.
The entire GeForce GTX 200 GPU SPA delivers nearly one teraflop of peak, single-precision, IEEE 754, floating-point performance.”
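As a rough check on the "nearly one teraflop" figure quoted above, peak throughput is simply SP count times flops per SP per clock times shader clock. The sketch below assumes a 1,296 MHz shader clock, which is the GTX 280's published reference clock rather than a number given in this article, and the function and constant names are purely illustrative.

```python
# Back-of-the-envelope peak single-precision throughput.
# ASSUMPTION: 1296 MHz is the GTX 280 reference shader clock;
# it is not stated in the article itself.
SHADER_CLOCK_MHZ = 1296
NUM_SPS = 240  # SP count quoted in the article


def peak_gflops(num_sps: int, flops_per_sp_per_clock: int, clock_mhz: int) -> float:
    """Peak GFLOPS = SPs x flops per SP per clock x clock (MHz) / 1000."""
    return num_sps * flops_per_sp_per_clock * clock_mhz / 1000


# MAD (2 flops) plus a co-issued MUL on the SFU (1 flop) = 3 flops/SP/clock.
dual_issue_peak = peak_gflops(NUM_SPS, 3, SHADER_CLOCK_MHZ)
# MAD only, for code where the extra MUL cannot be issued.
mad_only_peak = peak_gflops(NUM_SPS, 2, SHADER_CLOCK_MHZ)

print(f"MAD + MUL peak: {dual_issue_peak:.1f} GFLOPS")
print(f"MAD only peak:  {mad_only_peak:.1f} GFLOPS")
```

Under that clock assumption the dual-issue case lands a little over 930 GFLOPS, which squares with the white paper's "nearly one teraflop" wording.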
Each GT200 is composed of 10 Thread Processing Clusters (oddly enough called Texture Processing Clusters when in GPU mode). In each TPC there are 3 Streaming Multiprocessors (SMs). In each SM there are 8 stream processors. Multiply those together (10 x 3 x 8) and you get the magical 240 number. The optimizations do not end there. Each TPC also has a dedicated “L1” cache that lets its SMs communicate and share data with each other without going to main memory.
NVIDIA built a lot of local memory into the new TPC units. Each SM has a small amount of local memory so that its SPs can communicate and share data effectively, and all three SMs in a TPC are attached to the local L1 cache to further improve efficiency.
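The cluster arithmetic above can be sketched in a few lines. The GT200 figures come straight from the paragraph; the G80 layout (8 clusters of 2 SMs each) is an assumption based on NVIDIA's published G80 description and is included only to show where the older 128 SP count comes from. The class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipLayout:
    """Hierarchy of a chip: clusters (TPCs) -> SMs -> stream processors."""
    tpcs: int
    sms_per_tpc: int
    sps_per_sm: int

    @property
    def total_sms(self) -> int:
        return self.tpcs * self.sms_per_tpc

    @property
    def total_sps(self) -> int:
        return self.total_sms * self.sps_per_sm


# Figures from the article: 10 TPCs, 3 SMs per TPC, 8 SPs per SM.
gt200 = ChipLayout(tpcs=10, sms_per_tpc=3, sps_per_sm=8)
# ASSUMPTION (not from the article): G80 as 8 TPCs x 2 SMs x 8 SPs.
g80 = ChipLayout(tpcs=8, sms_per_tpc=2, sps_per_sm=8)

print(f"GT200: {gt200.total_sms} SMs, {gt200.total_sps} SPs")
print(f"G80:   {g80.total_sms} SMs, {g80.total_sps} SPs")
```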
NVIDIA also did a lot of work on the thread scheduler, and the GT200 chip is able to support more than 30,000 threads in flight. This is approximately 2.5X as many threads as the previous generation of GPUs from NVIDIA. The texture units do not go to waste in computing scenarios either, as they can be used to “filter” data. One example is using the filtering and address units to work on pictures that are being manipulated (for example in Photoshop). These units can potentially perform other functions as well.
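The thread figures above fall out of a simple product of SM count and resident threads per SM. The per-SM limits used here (1,024 for GT200, 768 for G80) and the G80 SM count of 16 are assumptions taken from NVIDIA's published specifications, not from this article; the helper name is illustrative.

```python
def threads_in_flight(num_sms: int, max_threads_per_sm: int) -> int:
    """Maximum resident threads across the whole chip."""
    return num_sms * max_threads_per_sm


# GT200: 30 SMs (10 TPCs x 3 SMs).
# ASSUMPTION: 1024 resident threads per SM.
gt200_threads = threads_in_flight(num_sms=30, max_threads_per_sm=1024)
# ASSUMPTION: G80 has 16 SMs with 768 resident threads each.
g80_threads = threads_in_flight(num_sms=16, max_threads_per_sm=768)

ratio = gt200_threads / g80_threads

print(f"GT200 threads in flight: {gt200_threads}")
print(f"G80 threads in flight:   {g80_threads}")
print(f"Ratio: {ratio:.1f}x")
```

With those assumed limits the totals come to 30,720 and 12,288 threads, which matches both the "more than 30,000" figure and the 2.5X ratio.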
Finally there is an L2 type cache that allows the different TPCs to communicate with each other as well. NVIDIA has essentially doubled the cache and register file sizes to improve the efficiency of the architecture. If you remember, most CPUs are made up of about 80% cache and 20% logic. The primary reason for all this cache is to keep the computing pipeline(s) full of data and instructions. If a CPU or GPU has to go to main memory all the time for this data, then performance decreases dramatically, as the latency between requesting the data and actually getting it kills the efficiency of the processor. If most of the data is right at hand, then the units can be kept busy for a large majority of the time, thereby increasing overall performance.
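To make the latency argument concrete, here is a minimal sketch of the standard average-access-time formula. The cycle counts are made-up round numbers chosen purely for illustration; they are not measured GT200 latencies, and the function name is hypothetical.

```python
def average_access_cycles(hit_rate: float, cache_cycles: float, memory_cycles: float) -> float:
    """Expected cycles per access: hits are served from cache, misses from main memory."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_cycles + (1.0 - hit_rate) * memory_cycles


# ILLUSTRATIVE numbers only, not measurements of any real chip.
CACHE_CYCLES = 20
MEMORY_CYCLES = 500

for hit_rate in (0.0, 0.5, 0.9, 0.99):
    cycles = average_access_cycles(hit_rate, CACHE_CYCLES, MEMORY_CYCLES)
    print(f"hit rate {hit_rate:4.0%}: about {cycles:.0f} cycles per access")
```

Whatever the real latencies are, the shape of the result is the same: the more requests the on-chip caches can satisfy, the closer the average cost gets to the cache latency rather than the main memory latency.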
The main memory controller has also been improved and expanded. The old G80 chip featured a 384-bit path to main memory, and the new GT200 takes that to a full 512 bits. While AMD/ATI introduced the first consumer-level board with a 512-bit memory path, the chip seemingly was not able to fully utilize all that bandwidth. With the GT200, that is not necessarily the case. In heavy stream usage, main memory bandwidth is likely to be saturated.
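The effect of the wider bus is easy to estimate: peak bandwidth is bus width in bytes times the effective transfer rate. The bus widths come from the paragraph above; the memory data rates (2,214 MT/s for the GTX 280 and 1,800 MT/s for the G80-based 8800 GTX) are assumptions drawn from the cards' published reference clocks, not from this article, and the function name is illustrative.

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, data_rate_mt_per_s: int) -> float:
    """Peak bandwidth in GB/s = (bus width in bytes) x (mega-transfers per second) / 1000."""
    return bus_width_bits / 8 * data_rate_mt_per_s / 1000


# ASSUMPTION: GTX 280 GDDR3 at 1107 MHz, double data rate -> 2214 MT/s.
gtx280_bw = peak_bandwidth_gb_per_s(bus_width_bits=512, data_rate_mt_per_s=2214)
# ASSUMPTION: 8800 GTX (G80) GDDR3 at 900 MHz, double data rate -> 1800 MT/s.
g80_bw = peak_bandwidth_gb_per_s(bus_width_bits=384, data_rate_mt_per_s=1800)

print(f"GTX 280 (512-bit): {gtx280_bw:.1f} GB/s")
print(f"G80 (384-bit):     {g80_bw:.1f} GB/s")
```

Under those assumed clocks the wider bus and faster memory together give the newer part roughly 140 GB/s of theoretical bandwidth against the mid-80s for G80.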