NVIDIA GeForce GTX 480 and GTX 470 Review - GF100 and Fermi tackle gaming
GF100 arrives and brings Fermi in tow
We have been waiting for this day basically since GT200 was revealed: NVIDIA's next-generation GPU architecture. While AMD has been releasing card after card with DX11 support and Eyefinity gaming, NVIDIA has been forced to sit idly by and await its own hardware. The GF100 GPU is finally here as the GeForce GTX 480 and GTX 470 graphics cards and they are surprisingly potent solutions for gamers!
NVIDIA is calling today's release the "most anticipated GPU launch ever" and while that might be true, I don't think it is for all the reasons NVIDIA is hoping. We have been patiently waiting for the next architecture update from NVIDIA since the GT200 that we first reviewed in June of 2008. That was nearly two years ago - an eternity in the world of high performance computing and gaming. In that time we have seen rebrand after rebrand from NVIDIA and its partners with very little in the way of innovation. Meanwhile AMD has used this lapse to develop and release a DX11-ready graphics series ranging from $50 to $500 with support for new gaming options like Eyefinity.
As the delays for the GF100 architecture, aka Fermi aka GT300 (a LONG time ago), piled up, it became clear that NVIDIA didn't have a 100% winning design on their hands and they were being forced to make do with what they had. As it turns out, the design and architecture behind the two new graphics cards being released today, the GeForce GTX 480 and the GTX 470, have some great advantages over the 5000-series of cards from AMD; but also some noticeable deficits.
For those of you looking for a quick view of the GeForce GTX 480 and GTX 470, please check out the embedded video review here. Feel free to go full screen for 1080p!
Before we dive into the performance of the cards today, let's take a look back at all the information we have learned from NVIDIA leading up to today's launch.
Fermi / GF100 Architecture Highlights
One of the ways NVIDIA was attempting damage control with the GF100 release was to slowly trickle out details about the GPU and its new features ahead of availability. We first saw information about Fermi pop up during the GPU Technology Conference in September of 2009 - there we got details on the architecture and how it would apply to the world of GPU computing. After that, in the final days of CES this past January, NVIDIA spilled the beans on features like tessellation that would be so important for gamers and game developers with DX11.
Our very own Josh Walrath did the write-up here on PC Perspective based on the GF100 data and some of that information is provided below to bring those of you that might have missed out on some of that data up to speed. The full article, with a lot more detail about the GF100 GPU architecture and how it affects gamers, can be found here and should be read in its entirety.
Fast forward to 2010 and we can look at the most advanced consumer graphics chip on the market: the AMD Cypress. This chip is composed of 2.15 BILLION transistors running at a brisk 850 MHz. We are looking at a chip that is basically over 1000 times more complex than the Voodoo Graphics, running at a clockspeed 17 times faster. The AMD HD 5870 is by far the most advanced graphics card on the market at this time, and the first to bring DirectX 11 functionality. While AMD has a virtual stranglehold on DX11 cards with their Evergreen family of chips and cards, NVIDIA has not been standing still.
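As a quick sanity check on that comparison, the math works out using the widely cited Voodoo Graphics figures (roughly one million transistors at 50 MHz) - those baseline numbers are assumptions here, not something from NVIDIA or AMD:

```python
# Cypress figures from the text; Voodoo Graphics figures are the
# commonly quoted ~1M transistors at 50 MHz (assumed for illustration).
cypress_transistors, cypress_mhz = 2_150_000_000, 850
voodoo_transistors, voodoo_mhz = 1_000_000, 50

complexity_ratio = cypress_transistors / voodoo_transistors  # ~2150x
clock_ratio = cypress_mhz / voodoo_mhz                       # 17x
print(complexity_ratio, clock_ratio)
```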
This is the logical diagram of the GF100. 4 GPCs, 16 SMs, and 512 CUDA cores comprise the cutting edge design for NVIDIA.
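Those headline unit counts fall straight out of the hierarchy in the diagram; a back-of-the-envelope check (the 32 cores per SM follows from 512 cores across 16 SMs):

```python
# GF100 unit hierarchy from the logical diagram above.
GPCS = 4
SMS_PER_GPC = 4
CUDA_CORES_PER_SM = 32  # implied by 512 cores / 16 SMs

sms = GPCS * SMS_PER_GPC            # 16 SMs total
cuda_cores = sms * CUDA_CORES_PER_SM  # 512 CUDA cores total
print(sms, cuda_cores)
```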
The primary issues that NVIDIA wanted to address with this design are as follows: image quality, geometry performance, effective tessellation performance, raw pixel pushing power for multi-monitor and 3D Vision performance, and a huge leap in GPGPU capabilities and performance. Unfortunately for NVIDIA, this required a fairly major rethinking of how their architecture is laid out, how information is passed between the functional units, and how many more transistors their goals would require. Previously we have detailed what NVIDIA will do for GPGPU in Ryan's Fermi preview, but that is only the tip of the iceberg when we look at the chip as a whole, from both graphics and GPGPU perspectives.
The ALU is a major change, mainly because previous generations could handle floating point operations only. The advantages that NVIDIA brings to the table with this are nicely summed up in their white paper detailing the GF100 architecture:
NVIDIA has made their GPU far more CPU-like with the addition of fully supported integer functionality. This will have little effect on gaming initially, but further down the road we may see more and more functionality that was once reserved for the CPU to be implemented directly on the GPU. Where it will make a huge initial splash will be in the GPGPU circles, in which the GF100 can be directly programmed to using various C++ level languages.
NVIDIA made another major change in that they included four TEX units in each SM. This improves overall efficiency for the architecture by more closely associating the TEX units with the shader units. They have also increased performance by clocking these TEX units much higher than in previous generations. Each TEX unit is clocked at ½ the speed of the CUDA cores. So if we assume that NVIDIA is aiming for a 1.5 GHz clockspeed for the CUDA cores, then the TEX units will run at 750 MHz, which is still higher than what we see with current GT200 based cards clocked at 600 MHz to 650 MHz (non-overclocked).
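The clock relationship is simple enough to express directly; note that the 1.5 GHz core clock is our assumption from the text, not a confirmed shipping spec:

```python
def tex_clock_mhz(cuda_core_clock_mhz):
    """GF100 TEX units run at half the CUDA core clock."""
    return cuda_core_clock_mhz / 2

# 1500 MHz is the assumed CUDA core clock from the paragraph above.
print(tex_clock_mhz(1500))  # 750.0 MHz, vs ~600-650 MHz TEX clocks on GT200
```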
The final and most significant portion of the improved SM units is the PolyMorph Engine. This is related directly to geometry throughput and tessellation. If we look at previous architectures, geometry performance has improved at a glacial pace as compared to pixel shading performance. NVIDIA figures that pure geometry performance has only increased by a factor of 3X from the GeForce FX 5900 to the GTX 285, but pixel shading performance has improved by 150X. The PolyMorph units are closely associated with the CUDA cores and SMs because of the workload that tessellation incurs. Geometry shading and tessellation require a lot of work from the SM units in general, and that data is frequently passed between the SM and PolyMorph engine, depending on what stage of rendering is being done. There are five stages to each PolyMorph engine (Vertex Fetch, Tessellation, Viewport Transform, Attribute Setup, and Stream Output), and between each stage results are returned to the SM where further work is done, and that work is then sent to the next stage of the PolyMorph Engine. If NVIDIA had decided to slap a dedicated tessellation unit onto their previous designs, it would have incurred a huge latency penalty. By tightly integrating the PolyMorph engine into the SM, and providing 16 SM units per GF100 chip, NVIDIA expects to see a neat 4X+ improvement in tessellation performance as compared to the current HD 5870 cards.
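The back-and-forth between the fixed-function stages and the SM can be sketched as a simple loop - this is purely illustrative pseudo-structure, not anything resembling driver or hardware code:

```python
# Illustrative sketch of the PolyMorph/SM ping-pong: each of the five
# fixed-function stages hands its result back to the SM for shading work
# before the next stage runs.
POLYMORPH_STAGES = [
    "Vertex Fetch",
    "Tessellation",
    "Viewport Transform",
    "Attribute Setup",
    "Stream Output",
]

def process_primitive(primitive, sm_shade):
    """Run a primitive through all five stages, invoking SM work between each."""
    for stage in POLYMORPH_STAGES:
        primitive = f"{primitive} -> {stage}"  # fixed-function stage
        primitive = sm_shade(primitive)        # results returned to the SM
    return primitive
```

A dedicated bolt-on tessellator would instead force a long round trip across the chip at every one of those hand-offs, which is the latency penalty NVIDIA is avoiding here.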
NVIDIA has also tightly coupled the raster unit to each SM. After the PolyMorph engine has processed the primitives, they are sent to the raster engine. There are three stages to the raster engine, and these are Edge Setup, Rasterizer, and Z-Cull. This basically sets up the pixels that are going to be viewed, and discards those that will not. Once this is complete, pixel shading and post-processing are done by the SMs. There are four raster units on the GF100, and each raster unit serves up to 4 SMs.
The next level of the architecture is the GPC units. Each GPC contains four SM units and one raster unit. The four GPCs are then connected to a large 768 KB L2 cache. In previous generations this L2 cache was reserved for read-only texture data. The new L2 cache is now fully writable by the GPCs, TEX units, and ROP partitions as needed. This can dramatically cut down on main memory accesses for frequently used texture information and instructions/data.
The four GPCs share six 64-bit memory controllers, which results in a 384-bit memory bus supporting GDDR5 memory. The host interface for the chip connects directly to the GigaThread Engine, which feeds instructions and data to the GPCs.
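The peak bandwidth implied by that bus is easy to work out; the 4.0 GT/s GDDR5 data rate used below is purely illustrative, since NVIDIA had not confirmed shipping memory clocks when these details were disclosed:

```python
def peak_bandwidth_gbs(controllers, width_bits, data_rate_gtps):
    """Peak memory bandwidth in GB/s: bus width in bytes x effective data rate."""
    bus_width_bytes = controllers * width_bits / 8
    return bus_width_bytes * data_rate_gtps

# GF100: six 64-bit controllers -> 384-bit bus (48 bytes wide).
# 4.0 GT/s is an assumed, round GDDR5 rate for illustration only.
print(peak_bandwidth_gbs(6, 64, 4.0))  # 192.0 GB/s
```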
One last area in which NVIDIA lavished a lot of attention is the ROPs. For a while there, we were primarily shader bound in most of our applications. Now that we have 512 CUDA cores spitting out a tremendous number of pixels, we are no longer being held back there. Instead we have seen a new performance bottleneck, and it is one we have seen before. Pixel fillrate is now very much back in demand, and that is due to the rise of multi-screen gaming and NVIDIA's 3D Vision. The last generation of products from NVIDIA and AMD were able to handle resolutions of 2560x1600 with 4X to 8X anti-aliasing, but with multi-monitor gaming becoming more mainstream, combined with the double fillrate needs of 3D Vision, there needed to be a massive expansion of pixel fillrate capabilities from these cards.
The new Coverage Sample and Transparency AA solutions are a lot more effective than what we see in previous generations of parts.
Adding geometry to models is actually quite simple. The problem is getting good performance out of these multi-million triangle models on current hardware. The memory requirements for such models are phenomenal, which is why the industry has used normal maps applied to simple geometric models. The answer to this problem is tessellation.
Tessellation is the ability to create complex geometries while consuming as little memory as possible. This is done by mathematically increasing the geometry of a model in realtime, rather than relying on a complex fully realized model that takes up a huge amount of memory and bandwidth.
What is essentially happening is that a "simple" model is being loaded into the card, but it comes with some complex mathematical formulas. These formulas, while complex, do not take up a huge amount of space in memory. In cases such as increasing the geometric complexity of a fixed model, these formulas are represented by displacement maps. The SMs then take these simple models, and using the PolyMorph engine and the CUDA cores, apply the displacement maps to the simple models and increase the geometrical complexity. So a 10,000 triangle model is then turned into a 1 million triangle model… all without increasing the memory footprint and bandwidth needs 100 fold even though the model is 100 times more complex. In situations such as fluids, displacement maps are not used; rather we go back to the complex mathematical formulas that dictate a fluid's behavior in precise situations.
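The memory savings in that 10,000-to-1-million-triangle example can be put in rough numbers; the per-triangle byte count here is an illustrative assumption (three vertices of three 4-byte floats each, ignoring indexing and vertex sharing), not a real engine figure:

```python
# Rough illustration of why on-chip tessellation saves memory: storage and
# bandwidth scale with the base mesh, not the amplified one.
BYTES_PER_TRIANGLE = 36  # assumed: 3 vertices x 3 floats x 4 bytes

base_triangles = 10_000
amplified_triangles = 1_000_000

full_model_bytes = amplified_triangles * BYTES_PER_TRIANGLE   # 36 MB up front
tessellated_bytes = base_triangles * BYTES_PER_TRIANGLE       # 0.36 MB, plus a
                                                              # small displacement map
print(full_model_bytes / tessellated_bytes)  # 100.0x less geometry to store and move
```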
The way that NVIDIA has implemented the PolyMorph engine with the SMs should allow a 4x increase in tessellation performance over that of the competition. While tessellation in current DX11 games is limited to certain situations (for example in Dirt 2 tessellation is applied to the crowds watching the races as well as the water), it will become more and more common to apply these effects in future games once tessellating hardware becomes more prevalent.
To soak up all the extra performance that this generation of products will offer gamers, NVIDIA is taking the multi-monitor gaming idea that was originally started by Matrox, implemented correctly by AMD, and putting their own signature touch on it.
Not only will these new cards support up to (and possibly beyond) 3 monitors, they will also allow those extra monitors to use the 3D Vision technology that NVIDIA introduced last year. Early indications point to it working quite well, but frankly I worry about peripheral vision and the shutter glasses. The rods that primarily make up peripheral vision are far more sensitive to light changes than the cones which make up the center of the eye (fovea). Fatigue and nausea may be two of the big issues to potentially sink three-monitor 3D gaming with shutter glasses. Keeping framerates at the 120 fps level for next generation content (120 jittered frames per second across three monitors at high resolution and AA levels) could easily strain even a triple SLI setup.
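The pixel throughput that setup demands is sobering when worked out; the 1920x1080 panel resolution below is an illustrative assumption, and the calculation ignores anti-aliasing overhead entirely:

```python
def pixels_per_second(width, height, monitors, fps_per_eye, stereo=True):
    """Raw pixel throughput needed; shutter-glasses 3D renders one frame per eye."""
    frames = fps_per_eye * (2 if stereo else 1)  # 60 per eye -> 120 total
    return width * height * monitors * frames

# Assumed setup: three 1920x1080 panels, 60 fps per eye (120 fps total).
print(pixels_per_second(1920, 1080, 3, 60))  # ~0.75 billion pixels per second
```

And that is before multisampled AA multiplies the fillrate cost further, which is exactly why the ROP expansion discussed above matters.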
The idea is really neat, but the actual implementation could be tough to sell. A lot of it will revolve around what individual gamers can afford, as well as what they are physiologically able to handle. Some people will be able to handle it fine, while others will just be too sensitive and will have a very negative reaction. This will definitely be a "try before you buy" scenario for most hardcore gamers willing to invest in such a solution.
This was the data as we knew it in January; but some things have changed. Most importantly, we are NOT getting a 512 CUDA core graphics card option. At least not today. Let's look over the two new cards and their specs on the following pages.