Diamond Multimedia Viper HD 2900 XT 1GB GDDR4 Review
A Brief Historical Background
If you can sit back and reminisce with me for just a bit...the date was early in 2007 and the world was a-flutter with rumors of ATI's R600 GPU architecture. The rumor mills were in full swing and both a Radeon HD 2900 XT and HD 2900 XTX were said to be in the works. The difference? 1GB of GDDR4 memory on the XTX and 512MB of GDDR3 on the XT. But when the product FINALLY launched in May, there was no XTX model to be seen and the top end HD 2900 XT did indeed have 512MB of GDDR3 on-board.
That left many of us wondering what happened to the XTX model and the 1GB revision we were looking forward to. AMD basically told us nothing on the matter, and it wasn't until the end of June that the idea popped up again; this time Diamond was the source. They were going to be sending us a pair of these elusive 1GB HD 2900 cards. Still called the HD 2900 XT, would the added memory and memory speed be enough to help AMD's product against the ever-tightening grasp of NVIDIA's 8-series of graphics cards?
A Brief History - the R600 Architecture Revisited
The following is a slightly condensed version of the R600 architecture evaluation that originally appeared in the AMD Radeon HD 2900 XT launch article. If you have already read it, or want to jump straight to the info on the new Diamond Multimedia Viper HD 2900 XT 1GB card itself, please do so.
The R600 architecture is a major step beyond the R580 architecture we saw in ATI's X1900 XT graphics cards and lineup, one that goes down the road of unified shaders, a route required by Microsoft's DirectX 10 and first implemented in the G80 architecture from NVIDIA. AMD claims that the new HD 2000 series of graphics cards takes the best features of the X1000 series (like dynamic branching and stream computing), combines them with advantages from the Xbox 360's Xenos GPU (such as unified shaders and stream out) and adds new technology (like DX10 support, a superscalar shader processor and an updated dispatch processor) to make a truly amazing product.
At least that's what they claim.
Here is the very big, very complex diagram of the new architecture behind the R600, specifically the flagship HD 2900 XT card that we are reviewing here today. I'll attempt to break this down piece by piece with more detailed descriptions and explanations of what the various components we are seeing here actually do.
The ATI Radeon HD 2900 XT card (which again is the one shown above) includes 320 stream processors, 4 SIMD units, 4 texture units and 4 render back ends (ROPs). The modular design of the core allows the lower models like the HD 2600 and HD 2400 to have a subset of those units and run with the same features. We'll detail the product lineup towards the end of the detailed discussion.
The very first unit that data must pass through when entering the GPU is the command processor, which is responsible for handling the stream of data from the graphics driver. AMD claims that the upgraded unit can offload as much as 30% of the CPU overhead associated with batch submission from the driver, allowing for better overall performance.
The setup engine is the unit responsible for preparing the data and organizing it for processing by the various SPUs (stream processing units). It mainly performs three functions, one for each type of processing work the SPUs might do: vertex assembly and tessellation, geometry assembly, and scan conversion for pixel shaders. Each of these can submit threads to the dispatch processor sitting below it.
The ultra-threaded dispatch processor is one of the most complex parts of the new R600 architecture and has a lot of critical logic built into it that performance depends on. Its main function is to order and send off for processing the various "threads" that have been created, each of which includes a list of instructions to be executed on some data (be it textures, pixels, etc.).
The 320 stream processors are divided into four 80-SPU blocks that each depend on a set of arbiters and sequencers responsible for selecting the thread to submit. In connection with the SIMD arrays, the dual arbiters allow two operations at a time to be processed by the SPUs, indicating a superscalar architecture at work -- mostly. Each of these threads can be "bumped" and its state saved in order to let more critical threads pass through, resuming later at the dispatcher's whim. This ability to be "bumped" also helps with latency hiding: a thread waiting on memory access before it can continue processing won't stall the pipeline.
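To make the "bumping" idea concrete, here is a minimal Python sketch of the concept -- purely illustrative, not AMD's actual hardware logic -- where a dispatcher sets aside a thread stalled on a memory fetch, preserves its state, and lets another ready thread run in the meantime:

```python
# Conceptual sketch of bump-on-stall thread dispatch. Thread names,
# instruction mnemonics and the round-robin policy are all invented
# for illustration; the R600's real arbitration is far more involved.
from collections import deque

class Thread:
    def __init__(self, name, instructions):
        self.name = name
        self.pc = 0                      # saved state: program counter
        self.instructions = instructions

def dispatch(threads):
    """Run threads round-robin; a thread hitting MEM_WAIT is 'bumped'
    (state kept, requeued) so it never stalls the pipeline."""
    ready = deque(threads)
    issued = []
    while ready:
        t = ready.popleft()
        instr = t.instructions[t.pc]
        if instr == "MEM_WAIT":
            # Bump: program counter is preserved, thread resumes later
            # while other work keeps the execution units busy.
            t.pc += 1
            ready.append(t)
            continue
        issued.append((t.name, instr))
        t.pc += 1
        if t.pc < len(t.instructions):
            ready.append(t)
    return issued

threads = [Thread("A", ["ALU", "MEM_WAIT", "ALU"]),
           Thread("B", ["ALU", "ALU"])]
print(dispatch(threads))
```

Note how thread B's second ALU operation issues while thread A is parked on its memory wait; that interleaving is the whole point of the ultra-threaded design.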
You will also notice on the right hand side that dedicated caches exist for shader constants and instructions to allow for unlimited shader program length. On the left hand side of the SPUs you'll find the all-important memory read/write cache, which is what allows for interthread communication and is a requirement of the new DX10 architecture. The stream out feature allows the shaders to bypass the ROPs and color buffer, or to output sequential data instead of bitmaps; useful for that whole GPGPU thing.
The SIMD arrays of the R600 architecture are still a bit confusing to me. AMD claims that they use VLIW (very long instruction word) design to include up to six operations (5 math and 1 flow control) and all six can be performed in parallel on each data element in that current thread. Note though that the vertex and texture fetching is done separately and does not attempt to take advantage of these SIMD benefits.
Here is a detailed shot of a single stream processing unit. You'll notice in the diagram above (the big one) that there are only 64 of these; so where do the 320 SPUs AMD claims come from? Well, there are five SPUs inside each block and 5 x 64 = 320, so there you go. Now, some people (namely NVIDIA) will claim that the comparison to the G80's 128 SPUs is unfair because of how much work each unit actually does per clock. It's a very complicated debate that goes deep into the theories of CPU design, so we'll leave it at this: AMD claims 320 SPUs, but performance results are what matters.
In any event, the arranged 5-way superscalar shader processor can issue up to five FP MAD (floating point Multiply-Add) instructions, IF the thread is able to fill that wide of a processing request. Each SPU is fully 32-bit FP but still supports integer operations (of course). The little branch unit on the right is there to handle flow control commands and just as with the R580 can eliminate flow control performance overhead. Finally, the general purpose registers store the input/output data to share their processing results.
AMD is definitely proud of their SPUs and mentioned to us over and over how the card could achieve 475 GigaFLOPS (billions of floating point operations per second). This table shows you how the 2900, 2600, 2400 and a desktop processor line up in potential for raw processing power. This is exactly why the GPU Folding@Home and stream computing initiatives are gathering so much steam.
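For the curious, that 475 GigaFLOPS figure falls out of some simple arithmetic if we assume the HD 2900 XT's 742 MHz core clock and count each MAD (multiply-add) as two floating point operations:

```python
# Back-of-the-envelope check of AMD's 475 GFLOPS claim for the HD 2900 XT.
# Assumptions: 742 MHz core clock and every one of the 320 stream
# processors issuing one MAD (= 2 FLOPs) per clock -- a best case.
simd_blocks = 4
sp_per_block = 80                # 16 shader units x 5-way superscalar SPUs
stream_processors = simd_blocks * sp_per_block   # 320 total
core_clock_ghz = 0.742
flops_per_mad = 2                # a multiply plus an add

gflops = stream_processors * flops_per_mad * core_clock_ghz
print(f"{gflops:.1f} GFLOPS")    # ~474.9, matching AMD's ~475 claim
```

Of course, real shader code rarely keeps all five lanes of every SPU filled with MADs, which is why this is a peak theoretical number rather than something you'd see in a game.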
The R600's four texture units each have eight address processors that execute shader instructions to control their lookups, twenty texture samplers that can each fetch a single data value per clock, and four floating point filter units that handle 64-bit and 128-bit bilinear filtering. The HD 2900 and 2600 feature a two-level texture cache design (256KB for the HD 2900 and 128KB for the HD 2600) while the HD 2400 uses a single-level cache.
AMD's texture units can do 64-bit HDR texture bilinear filtering at full speed and 128-bit FP textures at half speed. Interestingly, users that preferred the "High Quality" texture filtering setting in AMD's previous graphics generation will be glad to know that in the R600, that quality setting is now the default. NVIDIA's G80 architecture implemented that as well last year.
The render back-ends, as AMD is fond of calling them, are known as ROPs to the NVIDIA crowd and are responsible for post-processing, anti-aliasing and depth tests among other things. They can handle up to 32 pixels per clock in stencil tests on the Radeon HD 2900, while the 2600 and 2400 models only get 8 pixels per clock. A neat new feature is that the MSAA resolve functions are programmable, which allows for custom filter AA -- something AMD was showing off as well. We'll cover that on a later page.
The depth, stencil and compression functions on the render back-ends were improved significantly over the X1000 series:
This simple array of stencil shadow tests shows that the HD 2000 series is seeing a 25-95% gain depending on the application.
Overall I'd say that the R600 architecture is actually much more similar to NVIDIA's G80 than I expected, but with the requirements of unified shaders and other features from DirectX 10, it's likely that being forced to meet them pushed both sets of engineers in the same general direction. AMD's R600 does feature a superscalar design (versus the scalar architecture on the G80) and a new custom filtered AA option, though, that could help it be a success. We'll have to see the benchmarks first before passing out any congratulatory cigars.
Updated Ring-bus Memory Controller
The new R600 architecture is also sporting a new version of the ring-bus memory controller used on the R580 design. This time the interface is actually 512-bits, surpassing the 384-bit bus that the GeForce 8800 has. The larger bus width allows AMD to get more bang out of existing memory technology by giving more of the GPU access to it at the same time. Memory can also run slower and thus cooler and still maintain the same bandwidth values as previous generations, should they decide to go that route.
In this version of the ring bus, the controller is fully distributed instead of having a central arbiter with crossbars as we saw in the R580. Because the design of those crossbars was not as flexible as the ring bus itself, they were removed in favor of a model where arbitration happens only at each "stop" on the memory ring.
The memory controller on the HD 2900 XT flagship card is divided up into eight separate 64-bit memory channels for a total of 512 bits. Each channel has access to portions of the frame buffer (64MB each in this case) and memory accesses are done by circling around the ring bus to each "stop" to find the proper location. While it may sound inefficient at first, the bus moves quickly enough that AMD's solution doesn't suffer any real drawbacks, and in fact may deliver the largest memory bandwidth numbers we've seen.
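As a quick sanity check on those bandwidth claims, here is the napkin math in Python. The memory clocks are our assumptions: roughly 828 MHz GDDR3 (1656 MHz effective) on the stock 512MB XT, and roughly 1000 MHz GDDR4 (2000 MHz effective) on the 1GB card reviewed here:

```python
# Rough peak-bandwidth math for the R600's 512-bit ring bus.
# Effective memory clocks below are assumptions for illustration:
# GDDR3 at ~1656 MHz effective (stock 512MB XT), GDDR4 at ~2000 MHz
# effective (the 1GB card in this review).
bus_width_bits = 512
bus_width_bytes = bus_width_bits // 8        # 64 bytes per transfer

def bandwidth_gbps(effective_clock_mhz):
    """Peak theoretical bandwidth in GB/s for the 512-bit bus."""
    return bus_width_bytes * effective_clock_mhz / 1000

print(f"GDDR3 512MB XT: {bandwidth_gbps(1656):.1f} GB/s")   # ~106.0
print(f"GDDR4 1GB card: {bandwidth_gbps(2000):.1f} GB/s")   # 128.0
```

The wide bus is doing the heavy lifting here: even at the same memory clock, 512 bits moves a third more data per transfer than the GeForce 8800 GTX's 384-bit bus.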