RV770 Architecture - a lot has changed
Now that we have stepped through a high-level look at how AMD has positioned the RV770 design, we can dive head first into the technical guts that push the GPU to the performance levels it achieves.
This very large, very complicated block diagram contains all the component detail of the RV770 core that we will break down as we progress through the architecture.
The RV770 core consists of 10 SIMD cores, each of which contains 80 stream processors, for a total of 800 SPs.
This shot shows two of those SIMD cores being fed by the threaded dispatcher. Each of the 10 cores includes a 16KB block of memory for local data sharing and has its own control logic for a shared set of threads. Past the local data share sits a block of four dedicated texture units with a dedicated L1 cache, and all 10 SIMD cores have access to a global cache for data sharing.
A close-up of one of the blocks of 5 shader processors shows four single precision SPs and a single double precision capable SP, as well as some general purpose data registers and a branch execution unit that was first introduced with the R580 architecture. AMD still claims the performance lead when it comes to raw double precision processing power (about 240 GigaFLOPS) as well as a 40% increase in performance per mm^2 of die space, achieved by shrinking these SPs dramatically.
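For readers who want to see where a number like 240 GigaFLOPS could come from, here is a rough back-of-the-envelope sketch. It rests on two assumptions that are not stated above: that each block of 5 shader processors completes one double precision multiply-add per clock, and that the core runs at 750 MHz (the clock of the faster of the two launch cards). Treat it as one plausible way to reach the figure, not AMD's official math.

```python
# Rough estimate of peak double precision throughput on RV770.
SIMD_CORES = 10        # from the article
SPS_PER_SIMD = 80      # from the article
SPS_PER_BLOCK = 5      # from the article: blocks of 5 shader processors

CORE_CLOCK_GHZ = 0.750           # ASSUMPTION: clock of the faster launch card
DP_FMA_PER_BLOCK_PER_CLOCK = 1   # ASSUMPTION: one DP multiply-add per block per clock
FLOPS_PER_FMA = 2                # a multiply-add counts as two floating point operations


def dp_gflops(clock_ghz=CORE_CLOCK_GHZ):
    """Return estimated peak double precision GFLOPS at the given core clock."""
    blocks = SIMD_CORES * SPS_PER_SIMD // SPS_PER_BLOCK  # 160 blocks of 5 SPs
    return blocks * DP_FMA_PER_BLOCK_PER_CLOCK * FLOPS_PER_FMA * clock_ghz


if __name__ == "__main__":
    print(f"Estimated peak DP rate: {dp_gflops():.0f} GFLOPS")
```

Under those assumptions the estimate lines up with the "about 240 GigaFLOPS" claim; a lower core clock scales the result down proportionally.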
Each of the 10 SIMD cores has four dedicated texture units for a total of 40 on the RV770 design. These newly redesigned components double the texture cache bandwidth provided by the HD 3000-series of cards and enable a 2.5x performance increase in 32-bit filtering rates. The combination of 40 texture units and four address processors in each allows the HD 4800-series of cards to run 160 texture fetches total per clock.
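The unit counts quoted so far all fall out of simple multiplication, which the short sketch below tallies. The constant names are made up for illustration, and it assumes the four address processors are counted per texture unit, which is the reading that makes the 160 fetches per clock figure work out.

```python
# Tally of RV770 execution and texture resources from the per-SIMD numbers.
SIMD_CORES = 10                          # from the article
SPS_PER_SIMD = 80                        # from the article
TEXTURE_UNITS_PER_SIMD = 4               # from the article
ADDRESS_PROCESSORS_PER_TEXTURE_UNIT = 4  # ASSUMPTION: "four address processors" is per unit


def rv770_totals():
    """Return chip-wide totals derived from the per-SIMD counts."""
    texture_units = SIMD_CORES * TEXTURE_UNITS_PER_SIMD
    return {
        "stream_processors": SIMD_CORES * SPS_PER_SIMD,
        "texture_units": texture_units,
        "texture_fetches_per_clock": texture_units * ADDRESS_PROCESSORS_PER_TEXTURE_UNIT,
    }


if __name__ == "__main__":
    for name, value in rv770_totals().items():
        print(f"{name}: {value}")
```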
The texture processors on the RV770 also encompass a completely new cache design that aligns the L2 cache with the memory channels, much in the same way the new NVIDIA GT200 design does. The L1 caches, as we mentioned above, store data unique to each of the 10 SIMD cores, resulting in a 5x increase in effective L1 storage over RV670. The cache bandwidth has also increased dramatically, to as much as 480 GB/sec of L1 texture fetch bandwidth. Based on the clock speeds of the two RV770 products being introduced today, it all adds up to texture performance rates of 652 Gtex/sec and 783 Gtex/sec.
The Render Back-Ends (or ROPs as others call them) saw some changes as well, including a focus on improving the AA performance per mm^2 of die size. There are four Render Back-Ends, matched up with the four memory channels, that handle all the color and AA work on the RV770. In both 32-bit and 64-bit versions of MSAA the RV770 has basically doubled the performance: up to 16 pixels per clock for 2x and 4x samples and up to 8 pixels per clock for 8xAA.
Of course these updated back-ends continue to support both fixed function and programmable (CFAA) modes, though I have honestly yet to see anyone take advantage of the potential for custom filters.
There is at least one new CFAA filter option being introduced with the RV770 architecture: an updated edge detect AA option that delivers 12x and 24x modes. The benefit to this version is that blurring can be avoided by taking samples ALONG the edges of aliased boundaries instead of across them. This will apparently work with adaptive AA and should only use as much memory as standard 4x and 8x MSAA options.
Above is an example of how the new 24x Edge Detect CFAA filter can affect your image quality -- we'll be testing out CFAA features very soon.
Probably the biggest shift from the RV670 comes in the form of the RV770's memory controller. Shunning the ring-bus memory controller that was introduced with the R520 (and that was supposedly going to scale for years into the future), the RV770 returns to a more traditional distributed hub design. Unlike the R600 that had a 512-bit memory bus, the RV770 has four independent 64-bit controllers totaling 256 bits wide.
Each of the four controllers sits near a block of render back-ends and primary L2 cache, which are the biggest bandwidth consumers. The lower speed interfaces, PCI Express, CrossFire and the display controllers, connect through the same controller hub.
Though the new memory controller is only 256 bits wide, it does support both GDDR3 and GDDR5 memory technologies. The move to GDDR5 is an important one for AMD as it helps to provide all 800 shader processors with enough bandwidth to compete with the new 512-bit memory controller that NVIDIA used on the GT200. Providing over 115 GB/sec of bandwidth on a 256-bit wide bus is pretty impressive considering the GT200, using a 512-bit bus, can push about 141 GB/sec of bandwidth.
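To put those bandwidth numbers in context, peak memory bandwidth is just bus width times the per-pin data rate, divided by 8 to convert bits to bytes. The sketch below runs that arithmetic. The per-pin data rates are assumptions not given above: roughly 3.6 Gbps for the GDDR5 on the faster RV770 card and roughly 2.214 Gbps for the GDDR3 on the top GT200 card.

```python
# Peak memory bandwidth from bus width and per-pin data rate.
def bandwidth_gb_per_sec(bus_width_bits, data_rate_gbps_per_pin):
    """Return peak bandwidth in GB/sec (bits per second across the bus, divided by 8)."""
    return bus_width_bits * data_rate_gbps_per_pin / 8


# Bus widths come from the article; data rates are ASSUMPTIONS for illustration.
RV770_GDDR5 = {"bus_width_bits": 256, "data_rate_gbps_per_pin": 3.6}
GT200_GDDR3 = {"bus_width_bits": 512, "data_rate_gbps_per_pin": 2.214}

if __name__ == "__main__":
    rv770 = bandwidth_gb_per_sec(**RV770_GDDR5)
    gt200 = bandwidth_gb_per_sec(**GT200_GDDR3)
    print(f"RV770, 256-bit GDDR5: {rv770:.1f} GB/sec")
    print(f"GT200, 512-bit GDDR3: {gt200:.1f} GB/sec")
    print(f"RV770 reaches {rv770 / gt200:.0%} of GT200's bandwidth on half the bus width")
```

The takeaway is that GDDR5's higher per-pin data rate lets AMD get most of the way to NVIDIA's bandwidth with a bus half as wide.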
All this, and a little bit more we'll touch on later, make up the RV770 and the future of AMD's GPU technology. Now let's see the products AMD built around this design.