Industry Dirt: GTX 460 Saves Bacon and a Trip to Southern Islands

Subject: Editorial
Manufacturer: AMD

A Summer Cruise Through Shaded and Tessellated Waters

    While I still haven’t been briefed on NVIDIA’s upcoming GTX 460 parts, there are a few things that I can safely talk about and be *mostly* correct.

    Fermi has been a rough architecture for NVIDIA.  The design goals were very aggressive, and a lot of them depended on a much better rollout of TSMC’s 40 nm process than was actually encountered.  Not only was TSMC’s 40 nm process late, but it had some serious issues which impacted designs from multiple partners.  AMD was able to avoid most of these pitfalls by taking a more conservative approach to GPU design, and by thoroughly experimenting with well known designs on the 40 nm process (see the Radeon HD 4770).  Even then AMD had a struggle on their hands to get their HD 5000 series of chips to market in appreciable numbers.

    NVIDIA took a seemingly different approach, and was bitten quite hard for it.  First off they did produce a lower end 40 nm chip based on the previous generation, but it was significantly later in development than what AMD attempted with the RV740 (HD 4770).  So whatever lessons were learned about TSMC’s 40 nm process on this smaller, simpler, and better known chip could not be applied to NVIDIA’s Fermi architecture, which was too far into final development to change in any significant way.  AMD on the other hand was able to figure out some of the defects on the process, and built in workarounds.

GF100 is composed of 4 GPCs, with 4 SMs per GPC.  Logically most would assume that a derivative part would feature two GPCs, but that would leave far too large of a gap between the top end designs and a midrange card.  Instead we will either see a 3 GPC design, or a 4 GPC design, but with 3 SMs per GPC.  Caches will likely be cut by 2/3.
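To make the gap argument concrete, here is a quick sketch that counts SMs for each of the layouts mentioned.  The function and variable names are my own, and only the GPC and SM counts come from the caption; nothing here is a confirmed GF104 specification.

```python
# Count SMs for the full GF100 layout and the derivative layouts discussed.
# Only the GPC/SM counts are taken from the text; the names are illustrative.

def sm_count(gpcs: int, sms_per_gpc: int) -> int:
    """Total streaming multiprocessors for a given GPC layout."""
    return gpcs * sms_per_gpc

full_gf100 = sm_count(4, 4)  # 4 GPCs x 4 SMs each

candidates = {
    "2 GPCs x 4 SMs": sm_count(2, 4),
    "3 GPCs x 4 SMs": sm_count(3, 4),
    "4 GPCs x 3 SMs": sm_count(4, 3),
}

for layout, sms in candidates.items():
    share = sms / full_gf100
    print(f"{layout}: {sms} SMs ({share:.0%} of a full GF100)")
```

A two GPC part would carry half the SMs of the full chip, while either of the other two layouts lands at three quarters, which is the smaller gap the caption is pointing at.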

    Even with workarounds in place, AMD was unable to meet demand for their parts until early 2010.  Though AMD was able to handle their designs a bit more adeptly than NVIDIA did, they still faced many of the same issues as their primary competitor.  In Q1 of this year it appears as though TSMC solved most of the problems with their latest, cutting edge process technology.  A year later than expected, that is.

    The Fermi GF100 chip is now finally in full swing, and there are good quantities of cards available to buyers.  Personally, I like where NVIDIA has gone with their architecture.  The way it is set up provides a very high utilization rate for the stream processors (CUDA cores), and it is also able to switch workloads with the lowest possible latency (e.g. changing from pixel shading to stream computing to tessellation).  The main problem of course is that it is almost too big of a chip for the process available.  I think Fermi will really shine at 28 nm, but at 40 nm it is very reminiscent of ATI’s R600.  Remember back to the HD 2900 XT days when that card was notorious for being “hot and power hungry”?  Sound familiar?  Well, that particular issue was fixed when ATI went from TSMC’s 80 nm HP process to the much improved 55 nm process.  The revised HD 3870 had the same overall performance, but at one half the power consumption.

    Essentially what NVIDIA had to do was make minor revision changes to the GF100 to get it running “good enough” for the limitations of the 40 nm process, and it was unable to do the major revisions needed due to time-to-market pressures.  By the time the major issues with TSMC were known, the chip was too far down the design pipeline to be fully optimized for the process.

The Polymorph Engine is the feature that NVIDIA is really pushing.  Instead of having a standalone tessellator that services the entire chip, NVIDIA has multiple tessellator units spread throughout the chip which leverage the CUDA cores to do the work.  This allows a more powerful and scalable tessellator as compared to AMD.

    The upcoming GTX 460 video card is going to be based on the GF104 chip, and it will not face the same issues as the GF100.  By Spring of 2009 the limitations of TSMC’s 40 nm process became widely known and recognized.  While too late to make the major modifications for the huge and highly complex GF100, NVIDIA was able to refocus their subsequent design work on derivative parts to work around those limitations.

    The GF104 is first off a much smaller part.  Exact specifications are not widely known, but it appears to be around 2/3 to 3/4 the size of a full GF100.  Most importantly, NVIDIA was able to change the design around to account for issues with via quality on TSMC’s process, as well as the well known variations in channel length that are well beyond spec.  TSMC has done a lot of work to improve these issues at the process level, but in this case it didn’t hurt to adjust the design to take these into account.

    GF104 (and GTX 460) looks to be what Fermi needed.  It should be far more power efficient due to changes in both design and smaller die size, and better able to meet clock speed bins without breaking the power budget.  These chips are apparently flowing quite nicely out of the fab, and NVIDIA is preparing to release a lot of boards onto the market next month.  The design is aimed at the $199 to $249 market, which is a pretty meaty and influential price point.  Currently the Radeon HD 5830 exists in this spot, and it doesn’t particularly excel at this price.  The Radeon HD 5770 nearly matches the performance of the HD 5830, and is a full $60 to $70 cheaper.

    Performance of the GTX 460 looks to come close to, or match, the Radeon HD 5850, but at a lower price.  AMD will likely respond to this by lowering prices as well, and slowly phasing out the HD 5830.  HD 5770 and 5750 prices will see a small decrease, but not much because NVIDIA is still several months away from releasing their smaller chips at that price range.  One area that NVIDIA looks to really push is the case of running GTX 460s in SLI.  Two of these cards should give GTX 480+ performance at sub-GTX 480 prices.  We are also seeing a greater emphasis on SLI motherboards these days from partners, and in particular MSI looks to push SLI on AMD once the GTX 460 becomes widely available.

    The GTX 460 looks to save NVIDIA’s bacon with this 40 nm generation of parts.  Smaller die size, better yields, fewer heat and power concerns.  All wrapped up in a nice sub-$250 price.  Could this be NVIDIA’s new “G92”?  Perhaps.  Even though it may not reach the epic proportions of renaming and a significant lifespan, it certainly will allow a much greater penetration of Fermi technology into the marketplace.

AMD’s Southern Islands

    Be careful, there is some rampant speculation on this trip.  The only actual information that we have on this architecture is essentially limited to two pieces.  It will still be based on TSMC’s 40 nm process, and the stream units from the Evergreen series of parts are kept intact.  What has changed is the “uncore” portion, such as the scheduler, memory controllers, etc.  I think that we can look at the current limitations of the HD 5000 series and perhaps catch a glimpse as to what to expect.

    The area that probably deserves the most work is that of tessellation.  In its current form, all HD 5000 series parts share the same monolithic tessellation unit.  This is in direct contrast to NVIDIA’s units, which leverage the stream units to be much more powerful and scalable, depending on the product.  When looking only at tessellation performance, the GTX 480 far exceeds the HD 5870.  It is almost backwards thinking to consider that the tessellation performance of the HD 5870 is identical to that of the HD 5400 series of cards.  It is not much of a jump to consider that this will be addressed with the Southern Islands products.  What exact form it will take is certainly in question.  Will it be multiple tessellation units attached to individual SIMD units?  Will it be a beefed up monolithic unit?

Cards such as the Asus EAH5750 and the MSI R5770 Hawk (top two) have proven to be quite successful, and AMD will nearly have their second generation of midrange/budget parts by the time NVIDIA releases their first generation competing designs.

    The second area which will probably get some attention is that of stream computing efficiency.  While the actual stream units will be unchanged from the previous generation, how the data is delivered to said stream units can be improved upon.  The raw floating point performance of the HD 5870 is impressive, and achieving single precision rates of 2.7 TFLOPS is amazing.  Unfortunately, actual throughput is typically less than even a previous generation NVIDIA GTX 285 in most GPGPU workloads.  AMD will not be able to work miracles by changing around the non-stream portions of the design, but they should be able to improve throughput and compatibility.
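    For those wondering where a figure like 2.7 TFLOPS comes from, here is a rough sketch of the usual peak calculation.  The 1600 stream processors and 850 MHz core clock are the HD 5870's published specifications rather than numbers from this article, and the function name is simply my own.  It assumes each stream processor retires one multiply-add (two floating point operations) per clock, which is a theoretical ceiling and says nothing about real GPGPU throughput.

```python
# Theoretical peak single precision rate: units x FLOPs per clock x clock.
# Inputs are published HD 5870 specs (assumed here, not stated in the article).

def peak_gflops(stream_processors: int, clock_mhz: float,
                flops_per_clock: int = 2) -> float:
    """Peak GFLOPS, assuming every unit issues one multiply-add per clock."""
    return stream_processors * flops_per_clock * clock_mhz / 1000.0

hd5870_peak = peak_gflops(stream_processors=1600, clock_mhz=850)
print(f"HD 5870 theoretical peak: {hd5870_peak / 1000:.2f} TFLOPS")
```

    The number only holds if every stream unit is fed on every clock, which is exactly the part of the design that sits outside the stream units themselves.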

    Where else can AMD go?  Well, more work will likely be done on the memory controllers as well.  Will we see AMD go from 4 x 64 bit units to 5 or 6 of them?  Good question.  The size of the chip is going to be larger due to the above mentioned changes, so it might not be unrealistic for AMD to increase overall memory bandwidth with current generation GDDR-5 by going to a 320 bit or 384 bit controller.  The larger the die, the more space for I/O pads on the design.  We also must consider that we will not see any big changes in GDDR-5 speeds throughout the rest of this year.
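    To put rough numbers on the memory question, here is a back-of-the-envelope sketch.  The 64 bit controller width comes from the paragraph above; the 4.8 Gbps per-pin data rate is the HD 5870's stock GDDR-5 speed, which I am assuming stays flat since memory speeds are not expected to move much this year.  The function names are my own, and the 5 and 6 controller cases are pure speculation, not leaked specifications.

```python
# Memory bandwidth as a function of controller count, at a fixed GDDR-5 speed.
# 4.8 Gbps per pin is the HD 5870's stock rate (an assumption for this sketch).

def bus_width_bits(controllers: int, bits_per_controller: int = 64) -> int:
    """Total memory bus width for a given number of controllers."""
    return controllers * bits_per_controller

def bandwidth_gb_s(bus_bits: int, data_rate_gbps: float) -> float:
    """Bandwidth in GB/s: bytes moved per transfer times transfers per second."""
    return bus_bits / 8 * data_rate_gbps

DATA_RATE_GBPS = 4.8  # assumed constant across all three configurations

for controllers in (4, 5, 6):
    width = bus_width_bits(controllers)
    bandwidth = bandwidth_gb_s(width, DATA_RATE_GBPS)
    print(f"{controllers} x 64 bit = {width} bit bus: {bandwidth:.1f} GB/s")
```

    With memory speed held constant, bandwidth scales directly with bus width, so going from 256 bit to 384 bit would be a 50% increase without touching the memory chips.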

    Power and heat will remain within the same tolerances as current HD 5000 parts, even though the chips will likely be bigger than their previous counterparts.  TSMC’s 40 nm process is now much more optimized, and much better understood in terms of design issues, so AMD can further optimize the part so it does not break any thermal or power envelopes as currently defined.  While this will be the biggest part that AMD has introduced since the R600, it will still be much smaller than the NVIDIA GF100.

    We will not see these parts until at least late Q3 of this year, and perhaps early Q4.  But from all indications this refresh of the DX11 lineup for AMD will still be fairly significant.