AMD Fusion System Architecture Overview - Southern Isle GPUs and Beyond
More Memory and Cache Changes
This memory virtualization will be shared between discrete GPUs as well as integrated parts. The iGPUs of course have access to the CPU’s memory controller, and in the Fusion parts the GPU is actually given priority over the CPU. As Eric Demers described it, current quad core CPUs really only take up around 8 to 12 GB/sec of bandwidth when fully loaded. This explains why we have not seen much of a performance increase as of late from pairing modern processors with fast DDR3. Going from DDR3-1333 to DDR3-1600 would often show no real performance improvement on the AMD side, and on the Intel side speeds higher than DDR3-1333 are not even officially supported. The iGPU changes that. AMD reworked its memory controller, and it can feed upwards of 30 GB/sec of data to the GPU. Stream benchmarks will not show this kind of utilization, but in testing there are very distinct performance improvements in graphics applications when going from DDR3-1333 up to the top DDR3-1866 speed supported by the new AMD processors.
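The bandwidth figures above are easier to judge against the theoretical peak of each memory speed. The short Python sketch below does that arithmetic. It assumes a dual-channel configuration with standard 64-bit (8-byte) DDR3 channels and uses the nominal transfer rates; the function name and the channel count are our own assumptions for illustration, not figures supplied by AMD.

```python
# Theoretical peak DDR3 bandwidth, assuming dual-channel operation with
# 64-bit (8-byte) channels. These are assumptions for illustration only.

BYTES_PER_TRANSFER = 8   # one 64-bit channel moves 8 bytes per transfer
CHANNELS = 2             # assumed dual-channel platform


def peak_bandwidth_gbs(mega_transfers_per_sec, channels=CHANNELS):
    """Return the theoretical peak bandwidth in GB/sec (1 GB = 1000 MB)."""
    mb_per_sec = mega_transfers_per_sec * BYTES_PER_TRANSFER * channels
    return mb_per_sec / 1000


if __name__ == "__main__":
    for speed in (1333, 1600, 1866):
        print(f"DDR3-{speed}: {peak_bandwidth_gbs(speed):.1f} GB/sec peak")
```

Under those assumptions DDR3-1333 tops out at roughly 21 GB/sec and DDR3-1866 at just under 30 GB/sec. A CPU that only draws 8 to 12 GB/sec never gets near either ceiling, while a GPU that can consume upwards of 30 GB/sec is limited by the slower memory.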
On the discrete GPU side, virtual memory will be accomplished by tunneling through the PCI-E connection. For both iGPU and dGPU to work in this manner, an IOMMU must be present and supported by the OS. The platform is really the primary goal for all of these changes. By implementing these things in the platform, we should have better memory management plus the virtual memory. This will help to make programming easier, and therefore more approachable to developers who may not have the time or budget to address a more closed system. Parallelism between the CPU and GPU will be complementary rather than antagonistic. The CPU can handle the more serial operations, while the GPU goes for the highly parallel. With the shared virtual memory space, the CPU and GPU can schedule work for each other. AMD wants to keep the platform open and work with other IHVs and ISVs to address their needs. Finally, AMD has worked very hard to improve overall memory efficiency, not just in the caches, but also in the memory controllers and the virtualization. With each generation of cards starting with the HD 2900 XT, AMD has focused on adding compute enhancements, but at a rate which would not significantly impact die size or graphics performance. Here AMD has taken a big step towards far greater compute performance, but has also looked to improve graphics performance with these particular changes.
Double precision is again supported, but it is far more flexible than in previous iterations. Peak double precision on older parts was typically 1/5th that of single precision on VLIW-5 and 1/4th on VLIW-4, due to the nature of those architectures. With the new vector based units, peak double precision is now up to one half that of single precision. But AMD has given itself the option to turn down that performance, depending on the product. For the top end FirePro cards, we would see one half. For the top end gaming card, we might see one fourth. For mainstream and integrated, we could see as low as one sixteenth. AMD has stated that all of its products based upon this architecture will have the ability to do double precision; it is just a matter of how much performance is enabled.
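To put those ratios in perspective, the Python sketch below applies each one to a single precision peak. The 4.0 TFLOPS figure is a made-up placeholder rather than a specification for any announced part, and the tier names are simply our own labels for the product classes mentioned above.

```python
# Peak double precision throughput for each product tier, derived from a
# single precision peak. Tier names and the example SP figure are
# hypothetical; only the ratios come from the article.

DP_DIVISOR = {
    "firepro": 2,       # one half of single precision
    "gaming": 4,        # one fourth
    "mainstream": 16,   # one sixteenth (mainstream and integrated)
}


def peak_dp_tflops(sp_tflops, tier):
    """Return peak double precision TFLOPS for a given SP peak and tier."""
    return sp_tflops / DP_DIVISOR[tier]


if __name__ == "__main__":
    sp_peak = 4.0  # hypothetical single precision peak in TFLOPS
    for tier in DP_DIVISOR:
        print(f"{tier}: {peak_dp_tflops(sp_peak, tier):.2f} TFLOPS DP")
```

With the same silicon, the gap between the one half and one sixteenth settings is a factor of eight, which is why the setting matters so much for compute buyers and so little for gamers.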
Graphics performance is still the primary goal of this new architecture. There will still be plenty of fixed function units which have not changed in ages. ROPs and Z-units will remain, and it is quite likely their numbers will grow with each process shrink, thereby allowing more rendering power to push pixels to the screen. With performance sinks like Eyefinity and 3D, pixel fillrates are still very important.
We will see the first generation of parts come out in Q4 2011. These will be discrete GPUs. The current Llano CPUs are based on the older VLIW-5 architecture, and it looks like Trinity (Bulldozer + iGPU) will be based on the VLIW-4 architecture. After that though, we can expect the integrated parts to use this new architecture. This opens up a new realm of possibilities for AMD. One scenario discussed was that of physics acceleration. Instead of the dGPU doing both rendering and physics/compute work, the iGPU on the CPU would handle the physics and compute side. The iGPU would have the advantage of being located on the CPU, sharing the same memory controller, and accessing the main memory very quickly, as well as greater memory localization of the data. This would reduce latency by a significant degree as compared to the dGPU doing the same thing over the PCI-E bus. With the iGPU taking care of that work, the dGPU would be free to better handle other operations such as geometry and pixel shading or tessellation.
The first iterations of virtual memory would likely be featured on AMD-only platforms. Intel would have to buy into this concept, and allow it to work on their CPUs and platforms. Enabling this functionality with Intel processors would not be a simple driver addition. AMD is committed to this being an open architecture. It will be interesting to see if NVIDIA jumps on board. Currently NVIDIA does have a virtual memory mode for their GPUs, but it is not shared with the CPU, and it does not exist in the same virtual space.
This is a big deal for AMD. While they have had trouble keeping up with Intel on the CPU side, we can see that they have had no problems staying ahead with graphics. Their push towards heterogeneous computing is also shared with NVIDIA, and their combined efforts towards utilizing this functionality will benefit both in the long run. Intel is still more CPU-centric, but we are starting to see them taking a larger interest in this technology. The cancelled Larrabee project may have been misguided in terms of addressing the gaming and graphics market, but the parallel computing ramifications of this part and its extreme programmability hint at things to come.