AMD Fusion System Architecture Overview - Southern Islands GPUs and Beyond

Manufacturer: AMD

Introducing the AMD FSA

At AMD’s Fusion 11 conference, we were treated to a nice overview of AMD’s next generation graphics architecture.  With the recent change in their lineup from the previous VLIW-5 design (which powered their graphics chips from the Radeon HD 2900 through the latest “Barts” chip behind the HD 6800 series) to the new VLIW-4 layout (HD 6900), many were not expecting much from AMD in terms of new and unique designs.  The upcoming “Southern Islands” parts were widely assumed to be based on the current VLIW-4 architecture, offering more performance and a few new features thanks to the die shrink to 28 nm.  It turns out that speculation was wrong.


In late Q4 of this year we should see the first iteration of the new architecture that was detailed today by Eric Demers.  The overview covered some features that will not make it into this upcoming product, but they should all be added in over the next three years or so.  Historically, AMD has placed graphics first, with GPGPU/compute as the secondary function of its GPUs.  While AMD GPUs have had compute abilities since the Radeon X1800/X1900 series, AMD has not pushed compute as aggressively as its primary competition.  From the G80 onward, NVIDIA has pushed compute harder and farther than AMD has, and with its mature CUDA development tools and the compute-heavy Fermi architecture, NVIDIA has been a driving force in this particular market.  Now that AMD has released two APU-based products (Llano and Brazos), they are starting to really push OpenCL, DirectCompute, and the recently announced C++ AMP.

Continue reading for all the details on AMD's Graphics Core Next!

The dream for AMD is to provide just as much compute power as graphics power, all up and down their product lineup.  The idea of the onboard GPU acting as a very powerful co-processor for parallel workloads would give AMD a competitive edge against its larger rival, Intel.  But to achieve truly heterogeneous computing, a lot of work needs to be done at the silicon level as well as in OS support and the development environment.  With this latest architecture, AMD is taking the large and important steps required to get there.  The future for AMD is the Fusion System Architecture.

Goodbye VLIW, Hello Vector + Scalar

When NVIDIA introduced the G80, they took a fairly radical approach to GPU design.  Instead of continuing with the VLIW architectures that supported operations such as Vec4+Scalar, they went with a completely scalar architecture.  This bought them a combination of flexible operation types, easier scheduling, and high utilization of the compute units.  AMD has taken a somewhat similar, but still unique, approach with their new architecture.


Instead of going with a purely scalar setup like NVIDIA, they opted for a vector + scalar solution.  The new architecture revolves around the Compute Unit, which contains all of the functional units.  The CU can almost be viewed as a fully independent processor.  The unit features its own L1 cache, branch and MSG unit, control and decode unit, instruction fetch arbitration functionality, and the scalar and vector units.
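The CU as described can be sketched as a plain data structure.  This is purely illustrative: the field names, cache details, and the CU count below are our own stand-ins, not AMD's terminology or disclosed figures.

```python
from dataclasses import dataclass

@dataclass
class ComputeUnit:
    """Toy model of one FSA Compute Unit; names are ours, not AMD's."""
    vector_units: int = 4          # vector units do the heavy number crunching
    scalar_units: int = 1          # scalar unit handles pointer ops and branches
    has_l1_cache: bool = True      # private L1 cache per CU
    has_branch_msg_unit: bool = True
    has_control_decode: bool = True
    has_fetch_arbitration: bool = True

# Since each CU is nearly a fully independent processor, a GPU is then
# simply some number of them; the count per chip was not disclosed.
gpu = [ComputeUnit() for _ in range(4)]
print(sum(cu.vector_units for cu in gpu))  # -> 16 vector units in this toy GPU
```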

The vector units are the primary workers in the CU when it comes to crunching numbers.  Each CU contains four vector units, allowing four “wavefronts” to be processed at any one time.  Because AMD has stepped away from the VLIW-5/4 architectures and gone with a vector + scalar setup, we expect to see much higher utilization of each unit than before.  We also expect scheduling to be easier and more efficient, which will further improve performance.  The scalar unit is responsible for all of the pointer operations as well as branching code.  This setup harks back to the Cray supercomputers of the 1980s: the combination of scalar and vector processors suited the workloads of that era well, and it maps just as naturally onto the workloads AMD looks to address today.
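AMD's wavefronts have historically been 64 work-items wide; assuming that carries over, a 16-wide vector unit (also our assumption, since widths were not stated here) would step through one wavefront in four cycles.  A toy sketch of that arithmetic:

```python
def cycles_per_wavefront(wavefront_size=64, simd_width=16):
    """Cycles one vector unit needs to issue a single instruction
    across every work-item of a wavefront (ceiling division)."""
    return -(-wavefront_size // simd_width)

# Assumed figures: 64-wide wavefronts on a 16-wide vector unit.
print(cycles_per_wavefront())        # -> 4 cycles per wavefront instruction
print(cycles_per_wavefront(64, 64))  # a hypothetical 64-wide unit: 1 cycle
```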

The combination of these processors and the overall design of each CU give it the properties of several different types of machine.  It is MIMD (multiple instruction, multiple data) in that it can address four threads per cycle per vector unit, from different applications.  It acts as SIMD (single instruction, multiple data) much like the previous generations of GPUs.  Finally, it has SMT (simultaneous multi-threading) in that all four vector cores can be working on different instructions, with 40 waves active in each CU at any one time.  Furthermore, as mentioned in the slide, it supports multiple asynchronous and independent command streams.  Essentially, the unit can work on all kinds of workloads at once, no matter what the source of the data or instructions is.
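Those 40 resident waves are what let a CU hide memory latency: when one wave stalls on a load, the scheduler simply issues from another.  A toy round-robin issue model (the even 10-waves-per-vector-unit split is our assumption):

```python
def issue_schedule(n_waves, stalled, n_cycles):
    """Return which wave issues on each cycle, round-robin,
    skipping waves stalled on memory; None if nothing is ready."""
    order = []
    w = 0
    for _ in range(n_cycles):
        for probe in range(n_waves):
            cand = (w + probe) % n_waves
            if cand not in stalled:
                order.append(cand)
                w = cand + 1
                break
        else:
            order.append(None)  # every resident wave is waiting on memory
    return order

# 10 resident waves on one vector unit (40 per CU split across 4 units);
# waves 0 and 1 are waiting on loads, yet the unit never sits idle.
print(issue_schedule(10, {0, 1}, 4))  # -> [2, 3, 4, 5]
```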


AMD did not say how many CUs we will see in upcoming products, but we certainly expect the performance of these next-generation parts to overshadow the current 40 nm Cayman based cards.  They also did not detail how the CUs will be arranged, how they will interact with the texture units, or what we can expect from any new and interesting innovations in the ROPs.

Memory and Caches

One area that AMD did detail extensively was the changes to the internal caches, as well as their push for fully virtualized memory.  Each CU has its own L1 cache, divided into data, instruction, and load/store portions.  The GPU then has a shared L2 cache which is fully coherent.  Each L1 cache has a 64 byte per clock interface to the L2, and once this scales in terms of both CU count and GPU clock speed, we can expect to see multiple terabytes per second of bandwidth between the caches.  The L1 and texture caches are now read/write, as compared to the read-only units in previous architectures.  This is a big nod not only to efficiency and performance, but also to the type of caches needed for serious compute workloads.
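The "multiple terabytes per second" figure is easy to sanity-check.  Assuming a 64-byte-per-clock L1-to-L2 interface per CU, and plugging in a CU count and clock of our own choosing (AMD disclosed neither):

```python
def aggregate_l1_l2_bandwidth(cus, clock_ghz, bytes_per_clock=64):
    """Aggregate L1<->L2 bandwidth in GB/s across all CUs.
    GHz (1e9 cycles/s) times bytes/cycle gives GB/s directly."""
    return cus * clock_ghz * bytes_per_clock

# Hypothetical 32-CU part at 1 GHz: 32 * 1.0 * 64 = 2048 GB/s, i.e. ~2 TB/s,
# which lines up with the "multiple terabytes per second" claim.
print(aggregate_l1_l2_bandwidth(32, 1.0))  # -> 2048.0
```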

The next level of memory support is full virtualization of memory with the CPU.  Previous generations of products were limited to the memory and cache onboard each video card.  This posed limitations not just on graphics content, but was also problematic in compute scenarios: large data sets proved troublesome and required a memory virtualization scheme separate from the CPU’s virtual memory.  Adopting x86-64 virtual memory support on the GPU gets rid of many of these problems.  The GPU shares the CPU’s virtual memory space, which improves data handling and locality, and lets the GPU gracefully survive unhappy events like page faults and oversubscription.  This again is aimed at improving the programming model.  With virtual memory, the GPU’s state is not hidden, which should also allow for fast context switches as well as context switch pre-emption.  State changes and context switches can be quite costly, so in an environment that mixes graphics and compute workloads, these features should make things go much more smoothly and quickly, limiting the amount of downtime per CU.


It also opens up some new advantages for traditional graphics.  “Megatextures” which will not fit in a card’s frame buffer can be stored in virtual memory.  While not as fast as onboard memory, that is still far faster than loading the texture from the hard drive, which should allow for more seamless worlds.  I’m sure John Carmack is quite excited about this technology.

June 16, 2011 | 07:31 PM - Posted by bjv2370

wow this architecture is promising.

June 16, 2011 | 07:35 PM - Posted by Josh Walrath

It will be interesting to see how they organize the rest of the chip.  We know from a previous presentation at Fusion 11 that primitive setup will be pretty flexible, and different chips will have a different amount of units doing this setup.  Plus how many CUs will be put in a larger functional unit and how is that tied into texture units?  Lots and lots that they left uncovered.

June 17, 2011 | 06:17 AM - Posted by Tom Hammond (not verified)

this makes a lot of sense.

What AMD has essentially done is broken down the CPU into compute units that can either act as x86+AVX scalar units or as GPU compute units.

the vector units are 16-wide, which means they can be used as SIMD units for GPU tasks or as AVX-SIMD units for the x86 scalar unit.

It is likely that we will see the first 512-bit AVX implementation here (16-wide), as half the units would be idle if AMD stuck with a 256-bit AVX implementation (8-wide).

We can expect to see a very efficient use of silicon with this architecture as the CPU can tune the ratio of x86 vs compute units depending on the workload, potentially hundreds if not thousands of times per second.

- Tom Hammond

June 21, 2011 | 01:43 PM - Posted by nabokovfan87

From the 3rd-to-last and last slides it seems like everything is going to be an "apu" now, even though it really isn't. Makes me wonder how confusing the future will be, but this is why AMD is a very cool company. They don't rest on their laurels; they do things, they change things, they try things, which makes it all a lot more interesting to keep track of.

I wonder how much of a boost these cards will see. I know the die shrink helps, but this whole new architecture with the cache and memory could either help across the board or only help in things like folding where it is calculation intensive. Who knows. Fingers crossed, and glad as heck that I waited out the 5xxx and 6xxx series.

June 25, 2011 | 12:52 AM - Posted by ThatTankIsBillsBrother (not verified)

I agree. I still have a 775 CPU from Intel. If AMD can pull this off with good real-world numbers, I will switch over to the new hybrids. It all comes down to high gaming numbers though. I go with whichever company has the best scores, whether it's cheaper or not. I do have a 5850 in my system at the moment, and it's one of the best GPUs I have ever had.

July 10, 2011 | 05:24 AM - Posted by Anno (not verified)

I'm really curious about how it will all turn out. Does anyone have a clue yet about density, for instance? I'm not much of a chip engineer, but the main advantages of VLIW 5 and 4 to me seemed that they could cram so many ALUs into their chips. In getting rid of that, will they be able to maintain a similar performance per mm^2? Will the increased utilisation be enough?

November 7, 2011 | 05:40 PM - Posted by CyberAngel (not verified)

Actually a CU *IS* VLIW 4/5
if you want to look at it that way.

IOMMU support from M$, CPU/GPU virtual memory in Fermi 2,
AND that would then also force Intel to follow.
It looks like NVIDIA is missing a license for x86,
BUT at least Tegra 3 is a winner.
The future looks very exciting,
and I'm waiting for Haswell to be my future CPU,
but what will my dGPU be?
