Bulldozer at ISSCC 2011 - The Future of AMD Processors

Subject: Processors
Manufacturer: AMD

AMD at the ISSCC 2011

The more I read about Bulldozer, the more impressed I become.  AMD has barely been able to keep up with Intel for the past six years, which is disappointing considering the success it found with the original Athlon and then the Athlon 64.  Since Intel introduced the Core 2 series, AMD has only been able to design and sell processors that would sometimes achieve around 90% of the overall performance of Intel’s top-end parts.  AMD has also suffered from die size and performance-per-watt disparities compared to Intel’s very successful Core 2 and Core i7/i5/i3 processors.  The latest generation of Sandy Bridge based parts again exposed how far behind AMD’s Phenom family of chips had fallen in overall design and performance.

Before we all go off the deep end and claim that AMD will surpass Intel in overall performance, we need to calm down.  We are getting perilously close to the limits of IPC (instructions per clock) with current technology and designs.  It appears to me that with the Bulldozer architecture, AMD should reach parity with Intel’s latest generation of CPUs when it comes to IPC per core.  Where AMD could have an advantage over Intel is in several specific categories.

Fetch, decode, L2 cache, and a beefy floating point/SIMD unit are all shared.  Since the majority of CPU work is integer based, AMD has implemented two complete integer units per CPU module.

At this year’s ISSCC, AMD presented several papers on the Bulldozer architecture, covering a handful of the features the new core will bring to AMD’s lineup in depth.  The first is essentially a refresh of the overall architecture information we were given last fall.  The second looks at the changes to the schedulers which feed the integer execution units.  The final paper covers power saving techniques.

A Clean Sheet Design

Bulldozer carries over very little from the previous generation of CPUs, except perhaps the experience of the engineers working on these designs.  Since the original Athlon, the basic floor plan of AMD’s CPU architecture has remained relatively unchanged.  Certainly there were significant changes throughout the years to keep up in performance, but the 10,000 foot view of the decode, integer, and floating point units stayed very similar.  TLBs increased in size, more instructions could be kept in flight, and so on.  Aspects such as larger L2 caches, integrated memory controllers, and the addition of a shared L3 cache all brought improvements to the architecture.  But the overall data flow is still very similar to that of the original Athlon, introduced some twelve years ago.

As covered in our previous article about Bulldozer, it is a modular design which will come in several flavors depending on the market it addresses.  The basic building block of the Bulldozer design is a 213 million transistor module.  This block contains the fetch and decode unit, two integer execution units, a shared 2 x 128 bit floating point/SIMD unit, L1 data and instruction caches, and a large shared 2 MB L2 cache.  All of this is manufactured on GLOBALFOUNDRIES’ 32 nm, 11 metal layer SOI process.  The entire module, L2 cache included, occupies approximately 30.9 mm² of die space.
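Those two figures imply a rough transistor density for the module; this is just back-of-the-envelope arithmetic on the numbers quoted above, not an official AMD figure:

```python
# Back-of-the-envelope density for one Bulldozer module,
# using the transistor count and die area quoted above.
transistors = 213e6   # transistors per module, including the 2 MB L2
area_mm2 = 30.9       # module die area in mm^2

density = transistors / area_mm2
print(f"~{density / 1e6:.1f} million transistors per mm^2")
# -> ~6.9 million transistors per mm^2
```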

It is well known that Bulldozer embraces the idea of “CMT”, or clustered multi-threading.  While Intel supports SMT on its processors, it is not the most efficient way of doing things.  SMT sends two threads to the same execution unit in an attempt to maximize the work being done by that unit; essentially, fewer cycles are wasted waiting for new instructions or resultant data.  AMD instead chose to implement multi-threading in a different way.  For example, a Bulldozer processor comprised of four modules will have eight integer execution units and four shared 2 x 128 bit floating point/SIMD units.  This allows the OS to see the chip as an eight core part.

CMT balances die space and threading performance seemingly much better than SMT (a second thread scales to around 1.8x the throughput of a single core, compared to around 1.3x with SMT) and CMP (chip multi-processing, where each duplicated core may not be entirely utilized, and the die cost of replicating entire cores is much higher than with CMT).  This balance of performance and die savings is the hallmark of the Bulldozer architecture.  AMD went through and determined which structures could be shared, and which needed to be replicated in each module.  CMT apparently only increases overall die space by around 5% in a four module unit.
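The appeal of CMT becomes clearer if you look at throughput gained per unit of extra die area.  The scaling factors below come from the figures above; the ~5% area cost for SMT and the 100% area cost for CMP (a full duplicated core) are illustrative assumptions, not published numbers:

```python
# Comparing threading approaches by throughput gained per die area spent.
# The 1.8x (CMT) and 1.3x (SMT) scaling factors are from the article;
# the area costs for SMT and CMP are illustrative assumptions.
approaches = {
    # name: (throughput vs. one core, extra die area vs. one core)
    "SMT": (1.3, 0.05),   # assumed ~5% area for a second hardware thread
    "CMT": (1.8, 0.05),   # AMD's quoted ~5% die area increase
    "CMP": (2.0, 1.00),   # full second core: ~100% more area (assumption)
}

for name, (speedup, extra_area) in approaches.items():
    gain_per_area = (speedup - 1.0) / extra_area
    print(f"{name}: +{speedup - 1.0:.0%} throughput for +{extra_area:.0%} area "
          f"-> {gain_per_area:.1f}x gain per unit of area")
```

Under these assumptions CMT returns 16x its area investment in throughput, versus 6x for SMT and 1x for CMP, which is exactly the tradeoff AMD is advertising.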

A closer look at the units reveals some nice details.  Note the dual MMX (SIMD-Integer) units in the FP/SIMD block.  A lot of work has been done on the front end to adequately feed the three execution units.

Gone is the three pipeline integer unit of the Athlon.  Bulldozer uses a new four pipeline design which further divides the workload being asked of it: two ALU pipes (handling multiply and divide) and two address generation units.  Each integer unit is fed by its own integer scheduler.  The decode unit which feeds the integer units and the floating point unit has also been significantly beefed up.  And it had to be: it is now feeding far more execution units than ever before.  The original Athlon had a decode unit comprised of three complex decoders.  The new design features four decoders, but we are so far unsure how the workload is divided among them.  For comparison, the Core 2 also had four decoders, three of which were simple and the fourth complex.  My gut feeling is that we are probably looking at three decoders which can handle 80 to 90% of standard instructions, while the fourth handles the more complex instructions which must be converted into more than one macro-op.  While this sounds similar to the Core 2 architecture, it does not necessarily mean the same thing.  It all depends on the complexity of the macro-ops being sent to the execution units, and how those are handled.

The floating point unit is also much more robust than it used to be.  The Phenom had a single 128 bit unit per core; Bulldozer provides 2 x 128 bit units per module, which can be combined to act as a single 256 bit unit when running AVX code.  There are some performance limitations there compared to the Intel CPUs which support AVX, and in those cases Intel should be faster.  However, AVX is still very new, with very little software support.  AMD will have an advantage over Intel when running SSE based code: the unit can perform 2 x 128 bit operations, or up to 4 x 64 bit operations, while Intel looks to support only 1 x 128 bit operation or 2 x 64 bit operations.  The unit officially supports SSE3, SSE 4.1, SSE 4.2, AVX, and AES.  It also supports fused multiply-add/accumulate operations, something that has not been present in previous generations of AMD CPUs.
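Taking the multiply-accumulate capability at face value, peak single-precision throughput for one module's FP unit can be sketched with simple arithmetic.  The 3.5 GHz clock below is a placeholder, not an announced specification:

```python
# Peak single-precision FLOPS for one module's shared 2 x 128-bit FP unit,
# assuming fused multiply-add on every lane every cycle.
# The 3.5 GHz clock is a placeholder, not an announced figure.
clock_hz = 3.5e9
fp_pipes = 2                  # two 128-bit pipes per module
lanes_per_pipe = 128 // 32    # four single-precision lanes per 128-bit pipe
flops_per_fma = 2             # a fused multiply-add counts as two FLOPs

peak_flops = clock_hz * fp_pipes * lanes_per_pipe * flops_per_fma
print(f"~{peak_flops / 1e9:.0f} GFLOPS peak single precision per module")
# -> ~56 GFLOPS peak single precision per module
```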

In terms of overall performance, a Bulldozer based processor should be able to outperform a similarly clocked Intel processor featuring the same number of threads when fully utilized.  Unfortunately for AMD, very few workloads will max out a modern multi-core processor, so Intel should have a slight advantage in single threaded and lightly threaded applications.  AMD looks to offset that advantage by positioning higher clocked processors against slower clocked Intel parts.  This could mean that a quad core i7 running at 3.2 GHz would be the price competitor for a four module Bulldozer running at 3.5 GHz.
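The clock-versus-IPC tradeoff behind that positioning is simple arithmetic: single-thread performance is roughly IPC times clock speed, so a higher clock can offset a modest IPC deficit.  Both relative-IPC figures below are hypothetical, chosen only to illustrate the math:

```python
# Illustrating how a higher clock can offset an IPC deficit.
# Both relative-IPC figures are hypothetical, for illustration only.
intel_ipc, intel_clock = 1.00, 3.2e9   # baseline part at 3.2 GHz
amd_ipc,   amd_clock   = 0.95, 3.5e9   # 5% lower IPC, but clocked at 3.5 GHz

intel_perf = intel_ipc * intel_clock
amd_perf = amd_ipc * amd_clock

print(f"Relative single-thread performance: {amd_perf / intel_perf:.2f}x")
# -> Relative single-thread performance: 1.04x
```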

Exact specifications have not been released for the individual parts, but we can infer a few things here.  First, it appears that each module will have 2 MB of L2 cache.  This is quite a bit of cache, especially considering that the current Phenom II processors feature 512 KB of L2 cache per core.  Part of what makes this possible is buried in GLOBALFOUNDRIES’ 32 nm SOI process: they were apparently able to shrink the SRAM cell size significantly from that of the previous 45 nm process, while also allowing it to clock quite a bit higher.  This should allow more headroom for the individual cores.  With the shrink, we should also expect to see at least 8 MB of shared L3 cache, potentially clocked higher than the 2 GHz of the current L3 caches.