AMD Brazos and Zacate Architecture Preview - Bobcat Explored
The Bobcat Architecture Overview
Low power is the future of computing. It is already a significant mover in the industry today as the need for more powerful, yet energy efficient, devices keeps growing. Two huge factors are pushing demand for these devices. The first is the smartphone market, which requires processors whose power draw is measured in milliwatts, but which can still run complex programs and deliver a compelling visual experience at the same time. The other is near ubiquitous connectivity to the internet, and the need to actually be productive or be entertained without having to carry a large and unwieldy notebook.
A high level overview of the Bobcat architecture. One giant step away from the classic Athlon...
This second reason is the focus of our coverage on AMD’s upcoming Ontario and Zacate APUs. While smartphones are helping to push the market, more traditional computing resources are still needed for a large portion of people working in the corporate world, as well as those from other walks of life who need (and want) to be connected and have the use of a fully featured notebook. AMD is making a gigantic push into the ultra-portable and low power markets, and their strategy to get there seems like a solid one.
Pure processing power is no longer holding most people back in day to day applications. It only takes so much CPU power to open a spreadsheet and recalculate a few worksheets. Where the industry is going instead is toward a more visually rich computing environment, one which taps into the primary way that we humans receive information from the world around us. We must also look at the first steps towards heterogeneous computing, which combines the strengths of serial computing (traditional CPU) with those of parallel computing (GPU). AMD has its first generation of APU processors ready to go, and we take our first real look at them.
A New Processor Architecture
The latest Phenom II and Opterons from AMD are all based on the same basic architecture which powered the original Athlon back in the late ’90s. This is not to say that they are identical, because some significant changes have taken place over the past decade. The architecture of today is far more powerful and scalable than the one introduced way back in 1999. But when viewed at a high level of abstraction, there are striking similarities between the old and the new.
The top line represents the clock cycles. They don't call it "out of order" for nothin...
The general layout of these processors starts with the large L1 I and D caches, which are each 64 KB in size. The L2 cache has been on-die since the “Thunderbird” revision on the 180 nm process, ranging from 256 KB up to 1 MB in some current processors. There are three complex decoders that feed three integer units as well as three separate floating point and SSE units. Branch prediction has received several makeovers through the years, but has typically been on the back burner as compared to unveiling newer and more powerful features (such as the 128 bit wide FP-SIMD units in the Phenom). Integration of HyperTransport links and an integrated memory controller round out the major changes we have seen throughout the lifespan of this architecture. But again, much of the data flow and the larger features have gone relatively unchanged, considering how long the basic architecture has been around.
The old architecture could be scaled down into lower power applications by limiting the clockspeed and working with the process technology to limit leakage as much as possible. But that only goes so far, and soon these types of gains hit a wall as the processor will still require a certain amount of power to function at any clockspeed. We also hit a wall in terms of workloads, as large portions of the processor are underutilized, yet still draw power. This can be mitigated through aggressive clock gating, but this impacts the transistor budget as more transistors are needed to enable this functionality, which then takes away the budget for actual computing resources.
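To put rough numbers on that wall, here is a minimal sketch using the textbook first-order power model: switching power proportional to capacitance times voltage squared times frequency, plus a fixed leakage floor. The function name and every value in it are hypothetical and chosen purely for illustration; none of them are measurements of an AMD part.

```python
def core_power_watts(c_eff_nf, volts, freq_ghz, leakage_w):
    """First-order estimate: dynamic power (C * V^2 * f) plus a leakage floor."""
    dynamic_w = (c_eff_nf * 1e-9) * volts ** 2 * (freq_ghz * 1e9)
    return dynamic_w + leakage_w


# Hypothetical values, for illustration only (not AMD data).
full_speed = core_power_watts(c_eff_nf=10.0, volts=1.2, freq_ghz=2.0, leakage_w=3.0)
half_speed = core_power_watts(c_eff_nf=10.0, volts=1.2, freq_ghz=1.0, leakage_w=3.0)

print(f"full clock: {full_speed:.1f} W, half clock: {half_speed:.1f} W")
```

With these made-up inputs, halving the clock cuts power by well under half, because the leakage term does not scale with frequency. Real designs also lower voltage along with clockspeed, which this sketch deliberately leaves out.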
AMD focused on creating an architecture that was powerful, but could share resources more effectively, yet still achieve very low power consumption. What we see today is a radically different architecture from AMD than anything they have attempted before. This will not be a powerhouse CPU, but it is one that can provide adequate performance along with tremendous power savings. There are two official product families based on the architecture, and they are the 18 watt TDP Zacate (high performance) and the 9 watt TDP Ontario (extreme low power). What is perhaps most impressive about these products is that the TDPs described also include a robust GPU attached to the CPU. Considering that most of today’s integrated graphics chipsets consume 10 to 15 watts TDP by themselves, it is impressive that AMD is providing both a CPU and a GPU within those power ranges.
Note how much die space the I-cache, Inst TLB/Tag, and Branch Predict take up as compared to the integer and floating point units.
AMD put a lot of work into making a power efficient, yet flexible, branch predictor unit. The thinking behind this is that if a better prediction unit is created, then the processor will do less work running down invalid branches. While the branch predictor is doing its work, it only clocks the structures necessary for the job at hand. Throughout the entire chip, AMD instituted a very aggressive clock gating scheme, and no portion is overlooked.
Taking a step away from the Athlon architecture, AMD implemented a dual decoder stage, rather than their more traditional trio of complex decoders. It can decode two x86 instructions per clock, and 89% of all x86 instructions can be passed as a single micro-op, while another 10% can be expressed as two micro-ops. The final 1% of instructions (the uncommon ones) are then microcoded. The decoder feeds both the integer and floating point/SSE units.
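As a back-of-the-envelope check on what that mix means, the sketch below computes the average number of micro-ops per x86 instruction. It treats the quoted percentages as shares of executed instructions, which is our reading rather than something AMD spelled out. The micro-op count for a microcoded instruction is not given in the presentation, so it is left as a parameter and the value of 10 is purely a placeholder.

```python
def avg_uops_per_instruction(microcoded_uops):
    """Weighted average of micro-ops per x86 instruction for the quoted mix."""
    mix = [
        (0.89, 1),                # single micro-op instructions
        (0.10, 2),                # instructions that split into two micro-ops
        (0.01, microcoded_uops),  # microcoded instructions (length is assumed)
    ]
    return sum(share * uops for share, uops in mix)


# Placeholder length of 10 micro-ops per microcoded instruction (an assumption).
average = avg_uops_per_instruction(microcoded_uops=10)
print(f"about {average:.2f} micro-ops per instruction")
```

Even with a fairly long microcode sequence assumed, the average stays close to one micro-op per instruction, which helps explain why a two-wide decoder is a reasonable fit for this design.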
Bobcat features two ALUs, which are fed from a dual ported scheduler. There are also separate load and store units, fed by a dual ported address scheduler. In more traditional CPUs, a lot of power is used shuffling data throughout the CPU. By using a Physical Register File, much of that data movement is avoided, thereby increasing power efficiency with a minimal loss of performance. On the floating point side there are two floating point units. One is a floating point add unit, and the other is a floating point multiplication unit. Each unit can perform two single precision operations per clock (e.g., 2 adds and 2 muls), and the MMX and logical units are replicated in each stack. This is a much simpler FPU as compared to the current Phenom II and the upcoming Bulldozer, but it is still powerful enough to motor through most desktop applications which require any kind of FP or SSE work.
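Those per-clock figures translate into a theoretical peak fairly directly. The sketch below multiplies them out; the 2 adds and 2 muls per clock come from the description above, while the clock speed and core count passed in are hypothetical, since AMD has not announced final clocks.

```python
ADDS_PER_CLOCK = 2  # single precision adds per clock, per the description above
MULS_PER_CLOCK = 2  # single precision multiplies per clock, per the description above


def peak_sp_gflops(clock_ghz, cores=1):
    """Theoretical peak single precision GFLOPS; real code will land below this."""
    return (ADDS_PER_CLOCK + MULS_PER_CLOCK) * clock_ghz * cores


# Hypothetical configuration: the clock speed and core count are assumptions.
print(f"{peak_sp_gflops(clock_ghz=1.6, cores=2):.1f} GFLOPS peak")
```

This is an upper bound that assumes both units are kept busy every clock, which real workloads rarely manage.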
The load/store, D-cache, and prefetch units also received a lot of work to maximize per clock performance while remaining power efficient. The prefetch unit is an advanced 8 stream unit. The load/store unit is out of order, so it can make the most of every clock by loading and storing data without having to wait for results from previous instructions.
Finally we have the L2 cache, which is a roomy 512 KB unit that is 16 way set associative. It runs at half the core clockspeed, which saves a huge amount of power. Typically caches are the greatest source of power consumption and heat production in a modern CPU, as those transistors have to be powered and clocked in order to preserve the data held within them. Plus, that data is constantly being accessed, written, and erased. By taking a relatively small performance hit, AMD is saving a significant amount of power.
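For readers who want to see how the cache size and associativity fit together, here is a small sketch of the arithmetic. The 512 KB size and 16 ways come from AMD's description; the 64 byte line size is our assumption (it is typical for x86 processors, but was not stated in the presentation).

```python
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set associative cache."""
    total_lines = size_bytes // line_bytes
    return total_lines // ways


L2_SIZE_BYTES = 512 * 1024  # from the article
L2_WAYS = 16                # from the article
LINE_BYTES = 64             # assumed line size, not confirmed by AMD

sets = cache_sets(L2_SIZE_BYTES, L2_WAYS, LINE_BYTES)
index_bits = sets.bit_length() - 1  # exact log2, since sets is a power of two
print(f"{sets} sets, {index_bits} index bits")
```

Under that assumption, each address maps to one set, and its line can sit in any of the 16 ways within it.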
The memory controller on these chips is shared between the CPU cores and the graphics portion. As such it was not covered in the Hot Chips presentations from AMD.
When it comes to graphics, we seem to know even less so far. We do know that it will be a DX11 based unit, and that it should offer about double the overall performance of the current integrated solutions around the industry. We also know that it will be more flexible and programmable for OpenCL and GPGPU applications, thanks to both the design and the software work that AMD is putting into it. Once we get closer to the official launch, AMD will cover these areas in much more detail.