AMD Unveils Steamroller Improvements

Author: Josh Walrath
Subject: Processors
Manufacturer: AMD

Steamroller in Slightly Better Focus


Steamroller looks to improve upon what Piledriver has brought to market.  Unfortunately for those interested in the nuts and bolts of how this is accomplished, Mr. Papermaster keeps this particular overview very general.  We do not know the exact details, or which changes to current structures will be made to deliver these performance improvements.  The first slide shows that AMD will not radically change the overall design of Steamroller.  It will still feature the shared fetch and decode portions, the two integer execution units, and the single shared floating point/MMX/SSE/AVX structure.  The goals are to feed the cores faster, improve single threaded IPC, and continue to push performance per watt.

The next slide gives us some decent information.  First off, the fetch unit will receive some significant upgrades.  Branch prediction will be improved so that mispredicts are less frequent, and the instruction caches will be bigger.  One of the big complaints about Bulldozer was the relatively small L1 data and instruction caches.  These will be larger, and their latency will be improved.  The next big improvement will be dedicated decode units for each integer execution unit.  These decode units will also service the floating point unit, depending on thread priority.  AMD estimates these changes will lead to a 30% improvement in operations per cycle.
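
To put the 30% figure in rough perspective, delivered throughput is simply operations per cycle multiplied by clock speed.  The Python sketch below is only back-of-the-envelope arithmetic: the baseline of 1.0 operations per cycle and the 4.0 GHz clock are made-up illustrative numbers, not AMD figures, and the function name is our own.

```python
def throughput(ops_per_cycle, clock_ghz):
    """Billions of operations per second = ops per cycle x clock in GHz."""
    return ops_per_cycle * clock_ghz


# Hypothetical baseline: 1.0 ops/cycle at 4.0 GHz (illustrative, not AMD data).
BASE_OPS_PER_CYCLE = 1.0
CLOCK_GHZ = 4.0

baseline = throughput(BASE_OPS_PER_CYCLE, CLOCK_GHZ)
# Apply the 30% ops-per-cycle improvement AMD estimates, at the same clock.
improved = throughput(BASE_OPS_PER_CYCLE * 1.30, CLOCK_GHZ)

# Clock at which the improved design would only match the baseline;
# anything above this is a net gain.
break_even_ghz = baseline / (BASE_OPS_PER_CYCLE * 1.30)

print(f"speedup at equal clocks: {improved / baseline:.2f}x")
print(f"break-even clock: {break_even_ghz:.2f} GHz")
```

The full gain only carries over if clock speeds hold, and AMD has not yet said what clocks to expect.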

Single core execution looks to get a boost by essentially getting data to the execution units faster.  This includes better scheduling and more register resources without increasing latency.  The larger L1 d-cache will not just be bigger, but will handle data cache misses better and will have major improvements in store handling.  In the past AMD has talked candidly about where some of the weaknesses of Bulldozer were, and this is certainly an area that is receiving major improvements.

Steamroller will also receive a good performance per watt boost with the new design.  One area of distinct concern is floating point efficiency.  The current design has a unit that is shared, but is essentially single threaded.  Even though it has enough execution units to do 2 x 128 bit SSE4 operations, if the thread does not require two operations then the other unit is left idle.  With the dual decode units feeding the floating point unit, it appears as though AMD might be working on better interleaving of threads to improve efficiency and increase utilization of floating point resources.  This theoretically should allow the unit to finish work more quickly and then go to sleep sooner.  This is speculation on my part, as AMD is not giving us nearly the information that we require (or at least desire).  L2 cache can also be resized depending on workload, so it does appear as though a portion of the L2 cache can be put into a sleep mode if it is not required.  This again saves power.
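
The interleaving idea can be illustrated with a toy model.  The Python sketch below is not AMD's scheduler and is not based on any disclosed detail.  It assumes two FP pipes per module, two threads whose issue groups each need one or two pipes, and a hypothetical `run` function of our own naming that counts how many cycles it takes to drain the work with and without filling the idle pipe from the other thread.

```python
from collections import deque

PIPES = 2  # two 128-bit FP pipes per module, as described in the article


def run(threads, interleave):
    """Return (cycles, utilization) for draining every thread's work.

    threads: list of lists; each inner list gives the number of FP ops
             (1 or 2) in that thread's successive issue groups.
    interleave: False -> one thread owns the FPU each cycle, spare pipe idles.
                True  -> a spare pipe is filled from the other thread.
    This is a toy model under our own assumptions, not AMD's design.
    """
    queues = [deque(t) for t in threads]
    total_ops = sum(sum(t) for t in threads)
    cycles = 0
    turn = 0
    while any(queues):
        # Round-robin: skip threads that have already finished.
        while not queues[turn % len(queues)]:
            turn += 1
        owner = queues[turn % len(queues)]
        free = PIPES - owner.popleft()
        if interleave and free > 0:
            for other in queues:
                if other is not owner and other and other[0] <= free:
                    free -= other.popleft()
                    break
        cycles += 1
        turn += 1
    return cycles, total_ops / (cycles * PIPES)


# Two threads, each with four issue groups that need only one pipe.
work = [[1, 1, 1, 1], [1, 1, 1, 1]]
print(run(work, interleave=False))  # (8, 0.5)
print(run(work, interleave=True))   # (4, 1.0)
```

In this contrived case the interleaved unit finishes in half the cycles, which is the "finish sooner, sleep sooner" effect speculated about above.  Real workloads would mix one-op and two-op groups and gain less.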

One new area of interest for us when it comes to next generation architecture is that of the design tools that AMD is implementing.  Taking a big page from the GPU guys, AMD is utilizing more and more automated place and route.  This brings both advantages and disadvantages for any design.  Typically hand placed designs (aka custom cells) can be more dense and, because their power properties are well known, tend to be more efficient and can be clocked to higher speeds.  Automated place and route typically depends on “standard cell” designs which are easier to fit together, but typically take up more space, use more power, and clock much lower.  The “High-Density” cell library that AMD is using for some of their graphics work is now being applied to the CPU.  The example they use is that the HD cell library decreases the size of the functional unit by 30% compared to the hand laid out version.  AMD claims that it not only saves die space, but is also more power efficient.  AMD did not comment on clocking ability though.

Finally AMD showcases their latest work with the SeaMicro folks.  SeaMicro was purchased by AMD and is well known for their dense and efficient server solutions.  These solutions typically utilize low TDP processors, but feature a unique interconnect fabric that is wide, powerful, and energy efficient.  AMD will be introducing Opteron based units utilizing this fabric technology.  The board, as shown here, is very small yet still very powerful.  Eventually AMD will implement Steamroller based APUs in the Opteron environment, and these products will be a primary focus for the SeaMicro fabric technology.

AMD is obviously not standing still, and they are working to improve the architecture that was introduced with Bulldozer.  AMD is doing things a bit differently this time.  Instead of focusing on large, complex jumps in performance with long time frames between architectures, they are working in smaller steps to improve not just process technology, but overall design and implementation.  We do not know if this will help AMD catch up to Intel in overall performance and power consumption, but considering the size of AMD and their R&D budget, this approach may allow them to stay more competitive without the risk of making a huge mistake with a comprehensive blank slate design.  It at least allows them to see what works, what can perform better, and what simply drags the design down.  With this information in hand, AMD can more quickly turn around and address issues with a particular design.

According to sources, AMD expects to introduce the first Steamroller parts in early 2H 2013.

August 29, 2012 | 09:22 AM - Posted by Crickets Chirping (not verified)

Barcelona. Bulldozer. ...

Does anyone know if USC plays Notre Dame this year?

September 11, 2012 | 10:39 AM - Posted by Anonymous (not verified)

Uhh...No. Do you know how many pumpkins it takes to make a pie?

August 29, 2012 | 10:27 AM - Posted by Anonymous Coward (not verified)

I do not remember ever having heard complaints about Bulldozer's thermals or clocking. All that stuck in my head was they had their parallel designs done well, which is arguably the harder/more important part to optimize first, but their IPC needed to get trimmed down to make their design really shine.

August 29, 2012 | 11:01 AM - Posted by Josh Walrath

The original design was expected to be a 4 module unit hitting around 4 GHz at 125 watts.  They were never able to get there.  After about 3.8 GHz in that 4 module part, TDPs started to jump really dramatically.  If you remember my FX-6200 review, power at the wall socket jumped up 100 watts when that chip was clocked from 3.8 GHz to 4.6 GHz.  The chips just were never able to hit the clockspeed/TDP targets that would have made them more competitive with not just Intel parts, but also their previous Phenom II products.

September 11, 2012 | 10:36 AM - Posted by Anonymous (not verified)

Staying below TDP is a relative term. 61C is the max Temp at 124watts for a 8150. The processor runs at 3.6, turbo all 8 cores to 3.9 and then 4 cores to 4.2.
However, with no voltage bump, my 8150 runs at 4.2 (turbo off) and a cool 44.8C under full load, 24/7. That's 16C (about 29F) below the 61C maximum, indicating a lot of headroom to TDP.
99.9% of the Overclockers use aftermarket cooling, that 4.2ghz is on a tower air cooler.
As far as watts, my video card bumps 100 more watts as well. So how many watts used is not the same as saying "it won't get there within TDP". Obviously it will and within TDP temp of (61C).
So I don't understand your statement. Unless you were referring to some insane Overclock target? AMD's current target is above any intel processor and far below on the cost.
The "4 modules" you refer to is an 8 core, whereas the FX-6200 you refer to is a 6 core.
I think if you compare $ targets first you see AMD as the leader. If you have more funds than most, you can buy a Intel, IBM or perhaps a Cray.
To clarify: TDP is primarily used as a guideline for manufacturers of thermal solutions (heatsinks/fans, etc)which tells them how much heat their solution should dissipate. TDP is not the maximum power the CPU may generate - there may be periods of time when the CPU dissipates more power than designed, in which case either the CPU temperature will rise closer to the maximum, or special CPU circuitry will activate and add idle cycles or reduce CPU frequency with the intent of reducing the amount of generated power.

TDP is usually 20% - 30% lower than the CPU maximum power dissipation.
In any case I would not expect $200 processor to compete with a $300 to $1000 processor no matter who makes it.

August 30, 2012 | 12:45 AM - Posted by rishidev (not verified)

But where are the Piledriver CPUs?

August 30, 2012 | 10:49 AM - Posted by Josh Walrath

All the leaked docs say Q4 2012.  So Q4 starts in October.

August 31, 2012 | 10:35 AM - Posted by Anonymous (not verified)

This looks like a piece of wood with some yellow and brown paint on it.

September 11, 2012 | 09:53 AM - Posted by Anonymous (not verified)

I don't know about the watts at the plug but my 8150 runs at 4.2ghz and 44.8C. That is with a tower air cooler.

September 28, 2012 | 01:21 PM - Posted by Anonymous (not verified)

Yeah, mine is running a stable 4.6ghz for going on a month now, it has run at as much of a full load as possible, using Crysis 2 at ultra settings, and Guild Wars 2 and everything maxed, with super-sample enabled (a CPU intense game). My temps never go beyond 52c

September 11, 2012 | 09:54 AM - Posted by Anonymous (not verified)

Sorry, that's at max load on all 8 cores.

October 8, 2012 | 08:19 AM - Posted by Nick (not verified)

My FX-8120 runs at 4.40GHz on all 8 cores and at full load never breaks past 53C. Piledriver improvements should be better than what Phenom II was to Phenom I, with Steamroller going beyond what we think given a claimed +45% clock for clock performance improvement.
