AMD A8-7600 Kaveri APU Review - HSA Arrives

Subject: Processors
Manufacturer: AMD

The AMD Kaveri Architecture

Kaveri: AMD’s New Flagship Processor

How big is Kaveri?  We already know the die size of it, but what kind of impact will it have on the marketplace?  Has AMD chosen the right path by focusing on power consumption and HSA?  Starting out an article with three questions in a row is a questionable tactic for any writer, but these are the things that first come to mind when considering a product the likes of Kaveri.  I am hoping we can answer a few of these questions by the end of this article, but alas it seems as though the market will have the final say as to how successful this new architecture is.

AMD has been pursuing the “Future is Fusion” line for several years, but it can be argued that Kaveri is truly the first “Fusion” product that completes the overall vision for where AMD wants to go.  The previous several generations of APUs were initially not all that integrated in a functional sense, but the complexity and completeness of that integration have improved with each iteration.  Kaveri takes this integration to the next step, one which fulfills the promise of a truly heterogeneous computing solution.  While AMD has the hardware available, we have yet to see if the software companies are willing to leverage the compute power afforded by a robust and programmable graphics unit powered by AMD’s GCN architecture.

(Editor's Note: The following two pages were written by our own Josh Walrath, discussing the technology and architecture of AMD Kaveri.  Testing and performance analysis by Ryan Shrout starts on page 3.)

Process Decisions

The first step in understanding Kaveri is taking a look at the process technology that AMD is using for this particular product.  Since AMD divested itself of their manufacturing arm, they have had to rely on GLOBALFOUNDRIES to produce nearly all of their current CPUs and APUs.  Bulldozer, Piledriver, Llano, Trinity, and Richland based parts were all produced on GF’s 32 nm PD-SOI process.  The lower power APUs such as Brazos and Kabini have been produced by TSMC on their 40 nm and 28 nm processes respectively.

Kaveri will take a slightly different approach here.  It will be produced by GLOBALFOUNDRIES, but it will forgo the SOI and utilize a bulk silicon process.  28 nm HKMG is very common around the industry, but few pure play foundries were willing to tailor their process to the direct needs of AMD and the Kaveri product.  GF was willing and able to do so.  APUs are a different kind of animal when it comes to fabrication, primarily because the two disparate units require different characteristics to perform at the highest efficiency.  As such, compromises had to be made.

GPUs perform best using high density transistors running at lower speeds, as more parallel units can be packed into a chip.  The lower clock speeds are not necessarily a hindrance to these massively parallel processors, so the focus is primarily that of maximizing transistor count to die space.  CPUs on the other hand seem to work better with more spacing between transistors and being able to run at a higher clock speed without breaking any power and TDP envelopes.  These are generalizations, but the truth of the matter is that CPUs and GPUs are very different beasts when it comes to design considerations at a very low level.

The 28 nm bulk/HKMG process at GF is more of a compromise that is optimized for good performance for both the GPU and CPU.  It offers good enough density and good enough speed to make for a competitive product in the marketplace.  It is a bit more biased towards the GPU portion, as the CPU takes a hit when it starts to run at the higher TDPs.  So at 95 watts, the CPU portion of Kaveri is running as fast as it can while being constrained by TDP concerns.  Even though 28 nm HKMG in theory should offer a little more headroom than the previous 32 nm PD-SOI based process, in the end Kaveri will run oh-so-slightly slower than the previous generation Richland in terms of raw CPU clockspeed.  The GPU portion will run significantly slower than the previous VLIW4 based part in Richland.  These are not necessarily bad things, because the efficiency improvements in both the CPU and GPU offset the clockspeed disadvantages.


Steamroller Improvements

Some years back AMD decided to go the CMT (clustered multi-threading) route for multi-threaded efficiency vs. die cost.  The first product to sport these new cores was the Bulldozer based FX-8150.  The results were not very positive.  The part showed some real issues with power consumption, heat production, and single threaded performance.  While it did very well in heavily multi-threaded apps, it was not exactly a winning formula.  The next update to the architecture was Piledriver.  This is found in both the Trinity/Richland line of APUs as well as the FX 8300/6300/4300 series of parts.  Piledriver had some small improvements in performance per clock, but the biggest improvement was power.  Piledriver did not get as hot or pull as much power per clock as did Bulldozer.

Kaveri introduces the new Steamroller architecture for the CPU portion of the APU.  Steamroller is another improvement over Piledriver, especially in terms of performance per clock.  Kaveri is composed of two Steamroller modules, each containing two cores, so a two module unit can address four threads.  The front end of the module was reworked in a very significant manner to improve both single threaded and multi-threaded performance.

The biggest improvement is the addition of another decoder.  Previous iterations only had one instruction decode unit per module, so each module was limited to one thread per clock.  We can see right off the bat that single threaded performance will suffer because a good portion of the execution units in each core will be waiting for instructions every clock.  Multi-threading also suffers because each module only addresses half of the potential threads vs. core count.
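To make the decoder bottleneck a little more concrete, here is a minimal toy model in Python.  It assumes that a module's decoders are shared evenly among its active threads and that a thread can use at most one decoder per cycle.  That is a simplification for illustration, not a description of AMD's actual front-end scheduling, and the function name and parameters are hypothetical.

```python
def decode_slots_per_thread(decoders: int, active_threads: int, cycles: int) -> float:
    """Toy model: decode slots each thread receives over a number of cycles.

    Assumption (hypothetical): decoders are shared evenly among the active
    threads, and a thread can use at most one decoder per cycle.
    """
    if active_threads <= 0:
        return 0.0
    # Fraction of cycles in which a given thread gets a decoder.
    share = min(1.0, decoders / active_threads)
    return share * cycles


# One shared decoder per module, both cores busy (pre-Steamroller arrangement).
shared = decode_slots_per_thread(decoders=1, active_threads=2, cycles=100)

# Two decoders per module, both cores busy (Steamroller arrangement).
dedicated = decode_slots_per_thread(decoders=2, active_threads=2, cycles=100)

print(shared, dedicated)
```

In this model, two busy threads each receive half of the decode slots when the module has one decoder and all of them when it has two.  A real front end has many more moving parts, so treat this only as a picture of why the second decoder matters.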

AMD did not just stop there.  They improved essentially every piece of the front end, as well as how the D-caches handle and store data.  The integer and floating point units look to be left untouched, but every other aspect of the chip was touched upon and improved by AMD’s engineers.  The integer and floating point/SIMD units were seemingly fast enough for the job, but they just could not be fed data and instructions effectively and efficiently.

AMD showed us estimates of a peak 20% improvement in performance per clock.  They then told us that in most real-world situations that number is likely to be 10%.  Still, this is a pretty big jump in single thread performance, and it will be able to handle multi-threaded loads more efficiently as well.
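As a rough sketch of how a per-clock gain trades off against clock speed, the Python below treats CPU throughput as simply performance per clock multiplied by clock speed.  That model is an assumption made for illustration (it ignores memory, turbo behavior, and workload differences), and the function names are hypothetical.  Only the 10% and 20% figures come from AMD's estimates above.

```python
def relative_performance(ipc_gain: float, clock_ratio: float) -> float:
    """Throughput relative to the older part, assuming perf = IPC * clock.

    ipc_gain    -- fractional per-clock improvement (0.10 means 10%)
    clock_ratio -- new clock divided by old clock (1.0 means unchanged)
    """
    return (1.0 + ipc_gain) * clock_ratio


def break_even_clock_ratio(ipc_gain: float) -> float:
    """Clock ratio at which the newer part merely matches the older one."""
    return 1.0 / (1.0 + ipc_gain)


for gain in (0.10, 0.20):  # AMD's "real-world" and "peak" estimates
    ratio = break_even_clock_ratio(gain)
    print(f"{gain:.0%} per-clock gain: break-even at {ratio:.1%} of the old clock")
```

Under this simple model, a 10% per-clock gain means the new part can give up roughly 9% of its clock speed before it falls behind, and a 20% gain leaves room for roughly a 17% clock deficit.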

Power does not seem to be an issue with this design, though as mentioned in the process section AMD did take a hit in CPU performance in the high TDP range.  With more tweaking of the process we can expect faster parts to be released down the line, but for now the A10-7850K will be the top SKU for this introduction.  Also, AMD will be offering these products in the 15 watt TDP range later on this year.  That is a pretty significant range of TDPs for essentially a single design.  AMD did not disclose all of the power saving features, but they seem to be very comparable to what was introduced with Richland.


Definition of Compute Cores

AMD is coming out with a new description for cores with Kaveri.  Compute cores were bandied about during the tech day, and they actually make a bit of sense.  At CES, NVIDIA came out with their “192 core” Tegra K1, but that actually seems a bit of a misnomer as compared to how AMD is defining “cores”.  Those Tegra cores are more akin to SIMD units than standalone cores.  My understanding is that a single SMX unit could be considered a “compute core”.

On the other hand, AMD’s GCN compute clusters can be defined as cores in a more historical sense.  The top end APU has a total of 12 compute cores; 4 of them are the CPU cores in the Steamroller modules, while the other 8 are the GCN units.  Each GCN unit contains 4 x 16 wide vector units (SIMD), a single scalar unit, branch and message unit, a scheduler, texture and texture fetch units, and a bunch of cache.  Each GCN unit has around 146 KB of cache divided between vector registers, a scalar register, local data share, and L1 cache.  It also has such basics as a program counter, which certainly fits in with their traditional definition of cores.  Each GCN unit can theoretically assign new jobs/work to the CPU when needed.  While you certainly can’t boot up an OS from a GCN core, it can do a significant amount of work independently from the CPU.
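The arithmetic behind the "compute core" count can be written out in a few lines of Python.  The inputs are the figures quoted above for the top end APU; the function names and the term "vector lanes" are my own labels for this sketch, not AMD terminology.

```python
CPU_CORES = 4        # Steamroller cores (two modules, two cores each)
GCN_UNITS = 8        # GCN units on the top end APU
SIMDS_PER_UNIT = 4   # vector units per GCN unit
LANES_PER_SIMD = 16  # width of each vector unit


def compute_cores(cpu_cores: int, gcn_units: int) -> int:
    """AMD's 'compute core' tally: CPU cores plus GCN units."""
    return cpu_cores + gcn_units


def vector_lanes(gcn_units: int, simds_per_unit: int, lanes_per_simd: int) -> int:
    """Total SIMD lanes across all GCN units."""
    return gcn_units * simds_per_unit * lanes_per_simd


print(compute_cores(CPU_CORES, GCN_UNITS))
print(vector_lanes(GCN_UNITS, SIMDS_PER_UNIT, LANES_PER_SIMD))
```

This gives 12 compute cores and 512 vector lanes.  Counting individual lanes as "cores" would produce a much larger headline number than counting the GCN units that contain them, and that is the distinction AMD is drawing.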

January 15, 2014 | 03:05 AM - Posted by razor512

It seems AMD is still in the business of shooting themselves in the foot with an RPG.

After the Phenom II series of CPUs, their new core module design has led to around 50% lower IPC on each core.

For example, compare the FX-8350 to the X6 1100T: it beats it in overall performance by a very small margin, and largely fails against it when it comes to single threaded performance, all while having a 700MHz clock speed lead, across 8 cores, over the Phenom II.

AMD is still using that flawed design, and it can barely compete with a dual core Core i3. This is a low end part and thus it will not attract high end users, and it will not attract gamers.

It is very niche because it will struggle to even attract general computer users, as the most common tasks done by general/entry level users rely more on single threaded performance, where the Core i3 offers nearly twice the single threaded performance.

Furthermore, this CPU is unlikely to attract even casual gamers, as the games they commonly play do not even need the GPU horsepower of the APU, and if they are into running demanding games, they will likely want to run them at high settings. But it will be too much to ask to have them give up nearly half of their single threaded performance for slightly better GPU performance (keep in mind that browsers and many other common applications are still single threaded).

January 15, 2014 | 04:18 AM - Posted by dragosmp (not verified)

AT implies AMD has one more Bulldozer-based design, Excavator, before moving to a new architecture.

I think someone at AMD must have realized that speed demon CPU architectures aren't appropriate for nowadays' computing demands and will fall back to a wide and high IPC core. They may do as Intel with Conroe and fall back to their mobile architecture: a modified (wider and cache-beefed) Jaguar. That however takes time, so in 2016 maybe we'll see a competitive AMD CPU.

Bone stock Phenom II on 32/28nm would be a better CPU than the FX line, but that train left the station years ago.

January 15, 2014 | 04:51 AM - Posted by razor512

For me I am just annoyed. I currently have a Phenom II X6 1075T overclocked to 4GHz, and to match its performance, an 8350 would have to be overclocked to nearly 5GHz.

While there are some things that it does faster, overall the core module crap has led to a slower CPU because the vast majority of the computing we do still relies heavily on single threaded performance.

This is why Intel has been doing so well. Virtually every new generation increased IPC. After the Pentium 4 and Pentium D, they made sure to never sacrifice IPC for a higher core count.

AMD on the other hand slowly increased IPC until the Phenom II, and then took a massive dive in order to make an 8 "core" CPU.

AMD's core module is the equivalent of going to the pizza shop and ordering 2 large pizzas, and the shop simply putting an extra crust over the top of a single large pizza. Sure, it is more food, but it is no longer as good and it is not 2 large pizzas.

I cannot see AMD as a valid choice now until they step away from this core module crap and go back to improving IPC.

High core counts are of very little benefit. Multiple CPU cores do not scale perfectly; 1 core at twice the IPC will perform better than 2 cores at half the IPC each.

AMD should have built upon the Phenom II and made it 28nm or smaller, and optimized it to improve IPC.

January 16, 2014 | 02:23 AM - Posted by Guest (not verified)

Amen to that.

January 16, 2014 | 02:27 AM - Posted by Buildzoid (not verified)

According to HWbot, a Phenom II X6 1075T @ 4.2GHz gets close to stock FX-8350 multi-threaded performance, whereas most FX 8 cores at over 4.5GHz beat the 1075T. FX IPC changes according to how many cores are loaded: if you only use 1 core, the stock FX-8350 gets around 1.23 points (a Phenom II X4 I had got 1.25 at 4GHz) in the single threaded Cinebench test, but once you use all 8 cores it will only just get over 7 points, which points to the fact that once 2 cores in a module have to work at the same time they slow down due to the resource sharing. So no, your 1075T is not better than an FX-8350; it's close but not better.
BTW, with very repetitive parallel workloads like crunching prime numbers/video encoding, an FX-8350 will catch up to and sometimes beat an i7 3930K/3960X/4930K/4960X.

January 15, 2014 | 04:04 AM - Posted by Prodeous

@ Ryan and PCPer team.

Are you planning to compare Kaveri vs Richland vs FX series at the same clock to see how the IPC truly compare?

It would be nice to see Steamroller B vs Piledriver B vs Piledriver A cores.

A-B referring to APU vs non-APU, which also contains the L3 cache that is missing from Kaveri.

Heck, you can even throw in the first Bulldozer to give us a view of how much improvement the Bulldozer series has brought over time.

January 15, 2014 | 01:45 PM - Posted by Anonymous (not verified)

Strange thing about those Kaveri 8 compute "Cores" (on the GPU side at least) is that they (the ACEs) can actually do context switching, which no prior AMD GPUs could do, according to Charlie the D., who says "Once a GPU can context switch, it is essentially a very wide heterogeneous CPU, and that is exactly the point of Kaveri." and "...just wait until the software catches up...", so there is not a lot of software out there compiled to take advantage of AMD's version of HSA (hQ), atomics, and such, but as soon as the SDKs, frameworks, and APIs begin to take advantage of AMD's special brand of HSA, things should look different. The benchmarking software will have to account for this in some way (when needed), and will have to be re-run to measure this new AMD HSA tuned hardware once the changes work their way into the software ecosystem!
I wonder how this will affect the ray tracing benchmarks, in particular, being able to have 8 of the GPU Cores/ACEs that have the ability to context switch between their own (hQ max depths of up to 8 threads per ACE), all doing ray tracing, in addition to the 4 CPU cores and their SIMD instructions! Kaveri is still, pending the release of truly compatible and optimized software, very much a wait and see situation.

People may not like Charlie D., but his technical hardware analysis skills are top notch!

January 15, 2014 | 02:54 PM - Posted by razor512

I would say also add the phenom II in there

Run them all at 3GHz, or match the clock speed of the slowest CPU, then benchmark the single threaded and multi core performance of each

comparing everything from the Phenom II, to the kaveri.

January 16, 2014 | 12:53 AM - Posted by Steel-Fenix-Computers (not verified)

I'm not a computer engineer (although sometimes I wish I was, given the ODD decisions made by these companies...) but I have to agree that a Phenom II die shrink/tweak would have been best, at least for the FX line, leaving the "experimental" module architecture for the APU line until HSA was fully implemented. A die shrunk Phenom II X4 and X6 with an updated memory controller would still have been able to compete at the time Bulldozer was released if they kept the costs low. It just boggles my mind that no one in-house would have seen this?? They could have even labeled it "Phenom III" instead of resurrecting the FX label. The only thing really keeping Phenoms back nowadays is a few instruction sets and the memory controller!! That architecture at 4GHz+ would have been a hell of a thing!

January 16, 2014 | 02:19 AM - Posted by Guest (not verified)

Amen to that! PII is still their best architecture. They should have just improved it and their high end CPUs would be a far different story.

January 15, 2014 | 02:21 PM - Posted by Anonymous (not verified)

Ryan, you are an expert on PC technology.

It would be great if you could provide your daily use impressions of the AMD A8-7600 65W/45W APU system vs Intel core i3-4330 system.

Newegg is selling the Intel Core i3-4330 for $140 dollars. The AMD A8-7600 APU is priced at $119.

The two systems are in the same price range.

January 15, 2014 | 04:27 PM - Posted by Anonymous (not verified)

No the posting system would not confirm that the post was made, in fact it displayed an error message after each post attempt, and assuming the post itself was not properly posted, the poster continued to try to post!

And Now that the poster thinks about it, maybe the error message was actually concerning the Failure of the post affirmation message, and not the posting mechanism that did its job of posting after each cryptic error message!

So maybe the logic of the posting system needs to be changed to not fully complete the posting transaction, unless the affirmation message transaction completes without Throwing an error!

this post took more than one try also, this posting system breaks down after the post counts reaches past one page in length!

January 16, 2014 | 11:34 AM - Posted by Ryan Shrout

I'll look into it!

January 16, 2014 | 11:34 AM - Posted by Markers

Conversation at work as follows.

AMD is giving a copy of BF4 away with a purchase of the A10-7850. It has been posited that Kaveri will be "good enough" for a low cost system to play BF4 because it's coming with the processor. I've tried to temper their enthusiasm with my own experience running a GTX660 with an i5-3570k.

What is game play going to be like with just the newest APU on a 1920x1080 display?

I assume hybrid CrossFire will also be possible, but what is the max card that could be used (Some thought as high as a Radeon 7850)? I was skeptical, I wasn't sure you'd even be able to run a 77xx card and get any good out of the iGPU.

Good article and insights as always. Keep up the great work.

January 17, 2014 | 03:50 AM - Posted by Fernandk (not verified)

I don't think this CPU should be placed against Intel's GT2. Price-wise and platform-wise, yeah, sure, it makes sense... but generation-wise it definitely doesn't. Instead I think it should be compared with Intel's GT3 or even Intel's GT3e.

January 19, 2014 | 02:23 AM - Posted by pwrntspd (not verified)

Well, I can't say that I'm not a little disappointed in AMD's latest desktop offering. That said, I'm now very excited to see how this new manufacturing process and low power tuning have worked for their new mobile Kaveri. I don't expect AMD will ever conquer the high end CPU market again, but I think they have all the tools to dominate in tablets and convertible notebooks. It's a matter of where Read leads the company. Either way, great review Josh and Ryan; I look forward to mobile reviews of Kaveri (I hope) by second quarter.

January 20, 2014 | 02:54 AM - Posted by Martin Trautvetter

Anyone else curious about the power consumption of the 45W setting during other benchmarks after seeing the gaming power consumption chart?

Cause that one looks peculiar.
