AMD A8-7600 Kaveri APU Review - HSA Arrives

Author: Ryan Shrout
Subject: Processors
Manufacturer: AMD

Kaveri Tech, Continued

The Buses

AMD did not talk about a lot of the internal plumbing with Kaveri.  We were able to glean a minimal amount of information, but I stress that it is minimal.

The Onion (coherent) bus and Garlic (non-coherent) bus were both improved over the previous generation of products.  AMD did not go into great detail other than to say that bandwidth is improved.  Low level changes had to be made to these buses to support HSA, but again details were left out.  The memory controller was also massively reworked to support the shared memory architecture as well as provide more performance and efficiency when dealing with such loads.  It looks as though it also supports memory speeds up to DDR3-2400 levels right out of the box.
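To put the DDR3-2400 figure in perspective, here is a rough back-of-the-envelope sketch of theoretical peak memory bandwidth.  AMD did not give us these numbers; the dual-channel configuration and 64-bit (8 byte) channel width are my assumptions, and the function name is purely illustrative.

```python
def ddr3_peak_bandwidth_gbs(transfers_mt_s, channels=2, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).

    channels=2 and bus_bytes=8 are assumptions (dual-channel, 64-bit
    channels), not figures confirmed by AMD in the briefing.
    """
    return transfers_mt_s * 1e6 * bus_bytes * channels / 1e9


# Compare a few common DDR3 speed grades against the DDR3-2400 ceiling.
for speed in (1866, 2133, 2400):
    print(f"DDR3-{speed}: {ddr3_peak_bandwidth_gbs(speed):.1f} GB/s peak")
```

These are ceilings, not measured results, but they show why faster memory matters so much when the CPU and GPU share the same pool.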

Kaveri now officially supports PCI-E 3.0.  This feature was actually designed into Trinity/Richland, but AMD did not spend the time or money to certify their unit for that specification.  When I spoke with them last year about this they simply said it was not really worth it considering the marketplace they were focusing on.  To a great degree, this is likely true.  Trinity/Richland were far less likely to need the high speed interconnects that PCI-E 3.0 offered when it came to RAID controllers, PCI-E SSDs, or graphics cards.  Now that it is 2014, AMD has marked off the PCI-E 3.0 checkbox for their OEM partners and has opened the door for future, higher performing FX processors utilizing the FM2+ socket infrastructure.
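For a sense of what that checkbox is worth, the sketch below works out theoretical per-lane bandwidth for PCI-E 2.0 and 3.0 from their published signaling rates and line encodings (5 GT/s with 8b/10b, 8 GT/s with 128b/130b).  The table and function names are my own, and the x16 slot width is just an illustrative input rather than a statement about how Kaveri allocates its lanes.

```python
# (transfer rate in GT/s, payload bits, total bits on the wire)
GENERATIONS = {
    "PCI-E 2.0": (5.0, 8, 10),     # 8b/10b encoding
    "PCI-E 3.0": (8.0, 128, 130),  # 128b/130b encoding
}


def lane_bandwidth_mb_s(generation):
    """Theoretical one-direction bandwidth of a single lane in MB/s."""
    gt_s, payload_bits, total_bits = GENERATIONS[generation]
    usable_bits_per_s = gt_s * 1e9 * payload_bits / total_bits
    return usable_bits_per_s / 8 / 1e6


for name in GENERATIONS:
    per_lane = lane_bandwidth_mb_s(name)
    # x16 is a hypothetical slot width used only for illustration.
    print(f"{name}: {per_lane:.0f} MB/s per lane, "
          f"{per_lane * 16 / 1000:.2f} GB/s for x16")
```

The near doubling comes mostly from the more efficient encoding rather than from the raw signaling rate alone.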

GCN in Kaveri

Graphics Core Next is the name of the next generation graphics architecture from AMD that was first introduced in early 2012 with the HD 7000 series of parts.  This was designed from the outset to be very efficient and highly programmable.  It also turned out to be very powerful.  The GCN portions included in Kaveri are nearly identical to those in the latest “Hawaii” based R9 290 graphics cards.

Each GCN compute core features 4 x 16 wide vector units, a single scalar unit, plenty of cache, associated texture and texture store units, as well as the scheduler.  A total of 128 flops/clock can be achieved with each compute core, so that adds up rather quickly when there are multiple compute cores running at 720 MHz.  The big improvement is the addition of the shared, coherent unified memory feature that is the foundation of HSA.
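The 128 flops/clock figure falls out of the unit counts above, and a quick sketch makes the arithmetic explicit.  I am assuming each vector lane retires one fused multiply-add per clock, counted as two flops; the compute core counts passed in at the bottom are illustrative inputs, not figures from this page.

```python
VECTOR_UNITS = 4           # per compute core, from the description above
LANES_PER_UNIT = 16        # each vector unit is 16 wide
FLOPS_PER_LANE_CLOCK = 2   # assumption: one FMA per lane per clock = 2 flops


def flops_per_clock():
    """Peak single-precision flops per clock for one compute core."""
    return VECTOR_UNITS * LANES_PER_UNIT * FLOPS_PER_LANE_CLOCK


def peak_gflops(compute_cores, clock_mhz=720):
    """Theoretical peak GFLOPS for a given number of compute cores."""
    return compute_cores * flops_per_clock() * clock_mhz * 1e6 / 1e9


# Compute core counts here are hypothetical examples.
for cores in (1, 6, 8):
    print(f"{cores} compute core(s) @ 720 MHz: {peak_gflops(cores):.2f} GFLOPS")
```

As always, these are theoretical peaks; real workloads will land well below them.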

A total of two RBEs are included in Kaveri, which gives it a fairly decent pixel fill rate as compared to previous integrated parts.  This gives a total of 8 color ROPs and 32 stencil/Z ROPs.  I believe this is double that of previous products from AMD and Intel.
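A rough estimate of what those ROP counts mean for fill rate can be sketched as follows.  The per-RBE split is simply the totals above divided by two, the 720 MHz clock is the GPU clock quoted earlier, and one operation per ROP per clock is my assumption.

```python
RBES = 2
COLOR_ROPS_PER_RBE = 4        # 8 color ROPs total / 2 RBEs
Z_STENCIL_ROPS_PER_RBE = 16   # 32 stencil/Z ROPs total / 2 RBEs
GPU_CLOCK_MHZ = 720           # clock quoted for the compute cores


def peak_rate_gops(units_per_rbe, rbes=RBES, clock_mhz=GPU_CLOCK_MHZ):
    """Peak operations per second in billions, assuming 1 op per unit per clock."""
    return units_per_rbe * rbes * clock_mhz / 1000


print(f"Color fill:     {peak_rate_gops(COLOR_ROPS_PER_RBE):.2f} Gpixels/s")
print(f"Z/stencil rate: {peak_rate_gops(Z_STENCIL_ROPS_PER_RBE):.2f} Gsamples/s")
```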

Kaveri also supports Mantle.  This should be a nice boost in overall performance in games that adopt Mantle.  While Battlefield 4 will support Mantle “soon”, initial results showed approximately a 45% increase in frames per second from the standard Direct3D version to Mantle.  We also saw a few other native Mantle implementations that produced impressive results in performance due to the smaller number of draw calls in complex situations.

Accelerators

APUs are seemingly chock full of accelerators.  These are individual units which are aimed at accelerating specific workloads.  It is far more efficient in these cases to design and implement an accelerator than to utilize the multi-core CPU or the GCN architecture to handle that workload.  This saves on power to a very great degree, all the while minimizing the die footprint of such a unit.

Kaveri includes a very significant and new accelerator with their TrueAudio unit.  This unit contains multiple DSPs to accelerate certain audio features to improve sound quality and 3D immersion.  A handful of games will be coming out in 1H 2014 that will natively support TrueAudio.  If I were to characterize this part, I would say that it is very similar to what Aureal tried to accomplish with their A3D 2.0 implementation.  It is a step above what Creative has done with their latest EAX 5.0 based specification as well.  The real kicker here is that even though Aureal won the lawsuit brought against them by Creative, they spent so much in legal fees that it essentially bankrupted the company.  Creative then swooped in and bought their IP.  Then they sat on it and did absolutely nothing while relying on EAX to push good 3D audio to users.  That was an absolute failure.  AMD is trying to make 3D audio relevant again by introducing their TrueAudio unit.  Having this native to every Kaveri APU shipped will likely help push the specification and support further than if they released a standalone sound card embracing that functionality.

TrueAudio has uses outside of gaming applications.  Nuance is developing a noise reduction addition to their software that will utilize the TrueAudio DSP to accelerate operations for them.  This unit apparently is quite programmable and can be used for a variety of applications.

Video playback and encoding are the two primary accelerators that have been included in APUs since day one.  The VCE 2 (video coding engine) is a highly upgraded unit as compared to VCE 1.  We can see in the slide below the changes between the two.

UVD 4 is the latest iteration of the Unified Video Decoder that was introduced many generations ago with AMD graphics cards.  The only improvement this sees is improved error resiliency.  When something is poorly encoded and contains errors, the UVD unit will no longer lock up and keep showing the last good frame while the audio keeps moving forward.  Instead, the corrupted frame will be skipped and the video will move forward in sync with the audio.

AMD does not have an H.265 decoder yet, but it will be supported through the GCN units.  This does expend more power than a more focused accelerator, but those hard coded accelerated units do take time to design and implement.  The flexibility of the GCN architecture allows it to do work such as H.265 decode without maxing out the CPUs to keep up with the workload.

HSA

Kaveri finally fulfills the promise of a true Heterogeneous System Architecture.  The shared memory space and addressing (hUMA), the ability for the GCN units to handle and assign threads as needed (hQ), and a growing software and programming ecosystem that can take advantage of the potential horsepower offered by this APU are working together to maximize the potential of this architecture.

Code complexity with HSA will diminish significantly.  The use of shared memory and pointers allows the CPU and GPU to access memory without having to do copies from CPU memory to GPU memory and vice versa.  Programming tools are also either available or are being developed to support HSA so that programmers do not have to veer too far away from what they are comfortable with.  Java is the big target for AMD right now due to how many current applications are based off of that language.  They are working closely with Oracle to make sure that Java supports HSA at a very low level.  This past year Oracle joined up with the HSA Foundation.
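To illustrate why dropping the copies matters, here is a loose analogy in plain Python.  This is not HSA code and uses no real HSA or OpenCL API; the function names are hypothetical.  The first function mimics the traditional model, where data is copied into a separate "device" buffer and back again, while the second works through a view on the caller's own memory, much as a shared pointer would.

```python
def double_with_copies(host_data):
    """Discrete-memory model: copy in, compute, copy out.

    Returns (result, bytes_copied). The 'device' buffer is just another
    bytearray standing in for separate GPU memory.
    """
    device_buffer = bytearray(host_data)           # "upload" copy
    for i in range(len(device_buffer)):
        device_buffer[i] = (device_buffer[i] * 2) % 256
    result = bytearray(device_buffer)              # "download" copy
    return result, 2 * len(host_data)


def double_in_shared_memory(shared):
    """Shared-memory model: operate in place through a view, no copies.

    Returns the number of bytes copied, which is always zero here.
    """
    for i in range(len(shared)):
        shared[i] = (shared[i] * 2) % 256
    return 0


data = bytearray([1, 2, 3, 4])
copied_result, copied_bytes = double_with_copies(data)
shared_bytes = double_in_shared_memory(memoryview(data))
print(copied_result == data, copied_bytes, shared_bytes)
```

Both paths produce the same answer, but the second never duplicates the buffer, and the caller does not have to manage two sets of memory.  That is the essence of the code complexity argument.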

The flexibility of HSA was also mentioned above.  New codecs such as H.265/HEVC are not supported with current accelerators, but can be accelerated through OpenCL.  This will be true for other upcoming standards that do not yet have accelerators designed for them, or would run more efficiently on massively parallel units rather than multi-core CPUs.

HSA is supported through software like OpenCL or C++ AMP, but some of the low level OS routines will not catch up to HSA for a while.  Linux will be receiving such updates first, but it will still be a couple of years down the road after HSA is officially ratified by the Foundation.

Kaveri: A Leap

We have been learning about Kaveri for years now.  Few of the details have been hidden from us, and certainly not for long.  Processor Forums, editor’s days, leaks, and investor meetings have taught us a lot about what AMD wants to do with their APUs.  Their goals are pretty lofty, but there is a lot of momentum swinging towards heterogeneous systems.  ARM is pushing that way, NVIDIA has a big stake in GPGPU, and even Intel is pushing massively parallel computing (though in a different way).

Kaveri is a complex and potentially groundbreaking part.  One of the really big strengths of the chip is that a user does not lose the performance or functionality of the graphics portion when using a separate video card.  This could potentially have a big impact on applications which can leverage that piece of silicon.  Think of games with lots of AI and physics computations being done on the APU while the graphics card handles only the tessellation, geometry, vertex, and pixel shading.  AI and physics on an APU with shared memory is far more efficient than if running on a standalone card with its own memory.  Things like collision and interaction will be faster and more organic because a program can utilize the same memory space for the CPU and GPU portions of the APU.

Ryan now takes over with the hard numbers on this APU and we get to hear his impressions of the architecture after testing it for the past few days.

January 15, 2014 | 03:05 AM - Posted by razor512

It seems AMD is still in the business of shooting themselves in the foot with an RPG.

After the Phenom II series of CPUs, their new core module design has led to around 50% lower IPC on each core.

For example, compare the fx8350 to the X6 1100t: it beats it in overall performance by a very small margin, and largely fails against it when it comes to single threaded performance, all while having a 700MHz lead, across 8 cores, over the Phenom II.

AMD is still using that flawed design, and it can barely compete with a dual core, core i3. This is a low end part and thus it will not attract high end users, and it will not attract gamers.

It is very niche because it will struggle to even attract general computer users, as the most common tasks done by general/entry level users rely more on single threaded performance, where the core i3 offers nearly twice as much.

Furthermore, this CPU is unlikely to attract even casual gamers, as the games they commonly play do not even need the GPU horsepower of the APU, and if they are into running demanding games, they will likely want to run them at high settings. But it will be too much to ask to have them give up nearly half of their single threaded performance for slightly better GPU performance (keep in mind that browsers and many other common applications are still single threaded).

January 15, 2014 | 04:18 AM - Posted by dragosmp (not verified)

AT implies AMD has one more Bulldozer-based design, the Excavator, before moving to a new architecture.

I think someone at AMD must have realized that speed demon CPU architectures aren't appropriate for nowadays' computing demands and will fall back to a wide and high IPC core. They may do as Intel with Conroe and fall back to their mobile architecture: a modified (wider and cache-beefed) Jaguar. That however takes time, so in 2016 maybe we'll see a competitive AMD CPU.

Bone stock Phenom II on 32/28nm would be a better CPU than the FX line, but that train left the station years ago.

January 15, 2014 | 04:51 AM - Posted by razor512

For me I am just annoyed, I currently have a Phenom II x6 1075t overclocked to 4GHz, and to match its performance, an 8350 would have to be overclocked to nearly 5GHz.

While there are some things that it does faster, overall the core module crap has led to a slower CPU because the vast majority of the computing we do still relies heavily on single threaded performance.

This is why Intel has been doing so well. Virtually every new generation increased IPC. After the Pentium 4 and Pentium D, they made sure to never sacrifice IPC for a higher core count.

AMD on the other hand slowly increased IPC until the Phenom II, and then took a massive dive in order to make an 8 "core" CPU.

AMD's core module is the equivalent of going to the pizza shop and ordering 2 large pizzas, and the shop simply putting an extra crust over the top of a single large pizza. Sure, it is more food, but it is no longer as good and it is not 2 large pizzas.

I cannot see AMD as a valid choice now until they step away from this core module crap and go back to improving IPC

High core counts are of very little benefit. Multiple CPU cores do not scale perfectly; 1 core at twice the IPC will perform better than 2 cores at half the IPC each.

AMD should have built upon the Phenom II and made it 28nm or smaller, and optimized it to improve IPC

January 16, 2014 | 02:23 AM - Posted by Guest (not verified)

Amen to that.

January 16, 2014 | 02:27 AM - Posted by Buildzoid (not verified)

According to HWbot, a Phenom II x6 1075T @ 4.2Ghz gets close to stock FX8350 multi threaded performance, whereas most FX 8 cores at over 4.5Ghz beat the 1075T. FX IPC changes according to how many cores are loaded. If you only use 1 core, the stock FX 8350 gets around 1.23 points in the single threaded Cinebench test (a Phenom II x4 I had got 1.25 at 4Ghz), but once you use all 8 cores it will only just get over 7 points, which points to the fact that once 2 cores in a module have to work at the same time they slow down due to the resource sharing. So no, your 1075T is not better than an FX8350; it's close but not better.
BTW with very repetitive parallel work loads like crunching prime numbers/video encoding an FX 8350 will catch up to and sometimes beat an i7 3930k/3960X/4930k/4960X.

January 15, 2014 | 04:04 AM - Posted by Prodeous

@ Ryan and PCPer team.

Are you planning to compare Kaveri vs Richland vs FX series at the same clock to see how the IPC truly compare?

It would be nice to see Steamroller B vs Piledriver B vs Piledriver A cores.

A-B referring to APU vs non APU, which also contains the L3 cache that is missing from Kaveri.

Heck, you can even throw in the first Bulldozer to give us a view of how much improvement the Bulldozer series has brought over time.

January 15, 2014 | 01:45 PM - Posted by Anonymous (not verified)

Strange thing about those kaveri's 8 compute "Cores" (on the GPU side at least) is that they(the ACEs) can actually do context switching, which no prior AMD GPUs could do, according to what Charlie the D., says "Once a GPU can context switch, it is essentially a very wide heterogeneous CPU, and that is exactly the point of Kaveri." and "...just wait until the software catches up...", so there is not a lot of software out there compiled to take advantage of AMD's version of HSA, (hQ) and atomics, and such, but as soon as the SDKs, frameworks, APIs, begin to take advantage of AMD's special brand of HSA, things should look different. The Benchmarking software will have to account for this in some way (when needed), and will have to be re-run to measure this New AMD HSA tuned hardware once the changes work their way into the software ecosystem!
I wonder how this will affect the Ray tracing benchmarks, in particular, being able to have 8 of the GPU Cores/ACEs that have the ability to context switch between their own ( hQ max depths of up to 8 threads per ACE), all doing ray tracing, in addition to the 4 ARM cores, and their SIMD instructions! Kaveri is still, pending the release of truly compatible, and optimized software, very much a wait and see situation.

People may not like Charlie D., but his technical hardware analysis skills are top notch!

January 15, 2014 | 02:54 PM - Posted by razor512

I would say also add the phenom II in there

Run them all at 3GHz, or match the clock speed of the slowest CPU, then benchmark the single threaded and multi core performance of each

comparing everything from the Phenom II, to the kaveri.

January 16, 2014 | 12:53 AM - Posted by Steel-Fenix-Computers (not verified)

I'm not a computer engineer (although sometimes I wish I was given the ODD decisions made by these companies...) but I have to agree that a Phenom II die shrink/tweak would have been best for AMD...at least for the FX line, leaving the "experimental" module architecture for the APU line until HSA was fully implemented. A die shrunk Phenom II x4 and x6 with an updated memory controller would still have been able to compete at the time Bulldozer was released if they kept the costs low enough.....it just boggles my mind that no one in-house would have seen this?? They could have even labeled it "Phenom III" instead of resurrecting the FX label. The only thing really keeping Phenoms back nowadays is a few instruction sets and the memory controller!! That architecture at 4GHz+ would have been a hell of a thing!

January 16, 2014 | 02:19 AM - Posted by Guest (not verified)

Amen to that! PII is still their best architecture. They should have just improved it and their high end CPUs would be a far different story.

January 15, 2014 | 02:21 PM - Posted by Anonymous (not verified)

Ryan, you are an expert on PC technology.

It would be great if you could provide your daily use impressions of the AMD A8-7600 65W/45W APU system vs Intel core i3-4330 system.

Newegg is selling the Intel Core i3-4330 for $140. The AMD A8-7600 APU is priced at $119.

The two systems are in the same price range.

January 15, 2014 | 04:27 PM - Posted by Anonymous (not verified)

No the posting system would not confirm that the post was made, in fact it displayed an error message after each post attempt, and assuming the post itself was not properly posted, the poster continued to try to post!

And Now that the poster thinks about it, maybe the error message was actually concerning the Failure of the post affirmation message, and not the posting mechanism that did its job of posting after each cryptic error message!

So maybe the logic of the posting system needs to be changed to not fully complete the posting transaction, unless the affirmation message transaction completes without Throwing an error!

this post took more than one try also, this posting system breaks down after the post counts reaches past one page in length!

January 16, 2014 | 11:34 AM - Posted by Ryan Shrout

I'll look into it!

January 16, 2014 | 11:34 AM - Posted by Markers

Conversation at work as follows.

AMD is giving a copy of BF4 away with a purchase of the A10-7850. It has been posited that Kaveri will be "good enough" for a low cost system to play BF4 because it's coming with the processor. I've tried to temper their enthusiasm with my own experience running a GTX660 with an i5-3570k.

What is game play going to be like with just the newest APU on a 1920x1080 display?

I assume hybrid CrossFire will also be possible, but what is the max card that could be used (Some thought as high as a Radeon 7850)? I was skeptical, I wasn't sure you'd even be able to run a 77xx card and get any good out of the iGPU.

Good article and insights as always. Keep up the great work.

January 17, 2014 | 03:50 AM - Posted by Fernandk (not verified)

I don't think this CPU should be placed against Intel's GT2. Price-wise and also platform-wise, yeah sure, it makes sense... but generation-wise it definitely doesn't. Instead I think it should be compared with Intel's GT3 or even Intel's GT3e.

January 19, 2014 | 02:23 AM - Posted by pwrntspd (not verified)

Well, I can't say that I'm not a little disappointed in AMD's latest desktop offering. That said, I'm now very excited to see how this new manufacturing process and low power tuning have worked for their new mobile Kaveri. I don't expect AMD will ever conquer the high end CPU market again but I think they have all the tools to dominate in tablets and convertible notebooks. It's a matter of where Read leads the company. Either way, great review Josh and Ryan, I look forward to mobile reviews of Kaveri (I hope) by second quarter.

January 20, 2014 | 02:54 AM - Posted by Martin Trautvetter

Anyone else curious about the power consumption of the 45W setting during other benchmarks after seeing the gaming power consumption chart?

Cause that one looks peculiar.
