Clean Sheet and New Focus
It is no secret that AMD has been struggling for some time. The company has had success through the years, but it seems that the last decade has been somewhat bleak in terms of competitive advantages. The company has certainly made an impact in throughout the decades with their 486 products, K6, the original Athlon, and the industry changing Athlon 64. Since that time we have had a couple of bright spots with the Phenom II being far more competitive than expected, and the introduction of very solid graphics performance in their APUs.
Sadly for AMD their investment in the “Bulldozer” architecture was misplaced for where the industry was heading. While we certainly see far more software support for multi-threaded CPUs, IPC is still extremely important for most workloads. The original Bulldozer was somewhat rushed to market and was not fully optimized, while the “Piledriver” based Vishera products fixed many of these issues we have not seen the non-APU products updated to the latest Steamroller and Excavator architectures. The non-APU desktop market has been served for the past four years with 32nm PD-SOI based parts that utilize a rebranded chipset base that has not changed since 2010.
Four years ago AMD decided to change course entirely with their desktop and server CPUs. Instead of evolving the “Bulldozer” style architecture featuring CMT (Core Multi-Threading) they were going to do a clean sheet design that focused on efficiency, IPC, and scalability. While Bulldozer certainly could scale the thread count fairly effectively, the overall performance targets and clockspeeds needed to compete with Intel were just not feasible considering the challenges of process technology. AMD brought back Jim Keller to lead this effort, an industry veteran with a huge amount of experience across multiple architectures. Zen was born.
Hot Chips 28
This year’s Hot Chips is the first deep dive that we have received about the features of the Zen architecture. Mike Clark is taking us through all of the changes and advances that we can expect with the upcoming Zen products.
Zen is a clean sheet design that borrows very little from previous architectures. This is not to say that concepts that worked well in previous architectures were not revisited and optimized, but the overall floorplan has changed dramatically from what we have seen in the past. AMD did not stand still with their Bulldozer products, and the latest Excavator core does improve upon the power consumption and performance of the original. This evolution was simply not enough considering market pressures and Intel’s steady improvement of their core architecture year upon year. Zen was designed to significantly improve IPC and AMD claims that this product has a whopping 40% increase in IPC (instructions per clock) from the latest Excavator core.
AMD also has focused on scaling the Zen architecture from low power envelopes up to server level TDPs. The company looks to have pushed down the top end power envelope of Zen from the 125+ watts of Bulldozer/Vishera into the more acceptable 95 to 100 watt range. This also has allowed them to scale Zen down to the 15 to 25 watt TDP levels without sacrificing performance or overall efficiency. Most architectures have sweet spots where they tend to perform best. Vishera for example could scale nicely from 95 to 220 watts, but the design did not translate well into sub-65 watt envelopes. Excavator based “Carrizo” products on the other hand could scale from 15 watts to 65 watts without real problems, but became terribly inefficient above 65 watts with increased clockspeeds. Zen looks to address these differences by being able to scale from sub-25 watt TDPs up to 95 or 100. In theory this should allow AMD to simplify their product stack by offering a common architecture across multiple platforms.
Bristol Ridge Takes on Mobile: E2 Through FX
It is no secret that AMD has faced an uphill battle since the release of the original Core 2 processors from Intel. While stayed mostly competitive through the Phenom II years, they hit some major performance issues when moving to the Bulldozer architecture. While on paper the idea of Chip Multi-Threading sounded fantastic, AMD was never able to get the per thread performance up to expectations. While their CPUs performed well in heavily multi-threaded applications, they just were never seen in as positive of a light as the competing Intel products.
The other part of the performance equation that has hammered AMD is the lack of a new process node that would allow it to more adequately compete with Intel. When AMD was at 32 nm PD-SOI, Intel had introduced its 22nm TriGate/FinFET. AMD then transitioned to a 28nm HKMG planar process that was more size optimized than 32nm, but did not drastically improve upon power and transistor switching performance.
So AMD had a double whammy on their hands with an underperforming architecture and limitted to no access to advanced process nodes that would actually improve their power and speed situation. They could not force their foundry partners to spend billions on a crash course in FinFET technology to bring that to market faster, so they had to iterate and innovate on their designs.
Bristol Ridge is the fruit of that particular labor. It is also the end point to the architecture that was introduced with Bulldozer way back in 2011.
Subject: Editorial | September 18, 2015 - 01:00 PM | Josh Walrath
Tagged: Zen, raja koduri, lisa su, Jim Keller, bulldozer, amd
2012 was a significant year for AMD. Many of the top executives left and there were many new and exciting hires at the company. Lisa Su, who would eventually become President and CEO of AMD was hired in January of that year. Rory Read seemed to be on a roll with many measures to turn around the company. He also convinced some big name folks to come back to AMD from other lucrative positions. One of these rehires was Jim Keller.
Jim Keller, breakin it down for AMD. Or doing "The Robot". Or both.
Today it was announced that Jim would be leaving AMD effective Sept. 18th. He was back at AMD for three years and in that time headed up the CPU group. He implemented massive changes that would result in the design of the upcoming Zen architecture. There was a full scale ejection of the Bulldozer concept that powered AMD processors since 2011 with the FX-8150 introduction with the current Excavator core design to last through 2016 with the final product being "Bristol Ridge,"expected next summer. Zen will not ship until late 2016 with the first full quarter of revenue in 2017.
Jim helped to develop the K7 and K8 processors from AMD. He also was extremely influential in the creation of the X86-64 ISA that not only powers AMD’s parts, but also was adopted by Intel after their disastrous EPIC/IA64 ISA failed to go anywhere. His past also includes work at DEC on the Alpha processors and before AMD at Apple working on the A4 and A5 SOCs.
We do not know any of the details about his leaving, and perhaps never will. AMD has released an official statement that “Jim Keller is leaving AMD to pursue other opportunities, effective September 18”. Looking at Jim’s past employment, he seems to move around a bit. Perhaps he enjoys coming into a place, turning things around, implementing some new thinking, but then becomes bored with the daily routine of management, budget, and planning.
In the near future this change will not affect AMD’s roadmaps or product lineups. We still will see Bristol Ridge as the follow-up for Godavari in Summer 2016 and the late 2016 introduction of Zen. What can be said beyond that is hard to quantify. There are a lot of smart and talented people still working at AMD and perhaps this allows someone there to step up and introduce the next generation of architectures and thinking at AMD. Everybody likes the idea of a rockstar designer coming in to shake things up, but time moves on and new people become those rockstars.
We wish Jim well on his new journey and hope that this is not a harbinger of things to come for AMD. Consumers need the competition that AMD brings to the table and we certainly hope we see them continue to release new products and stay on a schedule that will benefit both them and consumers. Perhaps he will join fellow veteran Glenn Henry at VIA/Centaur and produce the next, great X86-64 chip. Perhaps not.
Digging into a specific market
A little while ago, I decided to think about processor design as a game. You are given a budget of complexity, which is determined by your process node, power, heat, die size, and so forth, and the objective is to lay out features in the way that suits your goal and workload best. While not the topic of today's post, GPUs are a great example of what I mean. They make the assumption that in a batch of work, nearby tasks are very similar, such as the math behind two neighboring pixels on the screen. This assumption allows GPU manufacturers to save complexity by chaining dozens of cores together into not-quite-independent work groups. The circuit fits the work better, and thus it lets more get done in the same complexity budget.
Carrizo is aiming at a 63 million unit per year market segment.
This article is about Carrizo, though. This is AMD's sixth-generation APU, starting with Llano's release in June 2011. For this launch, Carrizo is targeting the 15W and 35W power envelopes for $400-$700 USD notebook devices. AMD needed to increase efficiency on the same, 28nm process that we have seen in their product stack since Kabini and Temash were released in May of 2013. They tasked their engineers to optimize their APU's design for these constraints, which led to dense architectures and clever features on the same budget of complexity, rather than smaller transistors or a bigger die.
15W was their primary target, and they claim to have exceeded their own expectations.
Backing up for a second. Beep. Beep. Beep. Beep.
When I met with AMD last month, I brought up the Bulldozer architecture with many individuals. I suspected that it was a quite clever design that didn't reach its potential because of external factors. As I started this editorial, processor design is a game and, if you can save complexity by knowing your workload, you can do more with less.
Bulldozer looked like it wanted to take a shortcut by cutting elements that its designers believed would be redundant going forward. First and foremost, two cores share a single floating point (decimal) unit. While you need some floating point capacity, upcoming workloads could use the GPU for a massive increase in performance, which is right there on the same die. As such, the complexity that is dedicated to every second FPU can be cut and used for something else. You can see this trend throughout various elements of the architecture.
Subject: Processors | April 27, 2015 - 06:06 PM | Josh Walrath
Tagged: Zen, Steamroller, Kaveria, k12, Excavator, carrizo, bulldozer, amd
There are some pretty breathless analysis of a single leaked block diagram that is supposedly from AMD. This is one of the first indications of what the Zen architecture looks like from a CPU core standpoint. The block diagram is very simple, but looks in the same style as what we have seen from AMD. There are some labels, but this is almost a 50,000 foot view of the architecture rather than a slightly clearer 10,000 foot view.
There are a few things we know for sure about Zen. It is a clean sheet design that moves away from what AMD was pursuing with their Bulldozer family of cores. Zen gives up CMT for SMT support for handling more threads. The design has a cluster of four cores sharing 8 MB of L3 cache, with each core having access to 512 KB of L2 cache. There is a lot of optimism that AMD can kick the trend of falling more and more behind Intel every year with this particular design. Jim Keller is viewed very positively due to his work at AMD in the K7 through K8 days, as well as what he accomplished at Apple with their ARM based offerings.
One of the first sites to pick up this diagram wrote quite a bit about what they saw. There was a lot of talk about, “right off the bat just by looking at the block diagram we can tell that Zen will have substantially higher single threaded performance compared to Excavator and the Bulldozer family.” There was the assumption that because it had two 256-bit FMACs that it could fuse them to create a single 512 bit AVX product.
These assumptions are pretty silly. This is a very simple block diagram that answers few very important questions about the architecture. Yes, it shows 6 int pipelines, but we don’t know how many are address generation vs. execution units. We don’t know how wide decode is. We don’t know latency to L2 cache, much less how L3 is connected and shared out. So just because we see more integer pipelines per core does not automatically mean, “Da, more is better, strong like tractor!” We don’t know what improvements or simplifications we will see in the schedulers. There is no mention of the front-end other than Fetch and Decode. How about Branch Prediction? What is the latency for the memory controller when addressing external memory?
Essentially, this looks like a simplified way of expressing to analysts that AMD is attempting to retain their per core integer performance while boosting floating point/AVX at a similar level. Other than that, there is very little that can be gleaned from this simple block diagram.
Other leaks that are interesting concerning Zen are the formats that we will see these products integrated into. One leak detailed a HPC aimed APU that features 16 Zen cores with 32 MB of L3 cache attached to a very large GPU. Another leak detailed a server level chip that will support 32 cores and will be seen in 2P systems. Zen certainly appears to be very flexible, and in ways it reminds me of a much beefier Jaguar type CPU. My gut feeling is that AMD will get closer to Intel than it has been in years, and perhaps they can catch Intel by surprise with a few extra features. The reality of the situation is that AMD is far behind and only now are we seeing pure-play foundries start to get even close to Intel in terms of process technology. AMD is very much at a disadvantage here.
Still, the company needs to release new, competitive products that will refill the company coffers. The previous quarter’s loss has dug into cash reserves, but AMD is still stable in terms of cash on hand and long term debt. 2015 will see new GPUs, an APU refresh, and the release of the new Carrizo parts. 2016 looks to be the make or break year with Zen and K12.
Edit 2015-04-28: Thanks to SH STON we have a new slide that has been leaked from the same deck as this one. This has some interesting info in that AMD may be going away from exclusive cache designs. Exclusive was a good idea when cache was small and expensive, as data was not replicated through each level of cache (L1 was not replicated in L2 and L2 was not replicated in L3). Intel has been using inclusive cache since forever, where data is replicated and simpler to handle. Now it looks like AMD is moving towards inclusive. This is not necessarily a bad thing as the 512 KB of L2 can easily handle what looks to be 128 KB of L1 and the shared 8 MB of L3 cache can easily handle the 2 MB of L2 data. Here is the link to that slide.
The new slide in question.
Subject: Cases and Cooling | January 2, 2014 - 03:42 PM | Jeremy Hellstrom
Tagged: sharkoon, bulldozer
If subtle just isn't your thing then the Sharkoon Bulldozer might be a good case for you. At 480 x 235 x 460mm (18.9 x 9.2 x 18.1") this is a large case and the numerous LED lights will make this case stand out even more, not to mention the unique paint job on the interior. There are 10 drive bays which can be modified for 3.5" or 2.5" drives with one bay removable to give your GPU some extra space. eTeknix liked the bottom mounted PSU and cable runs which make for a clean looking system but you will really have to like the exterior look if you consider buying this case.
"Sharkoon are not what I would call most people’s first choice when it comes to picking a new chassis, at least not here in the UK. However, we’ve seen a couple of Sharkoon products in the eTeknix office over the last couple of years that really impressed us, not only for being great cases, but also because they offered great value for money. This has left us eager to see more from Sharkoon and today we will be taking a look at their new Bulldozer chassis, a budget friendly ATX chassis that is available in a choice of three colours. Blue, Green or Red LED edition are available, with the blue model coming in a grey chassis, while the green and red LED models come in a black chassis."
Here are some more Cases & Cooling reviews from around the web:
- Cooler Master Elite 130 @ Benchmark Reviews
- Antec P100 Case @ Rbmods
- NZXT Phantom 530 Full Tower Case Review @HiTech Legion
- BitFenix Ronin Midi-Tower Computer Case Review @ Madshrimps
- Noctua NH-U14S 140mm Tower CPU Cooler @ Benchmark Reviews
- Noctua NH-U12S CPU Cooler @ Benchmark Reviews
Subject: Processors | April 30, 2013 - 02:04 PM | Josh Walrath
Tagged: amd, FX, vishera, bulldozer, FX-6350, FX-4350, FX-6300, FX-4300, 32 nm, SOI, Beloved
Today AMD has released two new processors that address the AM3+ market. The FX-6350 and FX-4350 are two new refreshes of the quad and hex core lineup of processors. Currently the FX-8350 is still the fastest of the breed, and there is no update for that particular number yet. This is not necessarily a bad thing, but there are those of us who are still awaiting the arrival of the rumored “Centurion”.
These parts are 125 watt TDP units, which are up from their 95 watt predecessors. The FX-6350 runs at 3.9 GHz with a 4.2 GHz boost clock. This is up 300 MHz stock and 100 MHz boost from the previous 95 watt FX-6300. The FX-4350 runs at 3.9 GHz with a 4.3 GHz boost clock. This is 100 MHz stock and 300 MHz boost above that of the FX-4300. What is of greater interest here is that the L3 cache goes from 4 MB on the 4300 to 8 MB on the 4350. This little fact looks to be the reason why the FX-4350 is now a 125 watt TDP part.
It has been some two years since AMD started shipping 32 nm PD-SOI/HKMG products to the market, and it certainly seems as though spinning off GLOBALFOUNDRIES has essentially stopped the push to implement new features into a process node throughout the years. As many may remember, AMD was somewhat famous for injecting new process technology into current nodes to improve performance, yields, and power characteristics in “baby steps” type fashion instead of leaving the node as is and making a huge jump with the next node. Vishera has been out for some 7 months now and we have not really seen any major improvement in regards to performance and power characteristics. I am sure that yields and bins have improved, but the bottom line is that this is only a minor refresh and AMD raised TDPs to 125 watts for these particular parts.
The FX-6350 is again a three module part containing six cores. Each module features 2 MB of L2 cache for a total of 6 MB L2 and the entire chip features 8 MB of L3 cache. The FX-4350 is a two module chip with four cores. The modules again feature the same 2 MB of L2 cache for a total of 4 MB active on the chip with the above mentioned 8 MB of L3 cache that is double what the FX-4300 featured.
Perhaps soon we will see updates on FM2 with the Richland series of desktop processors, but for now this refresh is all AMD has at the moment. These are nice upgrades to the line. The FX-6350 does cost the same as the FX-6300, but the thinking behind that is that the 6300 is more “energy efficient”. We have seen in the past that AMD (and Intel for that matter) does put a premium on lower wattage parts in a lineup. The FX-4350 is $10 more expensive than the 4300. It looks as though the FX-6350 is in stock at multiple outlets but the 4350 has yet to show up.
These will fit in any modern AM3+ motherboard with the latest BIOS installed. While not an incredibly exciting release from AMD, it at least shows that they continue to address their primary markets. AMD is in a very interesting place, and it looks like Rory Read is busy getting the house in order. Now we just have to see if they can curve back their cost structure enough to make the company more financially stable. Indications are good so far, but AMD has a long ways to go. But hey, at least according to AMD the FX series is beloved!
Subject: General Tech | April 30, 2013 - 01:23 PM | Jeremy Hellstrom
Tagged: Steamroller, piledriver, Kaveri, Kabini, hUMA, hsa, GCN, bulldozer, APU, amd
AMD may have united GPU and CPU into the APU but one hurdle had remained until now, the the non-uniformity of memory access between the two processors. Today we learned about one of the first successful HAS projects called Heterogeneous Uniform Memory Access, aka hUMA, which will appear in the upcoming Kaveri chip family. The use of this new technology will allow the on-die CPU and GPU to access the same memory pool, both physical and virtual and any data passed between the two processors will remain coherent. As The Tech Report mentions in their overview hUMA will not provide as much of a benefit to discrete GPUs, while they will be able to share address space the widely differing clock speeds between GDDR5 and DDR3 prevent unification to the level of an APU.
Make sure to read Josh's take as well so you can keep up with him on the Podcast.
"At the Fusion Developer Summit last June, AMD CTO Mark Papermaster teased Kaveri, AMD's next-generation APU due later this year. Among other things, Papermaster revealed that Kaveri will be based on the Steamroller architecture and that it will be the first AMD APU with fully shared memory.
Last week, AMD shed some more light on Kaveri's uniform memory architecture, which now has a snazzy marketing name: heterogeneous uniform memory access, or hUMA for short."
Here is some more Tech News from around the web:
- AMD’s new heterogeneous Uniform Memory Access
- hUMA; AMD’s Heterogeneous Unified Memory Architecture @ Hardware Canucks
- Compro TN50W Cloud Network Camera @ Tweaktown
- Wifi Pineapple project uses updated hardware for man-in-the-middle attacks @ Hack a Day
- New OpenWRT Drops Support For Linux 2.4, Low-Mem Devices @ Slashdot
- HP mashes up ProLiant, Integrity, BladeSystem, and Moonshot server @ The Register
- Acer selling tablet using Intel Y series processor @ The Register
- CERN Celebrates 20 Years of an Open Web (and Rebuilds 1st Web Page) @ Slashdot
- BitFenix 5K YouTube Subscriber Giveaway @ eTeknix
heterogeneous Uniform Memory Access
Several years back we first heard AMD’s plans on creating a uniform memory architecture which will allow the CPU to share address spaces with the GPU. The promise here is to create a very efficient architecture that will provide excellent performance in a mixed environment of serial and parallel programming loads. When GPU computing came on the scene it was full of great promise. The idea of a heavily parallel processing unit that will accelerate both integer and floating point workloads could be a potential gold mine in wide variety of applications. Alas, the promise of the technology did not meet expectations when we have viewed the results so far. There are many problems with combining serial and parallel workloads between CPUs and GPUs, and a lot of this has to do with very basic programming and the communication of data between two separate memory pools.
CPUs and GPUs do not share common memory pools. Instead of using pointers in programming to tell each individual unit where data is stored in memory, the current implementation of GPU computing requires the CPU to write the contents of that address to the standalone memory pool of the GPU. This is time consuming and wastes cycles. It also increases programming complexity to be able to adjust to such situations. Typically only very advanced programmers with a lot of expertise in this subject could program effective operations to take these limitations into consideration. The lack of unified memory between CPU and GPU has hindered the adoption of the technology for a lot of applications which could potentially use the massively parallel processing capabilities of a GPU.
The idea for GPU compute has been around for a long time (comparatively). I still remember getting very excited about the idea of using a high end video card along with a card like the old GeForce 6600 GT to be a coprocessor which would handle heavy math operations and PhysX. That particular plan never quite came to fruition, but the idea was planted years before the actual introduction of modern DX9/10/11 hardware. It seems as if this step with hUMA could actually provide a great amount of impetus to implement a wide range of applications which can actively utilize the GPU portion of an APU.
Subject: Graphics Cards, Processors | January 23, 2013 - 02:42 PM | Ryan Shrout
Tagged: southern islands, sony, ps4, playstation 4, orbis, Kaveri, bulldozer, APU, amd
Earlier today a report from Kotaku.com posted some details about the upcoming PlayStation console, code named Orbis and sometimes just called the PS4. Kotaku author Luke Plunkett got the information from a 90 page PDF that details the development kit so the information is likely pretty accurate if incomplete. It discusses a new controller and a completely new accounts system but I was mostly interested in the hardware details given.
We'll begin with the specs. And before we go any further, know that these are current specs for a PS4 development kit, not the final retail console itself. So while the general gist of the things you see here may be similar to what makes it into the actual commercial hardware, there's every chance some—if not all of it—changes, if only slightly.
This is key to keep in mind because here are the specs listed on the report:
- 8GB of system memory
- 2.2GB of graphics memory
- 4 module (8 core) AMD Bulldozer CPU
- AMD "R10xx" based GPU
- 4x USB 3.0 ports and 2x Ethernet connections
- Blu-ray drive
- 160GB HDD
- HDMI and optical audio output
We are essentially talking about an AMD FX-series processor with a Southern Islands based discrete card and I am nearly 100% sure that this will not match the configuration of the shipping system. Think about it - would a console developer really want to have a processor that can draw more than 100 watts inside its box in addition to a discrete GPU? I doubt it.
Instead, let's go with the idea that this developer kit is simply meant to emulate some final specifications. More than likely we are looking at an APU solution that combines Bulldozer or Steamroller cores along with GCN-based GPU SIMD arrays. The most likely candidate is Kaveri, a 28nm based product that meets both of those requirements. Josh recently discussed the future with Kaveri in a post during CES, worth checking out. AMD has told us several times that Kaveri should be able to hit the 1.0 TFLOPs level of performance and if we compare to the current discrete GPUs would enable graphics performance similar to that of an under-clocked Radeon HD 7770.
There is some room for doubt though - Kaveri isn't supposed to be out until "late Q4" though its possible that the PS4 will be the first customer. It is also possible that AMD is making a specific discrete GPU for implementation on the PS4 based on the GCN architecture that would be faster than the graphics performance expected on the Kaveri APU.
When speaking with our own Josh Walrath on this rumor, he tended to think that Sony and AMD would not use an APU but would rather combine a separate CPU and GPU on a single substrate, allowing for better yields than a combined APU part. In order to make up for the slower memory controller interface (on substrate is not as fast as on-die) AMD might again utilize backside cache, just like the one used on the Xbox 360 today. With process technology improvements its not unthinkable to see that jump to 30 or 40MB of cache.
With the debate of a 2013 or 2014 release still up in the air, there is plenty of time for this to change still but we will likely know for sure after our next trip to Taipei.