Digging into a specific market

This is the last Bulldozer-based design. How will this work going forward?

A little while ago, I decided to think about processor design as a game. You are given a budget of complexity, which is determined by your process node, power, heat, die size, and so forth, and the objective is to lay out features in the way that suits your goal and workload best. While not the topic of today's post, GPUs are a great example of what I mean. They make the assumption that in a batch of work, nearby tasks are very similar, such as the math behind two neighboring pixels on the screen. This assumption allows GPU manufacturers to save complexity by chaining dozens of cores together into not-quite-independent work groups. The circuit fits the work better, and thus it lets more get done in the same complexity budget.

Carrizo is aiming at a 63 million unit per year market segment.

This article is about Carrizo, though. This is AMD's sixth-generation APU, counting from Llano's release in June 2011. For this launch, Carrizo is targeting the 15W and 35W power envelopes for $400-$700 USD notebook devices. AMD needed to increase efficiency on the same 28nm process that we have seen in their product stack since Kabini and Temash were released in May of 2013. They tasked their engineers with optimizing the APU's design for these constraints, which led to dense architectures and clever features on the same budget of complexity, rather than smaller transistors or a bigger die.

15W was their primary target, and they claim to have exceeded their own expectations.

Backing up for a second. Beep. Beep. Beep. Beep.

When I met with AMD last month, I brought up the Bulldozer architecture with many individuals. I suspected that it was a quite clever design that didn't reach its potential because of external factors. As I said at the start of this editorial, processor design is a game and, if you can save complexity by knowing your workload, you can do more with less.

Bulldozer looked like it wanted to take a shortcut by cutting elements that its designers believed would be redundant going forward. First and foremost, two cores share a single floating point (decimal) unit. While you need some floating point capacity, upcoming workloads could get a massive increase in performance from the GPU, which is right there on the same die. As such, the complexity that would be dedicated to every second FPU can be cut and used for something else. You can see this trend throughout various elements of the architecture.

And of course, there were a few instances where they were a bit too aggressive. Josh Walrath, who normally covers AMD products and CPU architectures for us, mentioned that early Bulldozer parts were not able to keep the dual integer units fed with Fetch and Decode in all situations. AMD did not mention anything in particular, but the phrase “Sometimes you get a product back and say 'Yeah, it actually would have been nice to have two of these'” came up in response to my questions. It happens, but it gets smoothed out over time, and we're talking about its fourth generation with Carrizo's “Excavator” cores. We're talking history at this point, but I feel it leads to an important mindset.

The main design choices seem to have pointed toward a universe where developers embrace GPU compute and optimize toward it. This aligns with their core architecture, their spearheading of the HSA initiative, their interest as a company in GPU development, and so forth.

It made sense, too. Tim Sweeney, head of Epic Games, expected (in 2008) that the generation after Unreal Engine 3 would be written in “a real programming language” that could be executed on the GPU, rather than DirectX and OpenGL. A year later, in 2009, he noted that developing a GPGPU application (at the time) required about ten times more effort, and that this was not worth the added control. Software developers were eyeing that direction, and AMD was already working on tools and languages. Now, of course, Mantle led to DirectX 12 and Vulkan, so the trajectory has changed again, and graphics APIs will probably be at the center of it for the foreseeable future.

Will you dig it?

The point I want to highlight is that hardware architects can do a lot with optimizing for their workloads. This is at the center of Carrizo. AMD has picked several use cases, lumped them together into a single product, and told their engineers to focus their design on it. Carrizo is aimed at the $400 to $700 (USD) laptop segment, which is where the bulk of sales occur. AMD tried for the biggest gains at the 15W segment with Excavator, but they also kept 35W in mind.

This leads to specific uses. Gaming is first and foremost for AMD, which I will get to in a minute, but I will lead with video decoding. H.265, also known as HEVC, looks like it will be the major new format for video. Carrizo includes a dedicated HEVC decoder, which also supports H.264, MPEG2, DivX, and other formats of course, but HEVC is the new addition. This should provide a sharp reduction in power consumption as well as smooth playback for the videos it targets (hence why it targets them).

They also shaved power consumption by scaling and post-processing the video as it is delivered to the display. This keeps the GPU powered down and, more interestingly, cuts system memory accesses, as the frames would otherwise need to enter and leave the GPU. AMD claims that this accounts for half a watt in savings during video playback, which is a lot considering that Kaveri used just under 5W total and Carrizo is listed at just under 2W (for 1080p content). Carrizo is advertised as supporting up to 4K video.
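To put that half watt in perspective, here is a rough back-of-the-envelope sketch. It rounds the "just under" figures up to 5W and 2W, and it assumes (my assumption, not something AMD stated) that the 0.5W display-path saving is already included in Carrizo's 2W number.

```python
# Rough playback power figures from the article, rounded up from "just under".
KAVERI_PLAYBACK_W = 5.0
CARRIZO_PLAYBACK_W = 2.0
DISPLAY_PATH_SAVING_W = 0.5  # AMD's claimed saving from scaling at the display

# Total reduction between the two generations during 1080p playback.
total_reduction_w = KAVERI_PLAYBACK_W - CARRIZO_PLAYBACK_W

# Share of that reduction attributable to the display-path feature.
share_of_reduction = DISPLAY_PATH_SAVING_W / total_reduction_w

# Assumption: without the feature, Carrizo would draw the saving on top.
carrizo_without_feature_w = CARRIZO_PLAYBACK_W + DISPLAY_PATH_SAVING_W
cut_from_feature = DISPLAY_PATH_SAVING_W / carrizo_without_feature_w

print(f"Generation-over-generation reduction: {total_reduction_w:.1f} W")
print(f"Share from display-path scaling: {share_of_reduction:.0%}")
print(f"Playback power cut by the feature alone: {cut_from_feature:.0%}")
```

Under those assumptions, this one feature is responsible for roughly a sixth of the total drop, and it trims about a fifth off what Carrizo's playback power would otherwise be.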

This feature is not just for video, either. It can apparently be used by any application, including rendered content such as video games. This just came up as a brief comment mid-keynote, so I cannot elaborate on this.

Gaming is front and center in the Carrizo launch, too, but that is expected. AMD is one of the top two PC graphics companies. This part is designed to run mainstream games, such as DOTA 2, League of Legends, and Counter-Strike: Global Offensive, maxed out at 1080p with at least 30 FPS. Remember, of course, that this is at a 15W design point for $400 to $700 laptops. It will also support AMD Dual Graphics (formerly Hybrid Crossfire) with some of AMD's discrete, mobile GPUs to give it a boost into a comfortable user experience.

The GPU performance is listed at 819 GFLOPS, which puts it in the ballpark of desktop Kaveri (856 GFLOPS), and it is based on the same third-generation GCN cores that you see integrated in Tonga. Specifically, it includes eight of them, which add up to 512 shader units (versus the 1792 of the Tonga-based R9 285 discrete GPU). This GPU design enables FreeSync for notebooks as well as TrueAudio and, of course, the next generation graphics APIs: Mantle, DirectX 12, and Vulkan. It will also support "OpenCL 2.x". I am uncertain what this means for OpenCL 2.1 specifically, or when and if it will arrive, but developers have been waiting for a platform to begin programming on, so it is worth keeping an eye out for drivers.
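As a sanity check on the 819 GFLOPS figure, here is a minimal sketch of the usual peak-throughput arithmetic. It assumes each shader unit retires one fused multiply-add per clock, counted as two floating-point operations. That convention is my assumption, not something taken from AMD's slides, and the helper names are purely illustrative.

```python
# Assumption: one fused multiply-add per shader per clock = 2 FLOPs.
FLOPS_PER_SHADER_PER_CLOCK = 2

COMPUTE_UNITS = 8      # from the article
SHADERS_TOTAL = 512    # from the article
QUOTED_GFLOPS = 819.0  # from the article


def peak_gflops(shaders: int, clock_mhz: float) -> float:
    """Theoretical peak throughput in GFLOPS for a given shader count and clock."""
    return shaders * FLOPS_PER_SHADER_PER_CLOCK * clock_mhz / 1000.0


def implied_clock_mhz(gflops: float, shaders: int) -> float:
    """GPU clock (MHz) that a quoted peak GFLOPS figure implies."""
    return gflops * 1000.0 / (shaders * FLOPS_PER_SHADER_PER_CLOCK)


shaders_per_cu = SHADERS_TOTAL // COMPUTE_UNITS
clock_mhz = implied_clock_mhz(QUOTED_GFLOPS, SHADERS_TOTAL)

print(f"Shaders per compute unit: {shaders_per_cu}")
print(f"Implied GPU clock: {clock_mhz:.0f} MHz")
print(f"Peak at 800 MHz: {peak_gflops(SHADERS_TOTAL, 800):.1f} GFLOPS")
```

Under that assumption, the quoted number works out to 64 shader units per GCN core and a GPU clock of roughly 800 MHz. Keep in mind that this is a theoretical peak, not a measured result.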

As we have talked about over the last few months, Carrizo is also the first HSA 1.0-supporting processor. This is useful for applications that switch heavily between serial and parallel workloads. AI visibility and path-finding are two such tasks, but those will likely not be useful for Carrizo's application in video games, because the integrated GPU is probably not just sitting idle while a discrete GPU does the heavy lifting. It will lower computation time for applications that use the GPU for general compute though, because the data will not need to be moved or copied between compute devices and contexts can schedule work more efficiently. This is utilized in recent versions of Java and Python, which could be useful for developers and users of enterprise software.

Looking forward to a moment of Zen

When you are stuck on a process node, you need to find other ways to increase performance. It would be silly to start with a clean slate for each product and come up with the most efficient circuit possible for each target workload at the time. There is always something you can change or clean up, and those new or modified elements can be carried over into the future. Some can even be carried over to AMD's endeavors with ARM processors. Designers at AMD would say that their job is about finding the correct problems for the engineers to revisit, and the second attempt will always be better than the first. The correct problem is the one that will yield the biggest increases in performance, or decreases in power consumption, for the effort they can afford to give it.

Carrizo is a design aimed at a specific type of user. It is the last of the Bulldozer architecture, leaving room for the Zen architecture to back away from the shared FPU model. The work done on Carrizo will still carry over though, especially in how it integrates a bunch of parts on chip and does so with a significant reduction in die real estate, considering the complexity. Next time, they can optimize elsewhere and do so with a bigger budget, due to smaller fabrication processes and different power requirements.

For now, it will try to make a dent in the mid-range, power-efficient products, focusing on what people actually use those devices for. As always, we will need to wait and see our own benchmarks to quantify this for ourselves.