High Bandwidth Memory

AMD shared its plans and details of its first HBM implementation with us. Come see what kind of memory system will power Fiji!

UPDATE: I have embedded an excerpt from our PC Perspective Podcast that discusses HBM technology, which you might want to check out in addition to the story below.

The chances are good that if you have been reading PC Perspective or almost any other website that focuses on GPU technologies for the past year, you have read the acronym HBM. You might have even seen its full name: high bandwidth memory. HBM is a new technology that aims to turn the way a processor (GPU, CPU, APU, etc.) accesses memory upside down, almost literally. AMD has already publicly stated that its next generation flagship Radeon GPU will use HBM as part of its design, but it wasn’t until today that we could talk about what HBM actually offers to a high performance processor like Fiji. At its core, HBM drastically changes how the memory interface works, how much power it requires and what metrics we will use to compare competing memory architectures. AMD and its partners started working on HBM with the industry more than 7 years ago, and with the first retail product nearly ready to ship, it’s time to learn about HBM.

We got some time with AMD’s Joe Macri, Corporate Vice President and Product CTO, to talk about AMD’s move to HBM and how it will shift the direction of AMD products going forward.

The first step in understanding HBM is to understand why it’s needed in the first place. Current GPUs, including the AMD Radeon R9 290X and the NVIDIA GeForce GTX 980, utilize a memory technology known as GDDR5. This architecture has scaled well over the past several GPU generations, but we are starting to enter the world of diminishing returns. Balancing memory performance and power consumption is always a tough battle; just ask ARM about it. On the desktop we have much larger power envelopes to work within, but the power curve GDDR5 is on will soon hit a wall if you plot it far enough into the future. The result will be either graphics cards with drastically higher power consumption or stalled performance improvements in the graphics market – something we have not really seen in its history.

While it’s clearly possible that current and maybe even next generation GPU designs could still depend on GDDR5 as the memory interface, a move to a different solution is needed for the future; AMD is just making the jump earlier than the rest of the industry.

But GDDR5 also limits GPU designs and graphics card designs in another way: form factor. Implementing a high performance GDDR5 memory interface requires a large number of chips to reach the required bandwidth levels. Because of that, PCB real estate becomes a concern and routing those traces and chips on a board becomes complicated. And the wider the GPU memory interface (256-bit, 384-bit), the more board space is taken up for the memory implementation. As frequencies increase and power draw goes up on GDDR5, the need for larger voltage regulators becomes a concern.

This diagram provided by AMD shows the layout of the GPU and memory chips required to get the rated bandwidth for the graphics card. Even though the GPU die is a small portion of that total area, the need to surround the GPU by 16 DRAM chips, all equidistant from their GPU PHY locations, takes time, engineering and space.

Another potential concern is that scaling GDDR5 performance beyond where we are today will cause issues with power. More bandwidth requires more power, and DRAM power consumption is not linear; you see a disproportionate increase in power consumption as the bandwidth level rises. As GPUs increase compute rates and games demand more pixels for larger screens and higher refresh rates, the demand for more memory bandwidth is not stabilizing and certainly isn’t regressing. Thus a move to HBM makes sense, today.

Historically, when technology comes to an inflection point like this, we have seen the integration of technologies onto the same piece of silicon. In 1989 we saw Intel move the cache and floating point unit onto the processor die, and in 2003 AMD was the first to pull the memory controller out of the north bridge and onto the CPU. Then graphics, the south bridge, even voltage regulation – they all followed suit.

But on-chip integration of DRAM is problematic. The process technology used for GPUs and high performance processors traditionally differs greatly from that used for DRAM chips. Transistor density for a GPU is not nearly at the level of density for DRAM, so putting both on the same piece of silicon would degrade the maximum quality and performance (or power consumption) of both. It might be possible to develop a process technology that works for both at the same level as current implementations, but that would drive up production cost – something all parties would like to avoid.

The answer for HBM is an interposer. The interposer is a piece of silicon that both the memory and the processor reside on, allowing the DRAM to sit in very close proximity to the GPU/CPU/APU without being on the same physical die. This close proximity allows for several very important characteristics that give HBM its advantages over GDDR5. First, it allows for extremely wide communication buses: rather than 32 bits per DRAM we are looking at 1024 bits for a stacked array of DRAM (more on that in a minute). Being closer to the GPU also means the clocks that regulate data transfer between the memory and processor can be simplified, and slowed, to save power and reduce design complexity. As a result, the proximity of the memory means that the overall memory design and architecture can improve performance per watt to an impressive degree.

Integration of the interposer also means that the GPU and the memory chips can be made on different process technologies. If AMD wants to use the 28nm process for its GPU but wants to utilize 19nm DRAM, it can do that. The interposer itself, also made of silicon, can be built on a much larger and more cost efficient process technology as well. AMD’s first interposer has no active transistors and essentially acts like a highway for data to move from one logic location to another: memory to GPU and back. At only 100 microns thick, the interposer will not add much to the z-height of the product, and with tricks like double exposures you can build an interposer big enough for any GPU and memory requirement. As an interesting side note, AMD’s Joe Macri did tell me that the interposer is so thin that holding it in your fingers results in a sheet-of-paper-like flopping.

AMD’s partnerships with ASE, Amkor and UMC are responsible for the manufacturing of this first interposer – the first time I have heard UMC’s name in many years!

So now that we know what an interposer is and how it allows the HBM solution to exist today, what does the high bandwidth memory itself bring to the table? HBM is DRAM-based but was built with low power consumption and ultra wide bus widths in mind. The idea was to target a “wide and slow” architecture, one that scales up to high amounts of bandwidth and where latency wasn’t as big of a concern. (Interestingly, latency was improved in the design without intent.) The DRAM chips are stacked vertically, four high, with a logic die at the base. The DRAM dies and logic die are connected to each other with through silicon vias (TSVs), small holes drilled in the silicon that permit die-to-die communication at incredible speeds. Allyn taught us all about TSVs back in September of 2014 after a talk at IDF, and if you are curious about how this magic happens, that story is worth reading.

Note: In reality the GPU die and the HBM stack are approximately the same height.

Where the HBM stack logic die meets the interposer, micro-bumps are used for a more traditional communication, power transfer and installation method. These pads are also used to connect the GPU/APU/CPU to the interposer and the interposer to the package substrate.

Moving the control logic of the DRAM to the bottom of the stack allows for better utilization of die space as well as allowing for closer proximity of the PHYs (the physical connection layer) of the memory to the matching PHYs on the GPU itself. This helps to save power and simplify design.

Each memory stack of HBM 1 (more on that designation later) is composed of four 256MB DRAM dies for a total of 1GB of memory per stack. When compared to a single DRAM chip of GDDR5 (essentially a stack of one), the HBM offering changes specifications in nearly every way. The width of the HBM stack’s interface is 1024 bits, though the clock speed is reduced substantially to 500 MHz. Even with GDDR5 hitting clock speeds as high as 1750 MHz, the bus width offsets that change in favor of HBM, resulting in total memory bandwidth of 128 GB/s per stack, compared to 28 GB/s per chip on GDDR5. Because of the changes to clocking styles and rates, the HBM stacks can operate at 1.3V rather than 1.5V.
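To sanity check those figures, here is a minimal back-of-the-envelope calculation (Python, for illustration only). The transfers-per-clock values are my assumptions for reconciling the quoted clock speeds with the quoted bandwidth numbers, not figures AMD spelled out directly.

```python
# Back-of-the-envelope peak bandwidth math for the figures above.
# Assumptions: GDDR5 moves 4 bits per pin per memory clock (quad-pumped),
# HBM 1 moves 2 bits per pin per clock (double data rate at 500 MHz).

def peak_bandwidth_gbs(bus_width_bits, clock_mhz, transfers_per_clock):
    """Peak bandwidth in GB/s: bus width x effective transfer rate / 8 bits per byte."""
    return bus_width_bits * clock_mhz * 1e6 * transfers_per_clock / 8 / 1e9

gddr5_per_chip = peak_bandwidth_gbs(32, 1750, 4)    # one 32-bit GDDR5 chip at 1750 MHz
hbm1_per_stack = peak_bandwidth_gbs(1024, 500, 2)   # one 1024-bit HBM 1 stack at 500 MHz

print(f"GDDR5 per chip:  {gddr5_per_chip:.0f} GB/s")   # ~28 GB/s
print(f"HBM 1 per stack: {hbm1_per_stack:.0f} GB/s")   # 128 GB/s
```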

The first iteration of HBM on the flagship AMD Radeon GPU will include four stacks of HBM, a total of 4GB of GPU memory. Four stacks at 128 GB/s each works out to 512 GB/s of total bandwidth for the new AMD Fiji GPU; compare that to the R9 290X today at 320 GB/s and you’ll see a raw increase of 60%. Memory power efficiency improves at an even greater rate: AMD claims that HBM will deliver more than 35 GB/s of bandwidth per watt consumed by the memory system, while GDDR5 manages just over 10 GB/s per watt.
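As a rough illustration of what those efficiency claims imply, the sketch below plugs AMD’s quoted numbers into a simple estimate of memory subsystem power. These are marketing-level figures, so treat the resulting wattages as ballpark values rather than measurements.

```python
# Rough, illustrative power estimate from AMD's quoted efficiency figures.
hbm_total_bw = 4 * 128       # GB/s: four HBM stacks at 128 GB/s each
gddr5_total_bw = 320         # GB/s: R9 290X memory bandwidth

hbm_eff = 35                 # GB/s per watt (AMD's "more than 35" claim)
gddr5_eff = 10               # GB/s per watt (AMD's "just over 10" figure)

print(f"HBM:   {hbm_total_bw} GB/s at roughly {hbm_total_bw / hbm_eff:.0f} W")
print(f"GDDR5: {gddr5_total_bw} GB/s at roughly {gddr5_total_bw / gddr5_eff:.0f} W")
# More bandwidth delivered for less than half the memory power, on paper.
```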

Physical space savings are just as impressive for HBM over current GDDR5 configurations. 1GB of GDDR5 takes about 28mm x 24mm of space on a PCB, with all four 256MB packages laid out on the board. The 1GB HBM stack takes only 7mm x 5mm, a savings of roughly 94% in surface area. Obviously that HBM stack has to be placed on the interposer itself, not on the PCB of the graphics card, but the area saved is still real. Comparing the full implementation of Hawaii and its 16 GDDR5 packages to Fiji with its HBM configuration shows us why AMD was adamant that form factor changes were coming soon. What an HBM-enabled design with 4GB of graphics memory can do in under 4900 mm² would take 9900 mm² to implement with GDDR5 memory technology. It’s easy to see now why the board vendors and GPU designers are excited about new places that discrete GPUs could find themselves.
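For the curious, here is that area math worked out from the dimensions quoted above (a quick sketch, not AMD’s own calculation):

```python
# Footprint math for 1GB of memory, using the dimensions quoted above.
gddr5_area_mm2 = 28 * 24    # four 256MB GDDR5 packages: 672 mm^2
hbm_area_mm2 = 7 * 5        # one 1GB HBM stack: 35 mm^2

savings = (1 - hbm_area_mm2 / gddr5_area_mm2) * 100
print(f"GDDR5: {gddr5_area_mm2} mm^2, HBM: {hbm_area_mm2} mm^2 "
      f"({savings:.1f}% less surface area)")   # AMD rounds this to 94%

# Whole-package comparison AMD cites: GPU plus 4GB of memory.
print(f"9900 mm^2 (Hawaii + GDDR5) vs 4900 mm^2 (Fiji + HBM): "
      f"{(1 - 4900 / 9900) * 100:.0f}% smaller")
```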

Besides the space savings and bandwidth improvements, there are likely going to be some direct changes to the GPUs that integrate support for HBM. Die size of the GPU should go down to some degree because of the reduced memory interface. With simpler clocking mechanisms and lower required clock rates, as well as much finer pitches coming into the GPU’s PHY, integrating memory on an interposer changes the die requirements for memory connections. Macri indicated that it would be nearly impossible for any competent GPU designer to build a GPU that doesn’t save die space with a move to HBM over GDDR5.

Because AMD isn’t announcing a specific product using HBM today, it’s hard to talk specifics, but the question of total power consumption improvements was discussed. Even though we are seeing drastic improvements in memory system power consumption, the overall effect on the GPU will be muted somewhat, as the power drawn by the memory controller on a GPU is likely under 10% of the total. Don’t expect a 300 watt GPU that was built on GDDR5 to translate into a 200 watt GPU with HBM. Also interesting: Macri did comment that the HBM DRAM stacks will act as a heatsink for the GPU, allowing the power dissipation of the total package and heat spreader to improve. I don’t think this will mean much in the grand scheme of high performance GPUs, but it may help AMD deal with the power consumption concerns that have plagued it over the last couple of generations.

Moving to a GPU platform with more than 500 GB/s of memory bandwidth gives AMD the opportunity to really improve performance in key areas where memory utilization is at its peak. I would assume we will see 4K and higher resolution performance improvements over previous generation GPUs, where memory bandwidth is crucial. GPGPU applications could also see performance scaling above what we normally see as new GPU generations release.

An obvious concern is the limit of 4GB of memory for the upcoming Fiji GPU – even though AMD didn’t verify that claim for the upcoming release, the first generation of HBM effectively guarantees it will be the case. Is this enough for a high end GPU? After all, both AMD and NVIDIA have been crusading for larger and larger memory capacities, including AMD’s 8GB R9 290X offerings released last year. Will gaming suffer on the high end with only 4GB? Macri doesn’t believe so, mainly because of a renewed interest in optimizing frame buffer utilization. Macri admitted that in the past very little effort was put into measuring and improving the utilization of the graphics memory system, calling it “exceedingly poor.” The solution was to just add more memory – it was easy to do and relatively cheap. With HBM that isn’t the case, as there is a ceiling on what can be offered this generation. Macri told us that with just a couple of engineers it was easy to find ways to improve utilization, and he believes that modern resolutions and gaming engines will not suffer at all from a 4GB graphics memory limit. It will require some finesse from the marketing folks at AMD though…

 

The Future

High bandwidth memory is clearly the future of high performance GPUs, with both AMD and NVIDIA integrating it relatively soon. AMD’s Fiji GPU will include it this quarter and NVIDIA’s next-generation Pascal architecture will use it too, likely in 2016. NVIDIA will have to do a bit of expectation management with AMD being first out of the gate, and AMD will be doing all it can to tout the advantages HBM offers over GDDR5. And there are plenty.

HBM has been teased for a long time…

I’ll be very curious how long it takes HBM to roll out to the entire family of GPUs from either company. The performance advantages high bandwidth memory offers come at some additional cost, at least today, and there is no clear roadmap for getting HBM into non-flagship products. AMD and the memory industry see HBM as a wide scale adoption technology, and Macri expects to see not only other GPUs using it but HPC applications, servers, APUs and more. Will APUs see an even more dramatic and important performance increase when HBM is finally implemented on them? With system memory as the primary bottleneck for integrated GPU performance, it’s hard not to see that being the case.

When NVIDIA gets around to integrating HBM we’ll have another generational jump to HBM 2 (cleverly named). The result will be stacks of 4GB each, with bandwidth increasing along a similar multiplier. That would alleviate any concerns over memory capacity on GPUs using HBM and improve the overall bandwidth story yet again; and all of that will be available in the next calendar year. (AMD will integrate HBM 2 at that time as well.)

AMD has sold me on HBM for high end GPUs, and I think that comes across in this story. I am excited to see what AMD has built around it and how it improves the company’s competitive stance against NVIDIA. Don’t expect to see dramatic decreases in total power consumption with Fiji simply due to the move away from GDDR5, though every bit helps when you are trying to offer improved graphics performance per watt. How a 4GB limit on the memory system of a flagship card in 2015-2016 will pan out is still a question to be answered, but the additional bandwidth HBM provides offers never-before-seen flexibility to the GPU and software developers.

June everyone. June is going to be the shit.