AMD's Raja Koduri talks moving past CrossFire, smaller GPU dies, HBM2 and more.

Subject: Graphics Cards | March 15, 2016 - 02:02 AM |
Tagged: vulkan, raja koduri, Polaris, HBM2, hbm, dx12, crossfire, amd

After hosting the AMD Capsaicin event at GDC tonight, the SVP and Chief Architect of the Radeon Technologies Group Raja Koduri sat down with me to talk about the event and offered up some additional details on the Radeon Pro Duo, upcoming Polaris GPUs and more. The video below has the full interview but there are several highlights that stand out as noteworthy.

  • Raja claimed that one of the reasons to launch the dual-Fiji card as the Radeon Pro Duo for developers, rather than as a pure Radeon product aimed at gamers, was to “get past CrossFire.” He believes we are at an inflection point with APIs. Where previously you would abstract two GPUs to appear as a single GPU to the game engine, with DX12 and Vulkan the problem is more complex than that, as we have seen in testing with early titles like Ashes of the Singularity.

    But with the dual-Fiji product mostly developed and prepared, AMD was able to find a market between the enthusiast and the creator to target, and thus the Radeon Pro branding was born.

    Raja further expands on it, telling me that in order to make multi-GPU useful and productive for the next generation of APIs, getting multi-GPU hardware solutions in the hands of developers is crucial. He admitted that CrossFire in the past has had performance scaling concerns and compatibility issues, and that getting multi-GPU correct from the ground floor here is crucial.
     

  • With changes in Moore’s Law and the realities of process technology and processor construction, multi-GPU is going to be more important for the entire product stack, not just the extreme enthusiast crowd. Why? Because realities are dictating that GPU vendors build smaller, more power efficient GPUs, and to scale performance overall, multi-GPU solutions need to be efficient and plentiful. The “economics of the smaller die” are much better for AMD (and we assume NVIDIA), and by 2017-2019 this will be the reality of how graphics performance scales.

    Getting the software ecosystem going now is going to be crucial to ease into that standard.
     

  • The naming scheme of Polaris (10, 11…) has no equation, it’s just “a sequence of numbers” and we should only expect it to increase going forward. The next Polaris chip will be bigger than 11, that’s the secret he gave us.

    There have been concerns that AMD was only going to go for the mainstream gaming market with Polaris but Raja promised me and our readers that we “would be really really pleased.” We expect to see Polaris-based GPUs across the entire performance stack.
     

  • AMD’s primary goal here is to get many millions of gamers VR-ready, though getting the enthusiasts “that last millisecond” is still a goal and it will happen from Radeon.
     
  • No solid date on Polaris parts at all – I tried! (Other than that the launches start in June.) Though Raja did promise that after tonight, he will not have his next alcoholic beverage until the launch of Polaris. Serious commitment!
     
  • Curious about the HBM2 inclusion in Vega on the roadmap and what that means for Polaris? Though he didn’t say it outright, it appears that Polaris will be using HBM1, leaving me to wonder about the memory capacity limitations inherent in that. Has AMD found a way to get past the 4GB barrier? We are trying to figure that out for sure.

    Why is Polaris going to use HBM1? Raja pointed towards the extreme cost of building the HBM ecosystem and prepping the pipeline for the new memory technology as the culprit, and AMD obviously wants to recoup some of that cost with another generation of GPU usage.

Speaking with Raja is always interesting and the confidence and knowledge he showcases is still what gives me assurance that the Radeon Technologies Group is headed in the correct direction. This is going to be a very interesting year for graphics, PC gaming and for GPU technologies, as showcased throughout the Capsaicin event, and I think everyone should be looking forward to it.


March 15, 2016 | 02:49 AM - Posted by Prodeous

I was thinking about HBM1 and 4GB limit, and how potentially it could be bypassed on Polaris.

Looking at the layout, there is plenty of room to have 6 dies next to the GPU instead of 4. So if they re-engineered the memory controller and the substrate, then it is quite feasible that they can have a 6GB unit, allowing them to use the 1GB dies.

This in turn would also give them a 50% boost in the memory bandwidth. I also hope they release it at 600 MHz, giving it even more. But that is just wishful thinking.

Still, if they want enthusiasts to be serious about AMD products they have to release greater than 4GB cards with Polaris.

March 15, 2016 | 05:43 AM - Posted by Anonymous (not verified)

HBM2 supports up to 32GB with only 4 cubes, exactly the same layout they used in HBM1. it has the config of 1, 2 , 4, 8 GB per cube , and a config of 2/4/6 cubes on the same chip.
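
As a rough illustration of the arithmetic in the comment above, the sketch below multiplies per-stack capacity by stack count. The function name is made up for this example, and the capacity and stack-count options are simply the ones listed in the comment, not values checked against the JEDEC specification.

```python
def total_capacity_gb(gb_per_stack: int, stacks: int) -> int:
    """Total on-package memory for a given HBM configuration."""
    return gb_per_stack * stacks


# Options as listed in the comment (assumed, not verified against the spec).
PER_STACK_GB = (1, 2, 4, 8)
STACK_COUNTS = (2, 4, 6)

if __name__ == "__main__":
    for stacks in STACK_COUNTS:
        row = ", ".join(
            f"{gb} GB/stack -> {total_capacity_gb(gb, stacks)} GB"
            for gb in PER_STACK_GB
        )
        print(f"{stacks} stacks: {row}")
```

With 8 GB per stack and 4 stacks this gives the 32 GB figure mentioned in the comment.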

March 15, 2016 | 06:15 AM - Posted by Anonymous (not verified)

The die size of HBM2 is much larger than HBM1. It is not a drop-in replacement. They can relatively easily achieve 6 or 8 GB of memory by going with more stacks of HBM1. It may be the case that 6 channels (6x1024) will provide sufficient bandwidth (768 GB/s or more). If they do go this route though, I hope that they go with something that can deliver 8 GB of capacity. Six GB is probably sufficient right now, but may not be going forward.
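
The 768 GB/s figure follows from simple bus arithmetic. A minimal sketch, assuming each HBM1 stack has a 1024-bit interface running at 1 Gb/s per pin (which matches the 128 GB/s per stack quoted later in this thread); the function names are illustrative only.

```python
def stack_bandwidth_gbs(bus_width_bits: int = 1024, gbps_per_pin: float = 1.0) -> float:
    """Peak bandwidth of one stack in GB/s: bits per second across the bus, divided by 8."""
    return bus_width_bits * gbps_per_pin / 8


def total_bandwidth_gbs(stacks: int, bus_width_bits: int = 1024, gbps_per_pin: float = 1.0) -> float:
    """Peak aggregate bandwidth for a number of identical stacks."""
    return stacks * stack_bandwidth_gbs(bus_width_bits, gbps_per_pin)


if __name__ == "__main__":
    for stacks in (4, 6, 8):
        print(f"{stacks} stacks: {total_bandwidth_gbs(stacks):.0f} GB/s peak")
```

Going from 4 to 6 stacks at the same pin rate is the 50% bandwidth increase mentioned earlier in the thread; a higher per-pin rate can be tried by passing a different `gbps_per_pin`.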

March 15, 2016 | 02:28 PM - Posted by Master Chen (not verified)

Maybe I did miss something, but...how is "HBM2's die is much larger" if it's STACKED VERTICALLY? It's supposed to be exactly same footprint as first-gen HBM, just more layers. And according to technology itself, it should rise only in the height, NOT width. It's LITERALLY a multi-layered cake. All FinFET does is just provides availability for bigger memory capacities due to cramming more on the same space thanks to shrink of transistors via smaller tech-process, so how adding two/four/six/eight more layers ATOP the interposer makes it "much larger"? Please elaborate on sane and worthy-enough-of-reading explanation, if you can at all.

March 15, 2016 | 10:11 PM - Posted by Anonymous (not verified)

See the article here:

http://www.anandtech.com/show/9969/jedec-publishes-hbm2-specification

HBM2 packs a lot more memory on each die, in fact it is 8x more. They are 8 Gb die vs. 1 Gb die. I would also assume that the TSVs (through silicon via) would take more area. If you double the number of chips in the stack then you need more area for TSVs to communicate with them. The TSVs go through the die from top to bottom, so they must take more area. HBM2 is not just more HBM1 chips stacked. HBM1 can not be stacked over 4 high; it does not have the TSV routing channels.

Bottom line, HBM1 packages are about 40 mm2 while HBM2 packages are about 92 mm2 each. The packages are going to be very close to the die size since the packaging for placing die on an interposer is minimal. Also, do not assume that the memory is manufactured on the same process as the GPU. That is one of the advantages of a silicon interposer. Process tech optimized for DRAM is quite different from process tech optimized for logic. The stacked memory allows for the memory die to be made on a process optimized for DRAM while the bottom logic die with the interface can be made in a completely different process. I believe the article states that Samsung's HBM2 is made on their 21 nm DRAM process. I don't know what the bottom logic die is. As long as the TSV micro-balls line up, it doesn't matter.

This would be a significant issue if GPUs were still going to be almost 600 mm2 like Fiji. Four 40 mm2 die is only 160 mm2 but four 92 mm2 die will take around 400 mm2 or more with wasted space from die spacing and shape. 14 nm GPUs will be much smaller.
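
The area comparison above can be sketched as follows. The package areas (about 40 mm2 and 92 mm2) are the figures quoted in the comment; the `spacing_overhead` factor is a hypothetical knob added for illustration, not a measured value.

```python
def stacks_area_mm2(stacks: int, package_mm2: float, spacing_overhead: float = 0.0) -> float:
    """Interposer area taken by memory stacks.

    spacing_overhead is a fractional allowance for gaps between die
    (hypothetical; 0.0 means the packages are counted with no wasted space).
    """
    return stacks * package_mm2 * (1.0 + spacing_overhead)


if __name__ == "__main__":
    for name, area in (("HBM1", 40.0), ("HBM2", 92.0)):
        raw = stacks_area_mm2(4, area)
        padded = stacks_area_mm2(4, area, spacing_overhead=0.10)
        print(f"4x {name}: {raw:.0f} mm2 raw, about {padded:.0f} mm2 with 10% spacing")
```

Four HBM2 packages come to 368 mm2 before any spacing is counted, which is where the "around 400 mm2 or more" estimate comes from.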

March 16, 2016 | 11:04 AM - Posted by Anonymous (not verified)

A few of the details are not correct there. It was a quick post from my phone. It looks like the HBM spec may not define much internal to the stack, so it may be possible for SK Hynix to make 8 high HBM1 stacks. Also, I believe current HBM1 is 2 Gb die so the planned HBM2 is 4x the capacity per memory die, not 8x. Anyway, that is probably not part of the specification either, so they could possibly update the capacity to 4 Gb per die if they can stay within the footprint specified by the HBM1 spec. Increases in the number of stacked die would also need to stay within the specified footprint. If the currently shipping die do not have TSV routing channels for 8 high stacks, then this could increase package size.

March 15, 2016 | 05:48 AM - Posted by mon2d-d (not verified)

http://cdn.gpu.news/2016/03/HBM2-Hynix-1-1280x958.png

March 15, 2016 | 06:05 AM - Posted by Anonymous (not verified)

The Fiji die is 596 square mm, which is gigantic. The upcoming 14 nm parts will be significantly smaller. They should have plenty of room for 6 or even 8 stacks of HBM1. It may not be a bad idea to stick with HBM1 for this generation. While HBM2 sounds great, I think it will be quite expensive. The die size of HBM1 is about 40 square mm; HBM2 more than doubles this up to about 92 square mm. Four stacks of HBM2 will take around 400 square mm on the interposer. There will be some wasted space between the die and some wasted space due to the shape of the die. I expect that the GPU will be limited to 300 to 400 square mm with HBM2. The upcoming 14 and 16 nm parts may be limited to a much smaller size anyway due to yields. I am wondering if they will be more in the 200 to 300 square mm range.

March 15, 2016 | 09:05 AM - Posted by Anonymous (not verified)

I don't see any reason why they would be limited to 4 stacks of HBM1. They will not be limited by the interposer size for a 14 nm GPU. It is a big jump from 28 nm planar to 14 nm FinFET. I wouldn't be surprised if the big die part is half the size of the Fiji die.

March 15, 2016 | 10:16 AM - Posted by loguerto (not verified)

You could fit 6 HBM1 memory chips next to an over 500mm^2 die, Polaris would probably get there in a few years so better wait for HBM2. :)

March 15, 2016 | 09:31 PM - Posted by Anonymous (not verified)

I doubt we will see any die get close to 500 mm2 at 14 nm or below. The yields will just be too low for a commercial product. Future GPUs will most likely be much smaller; you will just get multiple GPU die on a silicon interposer.

May 18, 2016 | 03:00 PM - Posted by Maurice Fortin (not verified)

they cannot do HBM1 with more then 4 dies, the interposer was not able to be built to handle it A and B it not really need it, there has been no case where it ran into a memory wall that I have seen, so they do not or have not needed to cram 6-8-12gb on it, so am sure it probably a limitation OR they realized they did not need do this and were better able to keep bill of materials lower, and who knows, maybe by them doing more memory the :loading of it: i.e 1080p-1200p performance would have been hampered worse, am sure there are many reasons, but the simple one is, there have been VERY FEW cases where 4+ GB of Vram have been truly needed(and more memory more that is "hidden" via system which of course if system not up to snuff more Vram kills performance not helps it)

March 15, 2016 | 02:54 AM - Posted by darknas-36

I wonder if it's possible to introduce HBM memory for onboard GPU; that would definitely kill off the budget discrete graphics card. But then again that is wishful thinking, though, as you would need an entirely different motherboard to use HBM memory.

March 15, 2016 | 06:25 AM - Posted by Anonymous (not verified)

You could have an APU with HBM on an interposer. These are planned devices; I believe they have shown slides showing an APU with HBM stack. It is unclear if they will make such a device in a socket though. They would mostly be for mobile. If they did make such a device, then I don't see why they couldn't use it in a standard motherboard. The HBM could easily be configured for graphics use only or as an L4 cache. It wouldn't be much different from what Intel does with their Crystalwell mobile chips. They have an extra memory chip on the package with up to 128 MB of DRAM to act as a cache for the integrated GPU.

March 15, 2016 | 04:05 AM - Posted by Anonymous (not verified)

Potentially a GPU could move to a two-stage memory architecture: 1-2gb of on-interposer HBM, with a larger bank of on-PCB GDDR memory running at a slower clock speed to save power. The downside is complication in memory management for the driver (and for DX12 and Vulkan, for the game developer) and that die area needs to be dedicated for the extra memory controller. The upside is you can produce multiple cards of varying memory capacity from a single chip.

March 15, 2016 | 06:31 AM - Posted by Anonymous (not verified)

That is an interesting idea. If the HBM was set up as a cache, then it could be completely invisible to software, so that is probably not an issue. I suspect it will be simpler to just use more HBM stacks though. HBM1 die size is tiny, so with a much smaller GPU (Fiji on 28 nm is gigantic), they could easily fit 6 or even 8 stacks. If they design the memory controller for up to 8 channels, then they could make 4, 6, or 8 channel devices with 4, 6, or 8 GB capacity.

March 16, 2016 | 11:36 AM - Posted by Anonymous (not verified)

They may just be able to stack more die or increase the capacity of each die. Just increasing the capacity of each die is probably the simplest solution.

May 18, 2016 | 03:03 PM - Posted by M (not verified)

totally different for wiring, power consumption, performance etc, GPU is built to use X memory, far as I know, the ONLY gpu or whatever that were made to use differing memory like this has been consoles, and some more custom processors(Xbox/Xbox1 and Intel Iris based) am quite sure the cost/complexity would not at all be worth it or likely AMD would have done so by now :D

March 15, 2016 | 04:42 AM - Posted by Laughing Man (not verified)

So seeing "Navi Scalability Nexgen Memory" makes me think of a few possibilities.

1. Further investment in HBM allowing vastly increased amount of memory on graphics cards.

2. This is when we will see AMD try to push a unified memory architecture for their GPU's and CPU's.

3. They might be putting R&D towards using 3d Xpoint (or something similar) allowing them to scale to much larger memory sets and once again this could be the key to a unified memory architecture.

Obviously I'm just taking wild stabs in the dark here but I think eventually one way or another GPU and CPU operating from the same memory set is going to be a thing just a matter of when.

March 15, 2016 | 05:56 AM - Posted by nFUn (not verified)

#1 check!
they can ship cards with 16GB of HBM2 this year. 32GB will be only in 2017...
http://cdn.gpu.news/2016/03/HBM2-Hynix-1-1280x958.png

March 15, 2016 | 08:22 AM - Posted by Anonymous (not verified)

It seems to me that AMD just indicated that they will not be shipping HBM2 parts this year. An article over at wccftech indicated that Samsung started their 4 GB HBM2 ramp in Q1 2016 while SK Hynix will not start their HBM2 ramp until Q3 2016. It was reported that Samsung had begun mass production in January. If they are dependent on SK Hynix production starting late this year, then that isn't enough time to get GPUs out until 2017.

AMD will presumably be using SK Hynix while Nvidia will be using Samsung HBM2. I don't know if this is true or why SK Hynix would be behind Samsung if it is. AMD partnered with SK Hynix to create HBM, but perhaps their HBM2 production is delayed due to process technology. HBM2 uses 8 Gb per die rather than 1 Gb, so it presumably requires smaller process tech. The capacity is much larger, but the die size (package size) only goes from ~40 mm2 for HBM1 to ~92 mm2 for HBM2.

This is the AnandTech article talking about the JEDEC standard and Samsung's mass production announcement:

http://www.anandtech.com/show/9969/jedec-publishes-hbm2-specification

March 15, 2016 | 07:51 AM - Posted by Master Chen (not verified)

Rumors that are flying en masse throughout the internet as of this very moment, heavily indicate it's highly likely to be HMC, rather than 3DXP. The structure of HMC itself fits very well under that description of AMD's.

March 15, 2016 | 08:43 AM - Posted by Anonymous (not verified)

There has been talk that the power consumption of HBM will become an issue going forward as they continue to scale the bandwidth up. I believe one of the wccftech articles about the AMD event covers this. While HMC is interesting technology, as far as I know, it takes more power than HBM. Also, it would require a ridiculous number of channels to match the bandwidth provided by HBM. It is only 20 to 30 GB/s per channel read bandwidth. This sounds great compared to DDR4, but not that great compared to HBM. You would need something like 35 channels to match the bandwidth of 4 HBM2 stacks. HMC is designed to operate through a PCB, not a silicon interposer. Plus, HMC is not a jedec standard.
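
The channel count above can be checked with a short calculation. This is a sketch under the thread's own numbers: 20 to 30 GB/s per HMC channel from this comment, and 256 GB/s per HBM2 stack as quoted elsewhere in the thread. The function and constant names are invented for the example.

```python
import math

# Per-stack HBM2 bandwidth as quoted elsewhere in this thread (assumed here).
HBM2_STACK_GBS = 256


def channels_needed(target_gbs: float, per_channel_gbs: float) -> int:
    """Smallest whole number of channels whose combined bandwidth reaches the target."""
    return math.ceil(target_gbs / per_channel_gbs)


if __name__ == "__main__":
    target = 4 * HBM2_STACK_GBS  # four HBM2 stacks
    for per_channel in (20, 30):
        n = channels_needed(target, per_channel)
        print(f"{per_channel} GB/s per channel: {n} channels to reach {target} GB/s")
```

At 30 GB/s per channel this lands on the "something like 35 channels" estimate; at 20 GB/s the count is higher still.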

I suspect Navi will be multiple smaller GPUs on an interposer, possibly with distributed memory. The memory will almost certainly be based on some form of wide IO rather than HMC style interconnect. HMC is the exact opposite of wide IO. There was an extremetech article (I believe) comparing Wide IO, HBM, and HMC a while ago. Wide IO is similar to HBM except it is aimed at lower power solutions, like cell phone SOCs. With how limiting power consumption will be, lower power standards may need to be applied more widely. Perhaps they will actually stack the GPU on top of a memory stack.

March 15, 2016 | 02:38 PM - Posted by Master Chen (not verified)

Maybe "HMC 2.0", or something. I'm really skeptical on the possibility of it turning out 3DXP, though. Because that technology is waaay too fresh/raw AND expensive as F at the same time due to it being Intel-proprietary - even Intel has problems with funding the advanced development of it's own technology, just imagine how much Intel would try to milk from third parties (and ESPECIALLY it's main rival on the global market) by selling the 3DXP license to them. I might turn out wrong about this in the end, but mark my words as of right now: first consumer-grade 3DXP-based products, if/when they'll come out at all, will be costing leg-and-a-half WHILE Intel also will be heavily milking third parties through selling licenses, which in itself heavily inflates the initial prices. In such of a case, which we'll highly likely to see, 3DXP-based products will be costing as much as FOUR equivalent HBM/HBM2-based products. It's Intel, after all...

March 15, 2016 | 04:34 PM - Posted by Anonymous (not verified)

"Intel would try to milk from third parties (and ESPECIALLY it's main rival on the global market) by selling the 3DXP license to them"

Intel does not wholely own the Xpoint IP as Micron is a co-developer, more impotently Intel, and Micron have to offer their Xpoint products for sale on a fair market basis, and if they sale to one they have to sale to all. Micron will be marketing its Xpoint across the marketplace and so will Intel. Both Intel and Micron are under no means forced to share their IP for making Xpoint, but once they begin selling the Xpoint memory to device makers and/or licensing the technology to third parties then they have to offer all the other parties in the marketplace equal access to purchase Xpoint and use it in their devices.

Just as you will see AMD motherboards using Thunderbolt controller chips, you will see Both Intel and Micron selling Xpoint to any and all third parties, for sure Intel and Micron will not turn down any business from AMD or any other third parties interested in Intel's/Micron's NVM products they are in business to make sales and profits on those sales!

Intel will not be able to gouge the third party Xpoint market, as Micron will be there to counter with their own deals.

March 15, 2016 | 04:36 PM - Posted by Anonymous (not verified)

edit: more impotently
to: more importantly

March 15, 2016 | 10:50 PM - Posted by Anonymous (not verified)

First of all, x-point is not suitable for GPU memory in any way, so I don't know where that comes from. Second, HMC is an Intel/Micron product. It is not a JEDEC standard. You think AMD is going to purchase it from Intel or Micron instead of using the memory technology that they helped develop specifically for GPUs?

HMC is also not suitable for a consumer level GPU. Would they use 36 HMC chips to get 1 TB/s instead of just using 4 HBM2 stacks? HMC is simply not a competitor to HBM. They could be complementary if Intel is willing to adopt HBM. HBM can't deliver large capacities that are possible with HMC. If you needed something like 1TB of capacity connected to a single chip, then HMC could deliver this by chaining multiple stacks. HBM could be used more like a cache in that case. Such set-ups are a niche product though. It will be more power efficient than DDR4, so it has use in servers and HPC. Although, it is not replaceable like DDR4 since it is board mounted.

HMC is designed to function through a PCB. It would be a complete waste of power and die area to use on a silicon interposer. I don't see any one abandoning silicon interposers. They allow a revolutionary increase in connectivity. I wouldn't be surprised if Intel invents some technology surprisingly similar to HBM instead of just using standard HBM memory though.

March 16, 2016 | 12:09 AM - Posted by Anonymous (not verified)

Really, HBM is scaling up its capacity, so who is to say that HBM will not have the ability for more than 8GB per stack, and the current JEDEC standard could be amended to go with even wider than 1024 traces to each HBM stack and allow for more bandwidth without increasing memory clocks. HBM stacks will be made with larger and larger amounts of memory per HBM stack, and larger numbers of traces to each stack! HBM will definitely be getting revisions to allow for much more capacity and the JEDEC standard only covers what it takes to make a single stack, so there is no reason to limit the amounts of stacks to only 4 per processor as even 6 or 8 stacks could be accommodated. HBM stacks are made using 28nm, and HBM2 using 20nm, so smaller nodes will allow increasing amounts per stack.

March 16, 2016 | 05:41 PM - Posted by Master Chen (not verified)

Um...HBM2 is supposed to use 14nm and 16nm. It's 20nm only on mobile SoCs and with some of the TSMC's offerings Nvidia going to use. The HBM2 that's currently being developed and manufactured by GF, is 14nm/16nm.

March 16, 2016 | 07:05 PM - Posted by Anonymous (not verified)

No, HBM2 will be made on a memory process at 20nm. The 14/16nm finfet nodes are for logic, not memory. Nobody has announced a finfet based memory process, but I am not entirely sure it would help that much because of the capacitors that DRAM are based around.

March 16, 2016 | 07:08 PM - Posted by Anonymous (not verified)

GF is not designing or producing HBM of any type. GF does not manufacture memory at all. Samsung and SK Hynix will be making HBM(2) and making it on a 20nm memory process.

March 17, 2016 | 06:27 AM - Posted by Master Chen (not verified)

Hynix is a part of GF. What are you smoking, people? GF currently develops for AMD.

March 17, 2016 | 05:29 PM - Posted by Anonymous (not verified)

Hynix isn't associated with GF in any way. GF used to be AMD's fabs but now they are a 3rd party fab just like TSMC. As I said, GF does not manufacture memory of any type. Maybe you should pass me some of what you are smoking?!

March 17, 2016 | 08:15 PM - Posted by Anonymous (not verified)

HBM has a limited ability to scale capacity since it must be placed on an interposer of limited size. The interposers are made with chip manufacturing technology; they are basically very large 65 nm (at the moment) chips. HMC is a board mounted technology; each stack is in a separate BGA package which is mounted on the board. The stacks can be chained, so you can have more than one stack connected on each channel. This can allow for massive capacity out of something like a 4 or 8 channel device. Each channel is similar electrically to a pci-e link. They are 8 or 16-bit differential serial links, so having something like 8 of them in a high end device is doable. This will not compete with HBM bandwidth though.

HBM can scale in bandwidth much more easily than capacity. A high end device could use more than 4 stacks to achieve ridiculous amounts of bandwidth. HMC can scale capacity more easily than bandwidth. HBM capacity will probably be sufficient for almost any consumer needs. HMC will be for HPC applications where you may be dealing with many terabytes of data.

March 16, 2016 | 05:39 PM - Posted by Master Chen (not verified)

>3DXP is not suitable for GPU memory in any way
Where did you get that idea from? There is very little factual information about 3DXP as of right now, and nothing is certain yet. HMC can be used in graphics, and 3DXP resembles HMC's structure a bit, so where are you getting the idea that it can't be used with discrete graphics? If my memory is right, it should be possible to use this similarly to how DRAM is used, so...even if this, by some strange reason, wouldn't be present in discrete graphical accelerators, I still can see it being implemented at least in integrated solutions (a.k.a. APUs/Intel's IG).

March 16, 2016 | 07:07 PM - Posted by Anonymous (not verified)

It's not suitable because it has a limited amount of write cycles, sure far higher than flash, but GPU memory is written to a TON. I mean to put 512GB a sec through 4GB of ram means all of the data is erased 128 times per second.
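
The "128 times per second" figure is bandwidth divided by capacity. A minimal sketch of that worst case, assuming every byte of bandwidth is a write spread evenly over the whole memory; the endurance value below is a hypothetical placeholder, not a published 3D XPoint specification.

```python
def full_overwrites_per_second(write_bandwidth_gbs: float, capacity_gb: float) -> float:
    """How many times per second the whole memory could be rewritten."""
    return write_bandwidth_gbs / capacity_gb


def seconds_to_exhaust(endurance_cycles: float, overwrites_per_second: float) -> float:
    """Time until every cell has used up its write cycles at that rate."""
    return endurance_cycles / overwrites_per_second


# Hypothetical endurance figure, chosen only to make the arithmetic concrete.
HYPOTHETICAL_ENDURANCE_CYCLES = 1_000_000

if __name__ == "__main__":
    rate = full_overwrites_per_second(512, 4)
    seconds = seconds_to_exhaust(HYPOTHETICAL_ENDURANCE_CYCLES, rate)
    print(f"{rate:.0f} full overwrites per second")
    print(f"{seconds / 3600:.1f} hours to reach {HYPOTHETICAL_ENDURANCE_CYCLES:,} cycles")
```

Real workloads mix reads and writes, so this is an upper bound on wear rather than a prediction.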

March 17, 2016 | 06:28 AM - Posted by Master Chen (not verified)

Again - where did you get that "limited amount of write cycles" from?

March 17, 2016 | 05:32 PM - Posted by Anonymous (not verified)

Right from Intel. Check out this piece: http://www.anandtech.com/show/9470 -- as I said, it's a fair bit higher than NAND, but not essentially unlimited which you would need for a GPU's memory pool. Also the non volatile aspect of it brings no benefits in this scenario, plus it's slower than DRAM.

March 17, 2016 | 07:45 PM - Posted by Anonymous (not verified)

X-point is a flash replacement. It can not replace DRAM on a graphics card. The most likely use for it would be for a seldomly written database. You can store a massive database in the x-point and get much higher read speed than a flash device without needing to store the entire database in DRAM. You would still have DRAM for frequently accessed items and as a write cache. X-point is supposed to be cheaper than DRAM but significantly more expensive than flash.

As I already stated, HMC is a suitable replacement for DDR4, but not GDDR5. Intel already places a bit of DRAM on their CPU packages with Crystalwell to feed the IGP, but this is only 128 MB or less. This acts as a graphics cache. I could see them using a stack or two of HMC instead. This would be better than DDR4 alone, but it will not compete with a GPU with HBM. HMC can only deliver 20 to 40 GB/s per channel. HBM can deliver 128 GB/s per stack and HBM2 can deliver 256 GB/s per stack.

HMC will be more expensive and take more power since it would take huge numbers of channels to match HBM bandwidth. It is quite power efficient for how fast it operates, but it is still going to take more power than HBM. It is driving that 20 to 40 GB/s over a 16-bit interface. That interface has to go through 2x the number of solder connections compared to HBM and probably 10x the distance (mm vs. cm).

March 15, 2016 | 04:48 AM - Posted by J-Man (not verified)

Ryan,

You're the Barbara Walters of interviewing GPU Execs!

Great watch as always.

March 15, 2016 | 07:13 AM - Posted by Anonymous (not verified)

I have been saying for a while that 14 or 16 nm GPUs are going to have much more limited die sizes due to yields. We are definitely not going to see gigantic GPUs like Fiji going forward. The software needs to be able to handle rendering at a much finer granularity. If you are placing multiple GPUs on an interposer, you don't want to have to connect 8 GB of memory to each GPU and end up with both sets of memory holding the same data. This is what we have now with multi-GPU on DX11, and it is really inefficient.

I don't know what multi-GPU designs on an interposer are going to look like though. If you can have the GPUs working on separate parts of the problem in smaller chunks, then you don't need anywhere near as much memory per GPU. This probably depends too much on the software though.

Perhaps they could set it up like multi-socket CPU systems, where each CPU has a memory controller and a link to other CPUs which is fast enough to actually share memory. The nvlink tech that Nvidia has talked about is much faster than pci-e, but it is not fast enough to allow GPUs to share memory effectively. On a silicon interposer though, such high speed interconnect may be achievable. If that is the case, then you could just connect a single stack to each GPU and all GPUs could access memory attached to other GPUs. The GPUs would need to be much smaller though. Space on the interposer is limited and HBM2 is 92 square mm per stack. They may be able to fit 2 or 3 GPUs with such a configuration. Four seems like it would be too much, unless the GPUs are really small, maybe around 100 square mm. At that size, the interconnect would probably use up too large a proportion of the die area.

March 15, 2016 | 07:49 AM - Posted by Master Chen (not verified)

>No HBM2 this year
BOOOOOOOOOOO~!!!11

March 15, 2016 | 08:46 AM - Posted by Anonymous (not verified)

As long as their high end card supports more than 4 GB, I don't think it is an issue.

March 15, 2016 | 02:43 PM - Posted by Master Chen (not verified)

For GDDR5X to actually be ANY relevant AT ALL, their LOW tier (read "cheap-ass") offering needs to have at least 4GB of it, while mid/high tier needs to be AT LEAST 8GB. And NO MORE "Rebrandeons". Unfortunately, something tells me that's not going to happen with AMD, because of their financial situation. They simply don't have enough funds for that.
On the other hand, Nvidia has enough funding to pull something like this off, when their GDDR5X-based cards start coming out...meaning that, in the most nearest future, we will be seeing if Nvidia actually tries to abuse this opportunity, or not. Personally, I really don't see why they'd won't do that.

March 15, 2016 | 11:09 PM - Posted by Anonymous (not verified)

We will have to wait and see how well AMD's 14 nm parts fill out their product stack. With how big of a jump 14 nm parts are compared to 28 nm parts, I suspect that all of the 28 nm parts will be left far behind. With two different die, they should be able to make quite a few different products.

I think it will be quite clear when Nvidia releases their high-end parts. You pay a price premium for the Nvidia brand as it is. I would definitely buy an R9 390 or 390x with 8 GB of memory over a 970 with 3.5 GB of memory. I suspect that Nvidia's HBM2 part will not even be a consideration. It is probably going to be ridiculously expensive.

It is still unclear whether these next generation parts will be GDDR5x at all. Even if they are "just" high speed GDDR5 (actual GDDR5 or GDDR5x in some compatibility mode), AMD is already delivering 8 GB of GDDR5 for a good price with the 390 based parts. Since there is such a big push for VR, I suspect the minimum for an actual gaming card will be 4 GB. Probably only really low end cards will be less. The low-end market has been disappearing anyway since IGPs have gotten so much better. There will not be much of a market for a 2 GB card.

March 16, 2016 | 06:33 AM - Posted by Master Chen (not verified)

Why buy an R9 390X with 8GB of memory, when you can buy a "somewhat cheaper" R9 290X with 8GB of memory? It's essentially the same thing, but cheaper. Yes, it's "older", but it performs roughly the same AND it's cheaper due to being "older". There's really no good reason to get an R9 390X instead of an R9 290X (ESPECIALLY if both have 8GB on their board), IMHO.

March 16, 2016 | 11:08 AM - Posted by Anonymous (not verified)

Can you find any 290x with 8 GB for sale? It originally shipped with 4 GB, right? I would assume that they are not making them any more.

March 16, 2016 | 05:53 PM - Posted by Master Chen (not verified)

Sapphire is the one who mainly does the 8GB R9 290Xs. The Vapor-X and Tri-X. I dunno where you live, but there's a ton of them still on sale in Europe, Russia, and Asian regions. The cheapest retail "mint box" one goes for roughly 312$ in my country, while the cheapest R9 390X 8GB offering is currently being sold at 420$, so...y-yeahhh... :\

March 17, 2016 | 07:06 PM - Posted by Anonymous (not verified)

At Newegg in the USA, I see one R9 290X with 8 GB for $350 and that is it. There are no regular 290s or 4 GB 290Xs in stock. You might be able to find some on Amazon or at some smaller component sellers. They definitely don't seem common here, and at $350, there isn't much reason to buy it. You can get a 390 or 390X for a similar price. There are 390s on Newegg for less than $350 (they seem to start at $320) and the cheapest 390X (an XFX) looks like it is $370 right now. The point is, I wouldn't even consider buying a 3.5 GB Nvidia 970 when I can get an 8 GB 290 or 390. The cheapest 970 on Newegg is about the same price as the 390.

May 18, 2016 | 03:08 PM - Posted by M (not verified)

When many of these companies have amazingly strict NDAs, we DO NOT know, no matter who says what. All it takes is a bit of $$ and things are hidden away so the competition does not know better. It happens all the time, especially when billions of $ can be gained or lost with one misspoken word or action, and in this case, massive trade secrets. When you are at the bleeding edge tech-wise (which HBM, nodes, etc. are), there are some serious reasons for them all to keep VERY tight controls, or they could be out of business tomorrow, so to speak. That is fact.

March 15, 2016 | 08:05 AM - Posted by Master Chen (not verified)

"Pro Duo, a brand we've never heard of before"? Ryan, are you smoking Jeremy's weed again, by any chance? TUL has been doing cards under the "Pro Duo" mark (inside their PowerColor in-brand) FOR YEARS. Granted, it might be a different "Pro Duo" here (personally, I still believe they've called it that simply because TUL's PowerColor was the first official OEM for that reference FX2 card they've put in those FNW's Tikis), but the naming is still exactly the same, so saying "this is the very first time we've ever heard of that" is just..unprofessional. :\

March 16, 2016 | 07:12 PM - Posted by Anonymous (not verified)

Dude, he was obviously talking about that being the first time AMD used that branding themselves ...

March 15, 2016 | 08:22 AM - Posted by Visitor (not verified)

I wonder if Polaris 10 and 11 are "completely" different. I mean, how different could it be?

March 15, 2016 | 08:53 AM - Posted by Anonymous (not verified)

Polaris 10 is probably a small die with a GDDR5 memory controller, while Polaris 11 is probably a larger die with an HBM controller. Die sizes will probably be quite small compared to 28 nm sizes though. Fiji is a huge die, but that is most likely only possible since it is made on a process that has been tweaked for the last 5 years.

March 15, 2016 | 10:17 AM - Posted by Despoiler

It's the other way around. 10 is the big die and 11 is the smaller. Last night was the first time the bigger die 10 has been demo'd.

May 18, 2016 | 03:18 PM - Posted by M (not verified)

Probably as different as the Bulldozer/Piledriver-derived CPUs and APUs have been: related, as they are the same family, but the differences could be in things such as memory controllers, GCN cores, etc. It could even be that GF makes P10 and Samsung produces only P11, which would/could make a significant difference in leakage/clock speeds/power etc. AMD is far from a stupid company, and the "alliance" of GF/Samsung/Hynix and the like has extreme vested interests in making them as good as they can, and has likely spent a massive amount of time/resources to understand all the ins and outs to make P10/P11 "fit" the market they will occupy as best they can.

I am just thinking of APU/CPU alone: those derived from Bulldozer/Piledriver/Excavator/Steamroller, though from the same family, have "small die" and "big die/core" parts that are substantially different in numerous ways. It is no different than Nvidia, for example, whose 700/800/900 series cards use different "cores" that, although in the same family, offer substantially different levels of power and such. So IMO I see this as being the exact same thing, and likely a far more different route than AMD has ever taken for GPU-derived products. Maybe, dealing with the PCU crud they had to deal with, they learned it is best to have one part focused on max raw performance with decent power and another on best power with decent performance. It worked extremely well for Nvidia with their 750 Ti, did it not (max efficiency with reasonable performance)?

March 15, 2016 | 12:04 PM - Posted by onion uk (not verified)

cool vid, but my question is... what hot sauce was that? :)

March 15, 2016 | 01:15 PM - Posted by LooseNeutral (not verified)

All I know is we're just seein' shades of what's to come. NVid will have their answer to respond. So be it. All in all it's good to just have live coverage. And the end of it, with Ryan wussing out on the hot hot sauce. Perfect! LOL. Ryan, you did a great job with the Q&A. The ending was a beautiful thing. LOL. Kudos to Raja too, when he said he knew you would connect the dots. LMAO

March 15, 2016 | 01:19 PM - Posted by Anonymous (not verified)

No HBM2 for AMD is a disaster for them.

NVidia will have it and I'm sure it will be great.

March 15, 2016 | 03:00 PM - Posted by Master Chen (not verified)

Nvidia won't have it. They're going with GDDR5X, too. It's a "race of wait". Basically, what's happening right now is kind of like this: Nvidia - USA, Radeon - Russia. Russia has pulled itself out of Syria after finishing its peacemaking mission. Now Russia lets the USA make its move and patiently waits for them to make mistakes. It's a game of nerves and time. Who will be the first to make a major mistake? This is the main reason why AMD decided to skip HBM2 on Polaris - solely because Nvidia decided to utilize GDDR5X instead of moving to HBM right away. Both Polaris and GTX 1xxx will be on GDDR5X, and the HBM2 battle for both companies will come only after that. If Nvidia hadn't moved to GDDR5X, Polaris would've been an HBM2 card, even if it had cost AMD massive yields. AMD is fully ready to do HBM2, but they've deliberately decided to slow down a little bit and wait, while lying in ambush. That's the kind of game those two companies are playing right now. It almost literally became a "Russia VS USA" kind of situation.

March 15, 2016 | 04:57 PM - Posted by Anonymous (not verified)

You are a Fool off on some drug induced tangent, it's AMD/SK Hynix with Hynix being the HBM/HBM2 supplier for AMD, and others after AMD's needs are met, and Samsung for Nvidia's and others for supplies of HBM2. So maybe the suppliers are at fault and supplies are not up to the point where AMD and Nvidia can receive a stable enough supply of HBM2 to commit to full scale production! Both AMD and Nvidia are probably not going to bank on there being enough supply of HBM2 in the pipeline to risk any product delays for their Polaris and Pascal offerings in 2016.

So AMD and Nvidia will not push their luck and suffer any product delays; they will go with GDDR5/5X and wait until there are sufficient HBM2 supplies to move to using it without any fears of supply interruptions! HBM1 is probably going to be limited to where it is currently being used by AMD, where AMD can justify HBM's bandwidth/power savings relative to HBM's high cost (flagship-only products). GDDR5/5X provides sufficient bandwidth for both AMD's and Nvidia's non-flagship GPU lines, so they will stick with the tried and true until HBM's economy of scale and supply stability say otherwise over the next few years.

March 15, 2016 | 06:39 PM - Posted by Master Chen (not verified)

Nice try, but no cigar.

March 16, 2016 | 12:13 AM - Posted by Anonymous (not verified)

Nice try but you are the most egregious fool! Both Nvidia and AMD will not have HBM2 this year, and Nvidia is behind on async compute!

March 16, 2016 | 12:01 PM - Posted by Anonymous (not verified)

nVidia will have an HBM2 Pascal this year, but it will be their high end and expensive Pascal.

While it might be great that AMD can say they got to market first with HBM1, that's all they can really say. AMD continues a downward spiral.

March 17, 2016 | 06:32 AM - Posted by Master Chen (not verified)

1. Pascal won't be coming out this year, it will be just introduced this year. These are not the same things.
2. Nvidia goes GDDR5X as of right now. There will be no HBM2 on either side until Vega comes out.
3. Vega comes out earlier than Pascal, because Pascal will be only introduced this year, not released.

March 17, 2016 | 06:20 PM - Posted by extide

We may see a GP104 released this year, in-fact I would count on it, but yeah, probably not "Big Pascal" / GP100. GP104 will NOT have HBM, though, it will be GDDR5X.

March 15, 2016 | 11:41 PM - Posted by Anonymous (not verified)

Nvidia may have an HBM2 part this year, but it will probably be another Titan X. In fact, if it comes out this year, it will probably be even more outlandishly expensive than a Titan X. If that is the case, then it will essentially be almost entirely for the publicity. Tech sites will review it. People will talk about it way too much. Almost no one will actually buy it, but it will be great publicity for the lower end GDDR based variants. This is how the graphics card market has worked for a while. If AMD can deliver an HBM1 based part with more than 4 GB of memory and at a reasonable price, then that will be a real product.

March 15, 2016 | 11:46 PM - Posted by Anonymous (not verified)

I think it will only be an issue if they are limited to 4 GB still. I also think that if Nvidia has an HBM2 part, it will be so expensive that it will be irrelevant to most of the market. It will be great marketing if they can pull it off though.

March 16, 2016 | 12:04 PM - Posted by Anonymous (not verified)

nVidia has proven time and time again that how expensive a part is does not play a factor. There will always be people who want to pay for the best.

March 16, 2016 | 01:41 PM - Posted by Anonymous (not verified)

Yes, such parts are great for marketing, but I doubt that they sell in high enough volume to be anything but marketing. Nvidia had a similar thing with their ridiculously expensive dual card. AMD just used a water cooler so that they didn't have to use cherry picked parts. Nvidia's part was marketing BS for reviewers to write articles about while AMD delivered a real product. It was the best value in the price range for quite some time.

March 15, 2016 | 01:21 PM - Posted by LooseNeutral (not verified)

Umm, to finish: none of us, you, can have it all now, or this year. Believe it or not. They just told us this is the way it is. SO go ahead and buy nvid, they don't care. That's what I got from this. Attitude. And I love it. NO, what's being released is not for you whining ass gamers. What's being released is for you to whine about later... the game devs' fault. They framed it perfectly!

March 16, 2016 | 09:38 AM - Posted by rl (not verified)

Hey Ryan. Could you pls clarify what you wrote here:

"The next Polaris chip will be bigger than 11, that’s the secret he gave us. "

In the interview Raja only seems to imply that the next Polaris chip will have a higher number (like 12) not that the chip will be bigger than 11. Did he clarify that after the interview or is that a speculation on your part?

March 16, 2016 | 07:10 PM - Posted by Anonymous (not verified)

Raja said literally, "bigger than 11" -- whether he meant the number was bigger (like 12) or the die would be bigger is unclear.

March 17, 2016 | 09:17 AM - Posted by rl (not verified)

You literally just repeated what I wrote? How is that in any way helpful?

March 17, 2016 | 06:32 PM - Posted by Anonymous (not verified)

I am not the previous anonymous poster. Someone else said that Polaris 10 is actually the big die part and that Polaris 11 is the small die. If that is the case, then he may have just been referring to 10 being bigger than 11. I am not sure why they would have assigned the code names like that unless 10 was designed first or something. I would have assumed that 11 would be the big die part.

These code names are not particularly relevant anyway. How many people still refer to the Fury chip by the code name "Fiji"? Also, the big die part may not be that big. Die sizes may be quite limited on 14 nm due to yields. The bigger part should have HBM though, and the HBM controller is actually supposed to take less die area than a GDDR5 controller. This will allow more room for compute resources.

March 17, 2016 | 07:05 PM - Posted by extide

What I was saying is that I don't think Ryan CAN clarify because that is all Raja said.

March 16, 2016 | 11:30 AM - Posted by Anonymous (not verified)

There are a couple of possibilities for getting more than 4 GB out of HBM1. They could use more stacks. Also, I don't think that the internals of the stacks are actually part of the specification. They may be able to increase the number of stacked die to 8. They also may be able to increase the capacity of each die, perhaps going from 2 Gb to 4 Gb. They would just need the die to be small enough to fit within the specified footprint. The 8 Gb die Samsung is using for HBM2 are, I assume, larger than what the HBM1 spec would allow. If they could just double the capacity to 8 GB though, then that would be plenty for this generation. Sixteen GB is massive overkill.

A lot of people are taking SK Hynix's current implementation as used in the Fury cards as the limits of HBM1. I don't think that is the case, but I have not looked at the specifications myself. Going from HBM1 to HBM2, it looks like SK Hynix will switch from 28 nm class DRAM process down to 20 nm class process. The easiest, and probably most economical, way to increase the capacity of HBM1 is to increase the per die capacity to 4 Gb. This would allow for 2 GB, 4 high stacks for a capacity of 8 GB for 4 stacks. I don't know if they would need to make them on the smaller process to fit a 4 Gb die within the specified footprint though.
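
The capacity arithmetic in the comment above can be checked with a short sketch. The function name and parameters below are illustrative assumptions, not anything taken from the HBM specification; the only inputs are the figures quoted in the comment (2 Gb or 4 Gb die, 4-high or 8-high stacks, 4 stacks).

```python
def hbm_capacity_gb(die_gbit, dies_per_stack, stacks):
    """Total capacity in gigabytes (GB) for a set of HBM stacks.

    die_gbit is the capacity of one DRAM die in gigabits (Gb);
    all names here are hypothetical, chosen only for illustration.
    """
    total_gbit = die_gbit * dies_per_stack * stacks
    return total_gbit / 8  # 8 bits per byte

# Fury-style layout as described above: 2 Gb die, 4-high, 4 stacks
current = hbm_capacity_gb(2, 4, 4)
# Option 1: double the per-die capacity to 4 Gb
denser_die = hbm_capacity_gb(4, 4, 4)
# Option 2: keep 2 Gb die but stack 8 high
taller_stack = hbm_capacity_gb(2, 8, 4)

print(current, denser_die, taller_stack)
```

Either option on its own would double the total from 4 GB to 8 GB, which matches the "2 GB, 4 high stacks for a capacity of 8 GB" figure in the comment.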

March 16, 2016 | 12:03 PM - Posted by Anonymous (not verified)

There is a reason the jump to HBM2 is going to happen fast. HBM1, while interesting, has too many limitations in its current form.

HBM2+ is where the focus is going, not making HBM1 better.

March 16, 2016 | 01:51 PM - Posted by Anonymous (not verified)

You are not getting what I am saying. This would be like if DDR4 came out in 4 GB modules first and everyone assumed that only 4 GB modules were possible. HBM1 can be made in other configurations without re-inventing the wheel. It looks like they can increase the number of die in the stack, increase the capacity of the die, just use more stacks, or some combination of these. I don't know what they are doing, but they obviously need more than 4 GB. It would be great if they could go up to 6 stacks and also double the capacity. That would allow for 6 or 12 GB of capacity while also increasing bandwidth up to 768 GB/s. HBM2 is interesting, but probably only if you are in the market for a $1500 graphics card. You will pay a bleeding edge price for it. HBM1 based parts could be a lot more reasonable.
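
As a rough check on the 768 GB/s figure, here is a minimal sketch of the bandwidth calculation. It assumes first-generation HBM numbers: a 1024-bit interface per stack and a 500 MHz clock with two transfers per clock (double data rate). The function and parameter names are hypothetical.

```python
def hbm_bandwidth_gbs(stacks, clock_mhz=500, bus_bits=1024, transfers_per_clock=2):
    """Aggregate bandwidth in GB/s for a number of HBM stacks.

    Defaults are assumed first-generation HBM values, not taken
    from the thread: 1024-bit bus per stack, 500 MHz, DDR signalling.
    """
    megabits_per_second = stacks * bus_bits * clock_mhz * transfers_per_clock
    megabytes_per_second = megabits_per_second / 8
    return megabytes_per_second / 1000  # MB/s -> GB/s

four_stacks = hbm_bandwidth_gbs(4)  # the Fury-style layout
six_stacks = hbm_bandwidth_gbs(6)   # the hypothetical 6-stack layout

print(four_stacks, six_stacks)
```

Under those assumptions each stack contributes 128 GB/s, so four stacks give 512 GB/s and six stacks give the 768 GB/s mentioned above.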

March 16, 2016 | 12:57 PM - Posted by rl (not verified)

HBM2 is a marketing term, established by hynix to differentiate between first gen production capabilities and the process upgrade which helps them increase capacity in the DIMM itself and the stack. They change the process node from 2xnm("HBM1") to 2znm(HBM2). Don't ask me why they chose that nomenclature.

The actual spec does not differentiate between 1 and 2, it already addresses higher capacities.

And while there are possibilities to increase the VRAM, they are simply not viable. More stacks means more I/O, means more PHYs, blowing up the GPU increasing die size and therefore cost.

March 16, 2016 | 02:21 PM - Posted by Anonymous (not verified)

HBM2 adds new features; it is not the same as HBM1.

See the Anandtech write-up here:

http://www.anandtech.com/show/9969/jedec-publishes-hbm2-specification

It increases the clock speed, adds pseudo-channel mode, and increases the allowed footprint of the package. They may be able to go up to 4 Gb per die with the HBM1 footprint, but 8 Gb will be too large with current process tech. Samsung's HBM2 die are more than twice the size of HBM1 die, but they have 4x the capacity.

The memory controller for HBM is a relatively simple interface, made possible by short traces (a few mm) and low clock speeds (only 500 MHz). This was one of the design goals of HBM, since high speed GDDR5 interfaces were taking too much die area and too much power. The die area needed to add another channel or two, especially at 14 nm, is probably minimal.

Increasing the capacity per die from 2 Gb to 4 Gb is probably doable. They are obviously already making 8 Gb die for HBM2, or at least Samsung is. It is also possible to increase the number of die in a stack up to 8. This doesn't require any other changes as long as the current die support 8 layer TSV routing channels, which they probably do. If more bandwidth is required, then I suspect they could increase the clock speed slightly also. There is no reason that it would be strictly limited to 500 MHz.

If SK Hynix has improved HBM1 devices available now, then there really isn't much need to go HBM2 yet.

May 18, 2016 | 03:26 PM - Posted by M (not verified)

IF they cannot route the "wiring" for more per stack, if the interposer was built to only handle what it is currently handling, if the memory controller CANNOT handle a higher speed, there is for sure a reason why they released it as they did. Maybe driving it at, say, 600 MHz instead causes it to have way more errors, or power/temp to skyrocket. We do not know, we can only guess. I am sure the engineers that make these things test the crap out of them, and they are sold the way they are for many reasons. Economies of scale very much apply. Maybe that .1v needed to drive the higher speed with few errors results in a massive increase in failure rates etc. We simply do not know, and I am sure Anandtech know many things, but are they directly making these things? I doubt it :)

May 20, 2016 | 05:33 PM - Posted by Anonymous (not verified)

The next gen memory can be Tezzaron DiRAM4 too. True 3d, up to 8tb/s bandwidth, 64gb density and 9ns latency.
