Subject: Memory | August 25, 2016 - 02:39 AM | Tim Verry
Tagged: TSV, SK Hynix, Samsung, hot chips, hbm3, hbm
Samsung and SK Hynix were in attendance at the Hot Chips Symposium in Cupertino, California to (among other things) talk about the future of High Bandwidth Memory (HBM). In fact, the companies are working on two new HBM products: HBM3 and an as-yet-unbranded "low cost HBM." HBM3 will replace HBM2 at the high end and is aimed at the HPC and "prosumer" markets while the low cost HBM technology lowers the barrier to entry and is intended to be used in mainstream consumer products.
As currently planned, HBM3 (Samsung refers to its implementation as Extreme HBM) features double the density per layer and at least double the bandwidth of the current HBM2 (which so far is only used in NVIDIA's planned Tesla P100). Specifically, the new memory technology offers up 16Gb (~2GB) per layer and as many as eight (or more) layers can be stacked together using TSVs into a single chip. So far we have seen GPUs use four HBM chips on a single package, and if that holds true with HBM3 and interposer size limits, we may well see future graphics cards with 64GB of memory! Considering the HBM2-based Tesla will have 16 and AMD's HBM-based Fury X cards had 4GB, HBM3 is a sizable jump!
Capacity is not the only benefit though. HBM3 doubles the bandwidth versus HBM2 with 512GB/s (or more) of peak bandwidth per stack! In the theoretical example of a graphics card with 64GB of HBM3 (four stacks), that would be in the range of 2 TB/s of theoretical maximum peak bandwidth! Real world may be less, but still that is many terabytes per second of bandwidth which is exciting because it opens a lot of possibilities for gaming especially as developers push graphics further towards photo realism and resolutions keep increasing. HBM3 should be plenty for awhile as far as keeping the GPU fed with data on the consumer and gaming side of things though I'm sure the HPC market will still crave more bandwidth.
Samsung further claims that HBM3 will operate at similar (~500MHz) clocks to HBM2, but will use "much less" core voltage (HBM2 is 1.2V).
Stacked HBM memory on an interposer surrounding a processor. Upcoming HBM technologies will allow memory stacks with double the number of layers.
HBM3 is perhaps the most interesting technologically; however, the "low cost HBM" is exciting in that it will enable HBM to be used in the systems and graphics cards most people purchase. There were less details available on this new lower cost variant, but Samsung did share a few specifics. The low cost HBM will offer up to 200GB/s per stack of peak bandwidth while being much cheaper to produce than current HBM2. In order to reduce the cost of production, their is no buffer die or ECC support and the number of Through Silicon Vias (TSV) connections have been reduced. In order to compensate for the lower number of TSVs, the pin speed has been increased to 3Gbps (versus 2Gbps on HBM2). Interestingly, Samsung would like for low cost HBM to support traditional silicon as well as potentially cheaper organic interposers. According to NVIDIA, TSV formation is the most expensive part of interposer fabrication, so making reductions there (and somewhat making up for it in increased per-connection speeds) makes sense when it comes to a cost-conscious product. It is unclear whether organic interposers will win out here, but it is nice to seem them get a mention and is an alternative worth looking into.
Both high bandwidth and low latency memory technologies are still years away and the designs are subject to change, but so far they are both plans are looking rather promising. I am intrigued by the possibilities and hope to see new products take advantage of the increased performance (and in the latter case lower cost). On the graphics front, HBM3 is way too far out to see a Vega release, but it may come just in time for AMD to incorporate it into its high end Navi GPUs, and by 2020 the battle between GDDR and HBM in the mainstream should be heating up.
What are your thoughts on the proposed HBM technologies?
Subject: Graphics Cards | April 5, 2016 - 02:13 AM | Tim Verry
Tagged: HPC, hbm, gpgpu, firepro s9300x2, firepro, dual fiji, deep learning, big data, amd
Earlier this month AMD launched a dual Fiji powerhouse for VR gamers it is calling the Radeon Pro Duo. Now, AMD is bringing its latest GCN architecture and HBM memory to servers with the dual GPU FirePro S9300 x2.
The new server-bound professional graphics card packs an impressive amount of computing hardware into a dual-slot card with passive cooling. The FirePro S9300 x2 combines two full Fiji GPUs clocked at 850 MHz for a total of 8,192 cores, 512 TUs, and 128 ROPs. Each GPU is paired with 4GB of non-ECC HBM memory on package with 512GB/s of memory bandwidth which AMD combines to advertise this as the first professional graphics card with 1TB/s of memory bandwidth.
Due to lower clockspeeds the S9300 x2 has less peak single precision compute performance versus the consumer Radeon Pro Duo at 13.9 TFLOPS versus 16 TFLOPs on the desktop card. Businesses will be able to cram more cards into their rack mounted servers though since they do not need to worry about mounting locations for the sealed loop water cooling of the Radeon card.
|FirePro S9300 x2||Radeon Pro Duo||R9 Fury X||FirePro S9170|
|GPU||Dual Fiji||Dual Fiji||Fiji||Hawaii|
|GPU Cores||8192 (2 x 4096)||8192 (2 x 4096)||4096||2816|
|Rated Clock||850 MHz||1050 MHz||1050 MHz||930 MHz|
|Texture Units||2 x 256||2 x 256||256||176|
|ROP Units||2 x 64||2 x 64||64||64|
|Memory||8GB (2 x 4GB)||8GB (2 x 4GB)||4GB||32GB ECC|
|Memory Clock||500 MHz||500 MHz||500 MHz||5000 MHz|
|Memory Interface||4096-bit (HBM) per GPU||4096-bit (HBM) per GPU||4096-bit (HBM)||512-bit|
|Memory Bandwidth||1TB/s (2 x 512GB/s)||1TB/s (2 x 512GB/s)||512 GB/s||320 GB/s|
|TDP||300 watts||?||275 watts||275 watts|
|Peak Compute||13.9 TFLOPS||16 TFLOPS||8.60 TFLOPS||5.24 TFLOPS|
AMD is aiming this card at datacenter and HPC users working on "big data" tasks that do not require the accuracy of double precision floating point calculations. Deep learning tasks, seismic processing, and data analytics are all examples AMD says the dual GPU card will excel at. These are all tasks that can be greatly accelerated by the massive parallel nature of a GPU but do not need to be as precise as stricter mathematics, modeling, and simulation work that depend on FP64 performance. In that respect, the FirePro S9300 x2 has only 870 GLFOPS of double precision compute performance.
Further, this card supports a GPGPU optimized Linux driver stack called GPUOpen and developers can program for it using either OpenCL (it supports OpenCL 1.2) or C++. AMD PowerTune, and the return of FP16 support are also features. AMD claims that its new dual GPU card is twice as fast as the NVIDIA Tesla M40 (1.6x the K80) and 12 times as fast as the latest Intel Xeon E5 in peak single precision floating point performance.
The double slot card is powered by two PCI-E power connectors and is rated at 300 watts. This is a bit more palatable than the triple 8-pin needed for the Radeon Pro Duo!
The FirePro S9300 x2 comes with a 3 year warranty and will be available in the second half of this year for $6000 USD. You are definitely paying a premium for the professional certifications and support. Here's hoping developers come up with some cool uses for the dual 8.9 Billion transistor GPUs and their included HBM memory!
Some Hints as to What Comes Next
On March 14 at the Capsaicin event at GDC AMD disclosed their roadmap for GPU architectures through 2018. There were two new names in attendance as well as some hints at what technology will be implemented in these products. It was only one slide, but some interesting information can be inferred from what we have seen and what was said in the event and afterwards during interviews.
Polaris the the next generation of GCN products from AMD that have been shown off for the past few months. Previously in December and at CES we saw the Polaris 11 GPU on display. Very little is known about this product except that it is small and extremely power efficient. Last night we saw the Polaris 10 being run and we only know that it is competitive with current mainstream performance and is larger than the Polaris 11. These products are purportedly based on Samsung/GLOBALFOUNDRIES 14nm LPP.
The source of near endless speculation online.
In the slide AMD showed it listed Polaris as having 2.5X the performance per watt over the previous 28 nm products in AMD’s lineup. This is impressive, but not terribly surprising. AMD and NVIDIA both skipped the 20 nm planar node because it just did not offer up the type of performance and scaling to make sense economically. Simply put, the expense was not worth the results in terms of die size improvements and more importantly power scaling. 20 nm planar just could not offer the type of performance overall that GPU manufacturers could achieve with 2nd and 3rd generation 28nm processes.
What was missing from the slide is mention that Polaris will integrate either HMB1 or HBM2. Vega, the architecture after Polaris, does in fact list HBM2 as the memory technology it will be packaged with. It promises another tick up in terms of performance per watt, but that is going to come more from aggressive design optimizations and likely improvements on FinFET process technologies. Vega will be a 2017 product.
Beyond that we see Navi. It again boasts an improvement in perf per watt as well as the inclusion of a new memory technology behind HBM. Current conjecture is that this could be HMC (hybrid memory cube). I am not entirely certain of that particular conjecture as it does not necessarily improve upon the advantages of current generation HBM and upcoming HBM2 implementations. Navi will not show up until 2018 at the earliest. This *could* be a 10 nm part, but considering the struggle that the industry has had getting to 14/16nm FinFET I am not holding my breath.
AMD provided few details about these products other than what we see here. From here on out is conjecture based upon industry trends, analysis of known roadmaps, and the limitations of the process and memory technologies that are already well known.
Subject: Graphics Cards | March 15, 2016 - 02:02 AM | Ryan Shrout
Tagged: vulkan, raja koduri, Polaris, HBM2, hbm, dx12, crossfire, amd
After hosting the AMD Capsaicin event at GDC tonight, the SVP and Chief Architect of the Radeon Technologies Group Raja Koduri sat down with me to talk about the event and offered up some additional details on the Radeon Pro Duo, upcoming Polaris GPUs and more. The video below has the full interview but there are several highlights that stand out as noteworthy.
- Raja claimed that one of the reasons to launch the dual-Fiji card as the Radeon Pro Duo for developers rather than pure Radeon, aimed at gamers, was to “get past CrossFire.” He believes we are at an inflection point with APIs. Where previously you would abstract two GPUs to appear as a single to the game engine, with DX12 and Vulkan the problem is more complex than that as we have seen in testing with early titles like Ashes of the Singularity.
But with the dual-Fiji product mostly developed and prepared, AMD was able to find a market between the enthusiast and the creator to target, and thus the Radeon Pro branding was born.
Raja further expands on it, telling me that in order to make multi-GPU useful and productive for the next generation of APIs, getting multi-GPU hardware solutions in the hands of developers is crucial. He admitted that CrossFire in the past has had performance scaling concerns and compatibility issues, and that getting multi-GPU correct from the ground floor here is crucial.
- With changes in Moore’s Law and the realities of process technology and processor construction, multi-GPU is going to be more important for the entire product stack, not just the extreme enthusiast crowd. Why? Because realities are dictating that GPU vendors build smaller, more power efficient GPUs, and to scale performance overall, multi-GPU solutions need to be efficient and plentiful. The “economics of the smaller die” are much better for AMD (and we assume NVIDIA) and by 2017-2019, this is the reality and will be how graphics performance will scale.
Getting the software ecosystem going now is going to be crucial to ease into that standard.
- The naming scheme of Polaris (10, 11…) has no equation, it’s just “a sequence of numbers” and we should only expect it to increase going forward. The next Polaris chip will be bigger than 11, that’s the secret he gave us.
There have been concerns that AMD was only going to go for the mainstream gaming market with Polaris but Raja promised me and our readers that we “would be really really pleased.” We expect to see Polaris-based GPUs across the entire performance stack.
- AMD’s primary goal here is to get many millions of gamers VR-ready, though getting the enthusiasts “that last millisecond” is still a goal and it will happen from Radeon.
- No solid date on Polaris parts at all – I tried! (Other than the launches start in June.) Though Raja did promise that after tonight, he will only have his next alcoholic beverage until the launch of Polaris. Serious commitment!
- Curious about the HBM2 inclusion in Vega on the roadmap and what that means for Polaris? Though he didn’t say it outright, it appears that Polaris will be using HBM1, leaving me to wonder about the memory capacity limitations inherent in that. Has AMD found a way to get past the 4GB barrier? We are trying to figure that out for sure.
Why is Polaris going to use HBM1? Raja pointed towards the extreme cost and expense of building the HBM ecosystem prepping the pipeline for the new memory technology as the culprit and AMD obviously wants to recoup some of that cost with another generation of GPU usage.
Speaking with Raja is always interesting and the confidence and knowledge he showcases is still what gives me assurance that the Radeon Technologies Group is headed in the correct direction. This is going to be a very interesting year for graphics, PC gaming and for GPU technologies, as showcased throughout the Capsaicin event, and I think everyone should be looking forward do it.
Fighting for Relevance
AMD is still kicking. While the results of this past year have been forgettable, they have overcome some significant hurdles and look like they are improving their position in terms of cutting costs while extracting as much revenue as possible. There were plenty of ups and downs for this past quarter, but when compared to the rest of 2015 there were some solid steps forward here.
The company reported revenues of $958 million, which is down from $1.06 billion last quarter. The company also recorded a $103 million loss, but that is down significantly from the $197 million loss the quarter before. Q3 did have a $65 million write-down due to unsold inventory. Though the company made far less in revenues, they also shored up their losses. The company is still bleeding, but they still have plenty of cash on hand for the next several quarters to survive. When we talk about non-GAAP figures, AMD reports a $79 million loss for this past quarter.
For the entire year AMD recorded $3.99 billion in revenue with a net loss of $660 million. This is down from FY 2014 revenues of $5.51 billion and a net loss of $403 million. AMD certainly is trending downwards year over year, but they are hoping to reverse that come 2H 2016.
Graphics continues to be solid for AMD as they increased their sales from last quarter, but are down year on year. Holiday sales were brisk, but with only the high end Fury series being a new card during this season, the impact of that particular part was not as great as compared to the company having a new mid-range series like the newly introduced R9 380X. The second half of 2016 will see the introduction of the Polaris based GPUs for both mobile and desktop applications. Until then, AMD will continue to provide the current 28 nm lineup of GPUs to the market. At this point we are under the assumption that AMD and NVIDIA are looking at the same timeframe for introducing their next generation parts due to process technology advances. AMD already has working samples on Samsung’s/GLOBALFOUNDRIES 14nm LPP (low power plus) that they showed off at CES 2016.
Subject: Graphics Cards, Memory | January 19, 2016 - 11:01 PM | Scott Michaud
Tagged: Samsung, HBM2, hbm
Samsung has just announced that they have begun mass production of 4GB HBM2 memory modules. When used on GPUs, four packages can provide 16GB of Video RAM with very high performance. They do this with a very wide data bus, which trade off frequency for transferring huge chunks. Samsung's offering is rated at 256 GB/s per package, which is twice what the Fury X could do with HBM1.
They also expect to mass produce 8GB HBM2 packages within this calendar year. I'm guessing that this means we'll see 32GB GPUs in the late-2016 or early-2017 time frame unless "within this year" means very, very soon (versus Q3/Q4). They will likely be for workstation or professional cards, but, in NVIDIA's case, those are usually based on architectures that are marketed to high-end gaming enthusiasts through some Titan offering. There's a lot of ways this could go, but a 32GB Titan seems like a bit much; I wouldn't expect that this affects the enthusiast gamer segment. It might mean that professionals looking to upgrade from the Kepler-based Tesla K-series might be waiting a little longer, maybe even GTC 2017. Alternatively, they might get new cards, just with a 16GB maximum until a refresh next year. There's not enough information to know one way or the other, but it's something to think about when more of it starts rolling in.
Samsung's HBM2 are compatible with ECC, although I believe that was also true for at least some HBM1 modules from SK Hynix.
Subject: Graphics Cards | January 11, 2016 - 06:05 PM | Sebastian Peak
Tagged: rumor, report, pascal, nvidia, HBM2, hbm, GP104
A delivery of GPUs and related test equipment from Taiwan to Banglore has led to speculation about NVIDIA's upcoming GP104 Pascal GPU.
Image via Zauba.com
How much information can be gleaned from an import shipping manifest (linked here)? The data indicates a chip with a 37.5 x 37.5 mm package and 2152 pins, which is being attributed to the GP104 based on knowledge of “earlier, similar deliveries” (or possible inside information). This has prompted members of the 3dcenter.org forums (German language) to speculate on the use of GDDR5 or GDDR5X memory based on the likelihood of HBM being implemented on a die of this size.
Of course, NVIDIA has stated that Pascal will implement 3D memory, and the upcoming GP100 will reportedly be on a 55 x 55 mm package using HBM2. Could this be a new, lower-cost part using the existing GDDR5 standard or the faster GDDR5X instead? VideoCardz and WCCFtech have posted stories based on the 3DCenter report, and to quote directly from the VideoCardz post on the subject:
"3DCenter has a theory that GP104 could actually not use HBM, but GDDR5(X) instead. This would rather be a very strange decision, but could NVIDIA possibly make smaller GPU (than GM204) and still accommodate 4 HBM modules? This theory is not taken from the thin air. The GP100 aka the Big Pascal, would supposedly come in 55x55mm BGA package. That’s 10mm more than GM200, which were probably required for additional HBM modules. Of course those numbers are for the whole package (with interposer), not just the GPU."
All of this is a lot to take from a shipping record that might not even be related to an NVIDIA product, but the report has made the rounds at this point so now we’ll just have to wait for new information.
Subject: Graphics Cards | November 18, 2015 - 01:22 PM | Sebastian Peak
Tagged: Sapphire TriXX, R9 Fury X, overclocking, hbm, amd
The new version (5.2.1) of Sapphire's TriXX overclocking utility has been released, and it finally unlocks voltage and HBM overclocking for AMD's R9 Fury X.
(Image credit: Sapphire)
Previously the voltage of the R9 Fury X core was not adjustable, leaving what would seem to be quite a bit of untapped headroom for the cards which shipped with a powerful liquid-cooling solution rated for 500 watts of thermal dissipation. This should allow for much better results than what Ryan was able to achieve when he attempted overclocking for our review of the R9 Fury X in June (without the benefit of voltage adjustments):
"My net result: a clock speed of 1155 MHz rather than 1050 MHz, an increase of 10%. That's a decent overclock for a first attempt with a brand new card and new architecture, but from the way that AMD had built up the "500 watt cooler" and the "375 watts available power" from the dual 8-pin power connectors, I was honestly expecting quite a bit more. Hopefully we'll see some community adjustments, like voltage modifications, that we can mess around with later..."
(Image credit: Sapphire)
- You can download the new TriXX v5.2.1 software from the product page at Sapphire here.
Will TriXX v5.2.1 unleash the full potential of the Fury X? We will have to wait for some overclocked benchmark numbers, but having the ability can only be a good thing for enthusiasts.
GPU Enthusiasts Are Throwing a FET
NVIDIA is rumored to launch Pascal in early (~April-ish) 2016, although some are skeptical that it will even appear before the summer. The design was finalized months ago, and unconfirmed shipping information claims that chips are being stockpiled, which is typical when preparing to launch a product. It is expected to compete against AMD's rumored Arctic Islands architecture, which will, according to its also rumored numbers, be very similar to Pascal.
This architecture is a big one for several reasons.
Image Credit: WCCFTech
First, it will jump two full process nodes. Current desktop GPUs are manufactured at 28nm, which was first introduced with the GeForce GTX 680 all the way back in early 2012, but Pascal will be manufactured on TSMC's 16nm FinFET+ technology. Smaller features have several advantages, but a huge one for GPUs is the ability to fit more complex circuitry in the same die area. This means that you can include more copies of elements, such as shader cores, and do more in fixed-function hardware, like video encode and decode.
That said, we got a lot more life out of 28nm than we really should have. Chips like GM200 and Fiji are huge, relatively power-hungry, and complex, which is a terrible idea to produce when yields are low. I asked Josh Walrath, who is our go-to for analysis of fab processes, and he believes that FinFET+ is probably even more complicated today than 28nm was in the 2012 timeframe, which was when it launched for GPUs.
It's two full steps forward from where we started, but we've been tiptoeing since then.
Image Credit: WCCFTech
Second, Pascal will introduce HBM 2.0 to NVIDIA hardware. HBM 1.0 was introduced with AMD's Radeon Fury X, and it helped in numerous ways -- from smaller card size to a triple-digit percentage increase in memory bandwidth. The 980 Ti can talk to its memory at about 300GB/s, while Pascal is rumored to push that to 1TB/s. Capacity won't be sacrificed, either. The top-end card is expected to contain 16GB of global memory, which is twice what any console has. This means less streaming, higher resolution textures, and probably even left-over scratch space for the GPU to generate content in with compute shaders. Also, according to AMD, HBM is an easier architecture to communicate with than GDDR, which should mean a savings in die space that could be used for other things.
Third, the architecture includes native support for three levels of floating point precision. Maxwell, due to how limited 28nm was, saved on complexity by reducing 64-bit IEEE 754 decimal number performance to 1/32nd of 32-bit numbers, because FP64 values are rarely used in video games. This saved transistors, but was a huge, order-of-magnitude step back from the 1/3rd ratio found on the Kepler-based GK110. While it probably won't be back to the 1/2 ratio that was found in Fermi, Pascal should be much better suited for GPU compute.
Image Credit: WCCFTech
Mixed precision could help video games too, though. Remember how I said it supports three levels? The third one is 16-bit, which is half of the format that is commonly used in video games. Sometimes, that is sufficient. If so, Pascal is said to do these calculations at twice the rate of 32-bit. We'll need to see whether enough games (and other applications) are willing to drop down in precision to justify the die space that these dedicated circuits require, but it should double the performance of anything that does.
So basically, this generation should provide a massive jump in performance that enthusiasts have been waiting for. Increases in GPU memory bandwidth and the amount of features that can be printed into the die are two major bottlenecks for most modern games and GPU-accelerated software. We'll need to wait for benchmarks to see how the theoretical maps to practical, but it's a good sign.
Subject: Graphics Cards | September 16, 2015 - 09:16 AM | Sebastian Peak
Tagged: TSMC, Samsung, pascal, nvidia, hbm, graphics card, gpu
According to a report by BusinessKorea TSMC has been selected to produce the upcoming Pascal GPU after initially competing with Samsung for the contract.
Though some had considered the possibility of both Samsung and TSMC sharing production (albeit on two different process nodes, as Samsung is on 14 nm FinFET), in the end the duties fall on TSMC's 16 nm FinFET alone if this report is accurate. The move is not too surprising considering the longstanding position TSMC has maintained as a fab for GPU makers and Samsung's lack of experience in this area.
The report didn't make the release date for Pascal any more clear, naming it "next year" for the new HBM-powered GPU, which will also reportedly feature 16 GB of HBM 2 memory for the flagship version of the card. This would potentially be the first GPU released at 16 nm (unless AMD has something in the works before Pascal's release), as all current AMD and NVIDIA GPUs are manufactured at 28 nm.