Rumor: NVIDIA Pascal up to 17 Billion Transistors, 32GB HBM2

Subject: Graphics Cards | July 24, 2015 - 12:16 PM |
Tagged: rumor, pascal, nvidia, HBM2, hbm, graphics card, gpu

An exclusive report from Fudzilla claims some outlandish numbers for the upcoming NVIDIA Pascal GPU, including 17 billion transistors and a massive amount of second-gen HBM memory.

According to the report:

"Pascal is the successor to the Maxwell Titan X GM200 and we have been tipped off by some reliable sources that it will have more than a double the number of transistors. The huge increase comes from Pascal's 16 nm FinFET process and its transistor size is close to two times smaller."


The NVIDIA Pascal board (Image credit: Legit Reviews)

Pascal's 16nm FinFET production will be a major change from the existing 28nm process found on all current NVIDIA GPUs. And if this report is accurate, NVIDIA is taking full advantage, considering the rumored transistor count is more than double the 8 billion found in the TITAN X.


(Image credit: Fudzilla)

And what about memory? We have long known that Pascal will be NVIDIA's first foray into HBM, and Fudzilla is reporting that up to 32GB of second-gen HBM (HBM2) will be present on the highest model, which is a rather outrageous number even compared to the 12GB TITAN X.

"HBM2 enables cards with 4 HBM 2.0 cards with 4GB per chip, or four HBM 2.0 cards with 8GB per chips results with 16GB and 32GB respectively. Pascal has power to do both, depending on the SKU."
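The arithmetic behind the quote is straightforward: total VRAM is the number of stacks times the per-stack capacity. A minimal sketch, assuming four stacks as on AMD's Fiji:

```python
# Back-of-envelope HBM capacity math (assumes 4 stacks, as on AMD's Fiji).
def total_vram_gb(stacks: int, gb_per_stack: int) -> int:
    """Total VRAM is simply stacks times per-stack capacity."""
    return stacks * gb_per_stack

# HBM1: 4 dies x 256MB = 1GB per stack -> 4GB total (the Fury X limit).
print(total_vram_gb(4, 1))   # 4
# HBM2: 4GB or 8GB per stack -> the rumored 16GB and 32GB configurations.
print(total_vram_gb(4, 4))   # 16
print(total_vram_gb(4, 8))   # 32
```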

Pascal is expected in 2016, so we'll have plenty of time to speculate on these and doubtless other rumors to come.


Source: Fudzilla



July 24, 2015 | 12:50 PM - Posted by Chaitanya Shukla

Only a couple of my workstations have 32GB of system memory.

July 24, 2015 | 02:10 PM - Posted by brisa117

I haven't ever considered it, but is there some need to have system memory larger than, or at least equal to, the GPU memory? I wouldn't think there would be a reason. It's simply never been a problem before, but GPU memory capacity seems to be ballooning lately. I too have 32GB in my main video editing/gaming machine, but only 8 or 16 in my other handful of systems. And who can afford more than 16GB at current DDR4 prices (hopefully they'll come down with Intel's 100-series motherboards)?

I remember having 6GB of system RAM and 1GB of GDDR5 on my Radeon HD 4890. Haha. Good times.

July 24, 2015 | 02:47 PM - Posted by BillDStrong

Yes, there is.

Bandwidth and latency are the two limiting factors of computing. Latency is actually more important for making sure you are using every cycle, but we try to hide as much latency as possible through the use of excess bandwidth. This is why GPUs have between 100GB/s and the current 512GB/s of bandwidth. We can only hide the latency so much, however. And you can still create benchmarks that expose the latency, which means some developer somewhere who doesn't know better is hitting it and not realizing it.

You will notice that CPU bandwidth is an order of magnitude lower than GPU bandwidth. But every bit of information that originates on main storage, such as meshes and textures or data sets, has to go through main memory. Then SSDs are an order of magnitude slower than system memory, and HDDs are another order of magnitude slower still.

And CPU memory is servicing the entire system, not just the CPU and the GPU. It is still faster than the SSD and the HDD, so a good developer will use as much of it as possible to prevent needing to pull from the disk directly to the GPU.

Having more GPU memory than CPU memory means that for any application that would fill up the GPU's memory, you force the GPU to pull from the slowest resource, the SSD or, worse, the HDD. This is guaranteed to cause stutter and other issues in real-time use cases, and to slow down GPU compute tasks by several factors.

Of course, good devs try to minimize the data pushed to the GPU, which means they keep the data that doesn't change on the GPU, but you aren't always able to do that, either.

Also, the GPU uses its memory just like the CPU would, so while not all of the information stored on the GPU comes from the CPU's memory, the latency and bandwidth differences work out to keep the GPU mostly occupied, though almost never at 100%.

All of these issues are of course made much worse in multi-GPU setups. One of the reasons Nvidia is making this new link tech (NVLink) is to stop having to copy information back into main memory before sending it to the other cards. This is also why they will be able to gang up to 8 cards together for real-time games and compute tasks.

July 24, 2015 | 03:47 PM - Posted by Scott Michaud

There are a few semi-inaccuracies in this, but you're pretty much correct.

First point:

"Having more GPU memory than CPU memory means for any application that would fill up the GPUs memory, you force the GPU to pull from the slowest resource, the SSD or worse the HDD."

This is not exactly true. You're assuming that assets cannot simply reside in video memory and never swap out (except during a loading screen). You do state this later...

"Of course, good devs try to minimize the data pushed to the GPU, which means they keep the data that doesn't change on the GPU, but you aren't always able to do that, either."

... but you don't really address its limitations. Namely, "but you aren't always able to do that, either" gets less and less relevant as video memory increases and static assets can safely remain in place.

First, video memory could continue on its overkill path such that there's always super-abundance.

Second, and this might become increasingly important, GPU compute might itself need large chunks of video memory as scratch space or storage for intermediate assets. Imagine that they are performing repetitive calculations on a relatively small input data set. It might make sense to run a kernel that loads this data into local memory, pre-sorts it, and dumps it back to video memory for a second kernel to process it. This extra step optimizes for massive bandwidth and hides latency by doing it on massive, sequential chunks of data at a time. This could require a lot of video memory for content that started and ended on the GPU. System memory was never used (except to trigger the GPU to generate content).
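The two-kernel pattern described above can be sketched on the CPU side. This is a hypothetical analogue (the function names are mine, not any GPU API): pass 1 pre-sorts chunks of a small input set, standing in for workgroups sorting in local memory and dumping back to video memory; pass 2 then streams through the sorted chunks sequentially, the access pattern that big-bandwidth memory favors.

```python
# Hypothetical CPU-side sketch of the two-pass GPU-compute pattern.
def presort_kernel(data, chunk_size):
    """Pass 1: sort each chunk independently (what workgroups would do in
    local memory) and write the results back out as intermediate data."""
    return [sorted(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def process_kernel(chunks):
    """Pass 2: consume the pre-sorted chunks in large sequential reads;
    taking each chunk's minimum is just a stand-in workload."""
    return [chunk[0] for chunk in chunks]

data = [9, 3, 7, 1, 8, 2, 6, 4]
chunks = presort_kernel(data, chunk_size=4)  # intermediate data stays "on the GPU"
print(process_kernel(chunks))                # [1, 2]
```

The point of the sketch is that the intermediate `chunks` never leave "video memory"; system memory was only involved in kicking the work off.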

But yeah, the situation you're describing is what often happens in practice AFAIK.

Second point:

"Bandwidth and latency are the two limiting factors of computing. Latency is actually more important for making sure you are using every cycle, but we try to hide as much latency as possible through the use of excess bandwidth. This is why GPUs have between 100Gbs - the current 512Gbs bandwidth. We can only hide the latency so much, however. And you can create benchmarks that show the latency still, which means some developer somewhere that doesn't know better is hitting it and not realizing it."

GPUs tend to hide latency through lots of bandwidth and high parallelism. They do some intense tricks, too. A workgroup will be ripped mid-execution and parked while a global memory access is queued, letting the compute unit do something completely different in the meantime. Of course, latency is important, but you could make the same argument in the CPU world with frequency vs IPC. A higher frequency reduces the lag between neighbouring operations, but your overall work is greater if you can take advantage of vectorization or otherwise do several things at once.

You could make benchmarks that show clock rate dependence too, but a Core 2 will be faster than a Pentium 4 in practice.

July 25, 2015 | 03:08 AM - Posted by Anonymous (not verified)

Does this increased amount of VRAM mean that entire games' assets can be RAMdisc'd onto the GPU, and will that solve any latency problems? Basic user here, but digging the dialectic between you two.

July 26, 2015 | 12:51 AM - Posted by Scott Michaud

That depends on what you mean by "latency". There are many different types of latency. If you're talking about the stutter and pop-in as a game loads more assets, then adding more VRAM would help if the developers take advantage of it.

Everything is a tool that developers can exploit. Granted, most of the things that I've done from scratch are based on GPU-compute. I don't have much background in developing 3D engines with DirectX or OpenGL. Just a little bit.

July 25, 2015 | 08:44 PM - Posted by BlackDove (not verified)

A dual socket workstation can have 1TB of RAM now.

July 24, 2015 | 01:21 PM - Posted by KC (not verified)

Hmm.. Do I still buy a 980 Ti or do I wait to see if this rumour becomes reality..

July 24, 2015 | 01:31 PM - Posted by hans meiser (not verified)

Pascal will not come cheap and I doubt it will come early 2016. AMD got nothing on the 980Ti so why rush?

July 24, 2015 | 02:38 PM - Posted by Anonymous (not verified)

Except that HBM2 is a drop-in replacement for HBM1 on AMD's Fury parts, so even before Greenland arrives, AMD could get a Fury X revision to market with 8 or more gigs of HBM memory. The interposer work is already done for AMD: it's the bottom chip in the HBM stack that hosts the control logic for the die stacks above, and the interposer memory traces are not increasing for HBM2. A Fury X revision may be available just as the HBM2 stacks arrive hot off that final assembly line. That plus some tweaks, process refinements, and some more overclocking could put Fury over the Ti's performance metrics. AMD has exclusivity on that line's process and its HBM; Nvidia has to take the available-to-all standards and make a line of its own, outside of AMD's line that has AMD's name on it!

July 25, 2015 | 02:44 AM - Posted by Anonymous (not verified)

And? HBM2 will do absolutely nothing for the Fury X, as it's not even remotely bandwidth constrained with HBM1.

July 25, 2015 | 03:32 PM - Posted by bpwnes (not verified)

Actually, the problem with HBM1 is that in AMD's current implementation it has a maximum of 4GB.

"For HBM1, the maximum capacity of a HBM stack is 1GB, which in turn is made possible through the use of 4 256MB (2Gb) memory dies. With a 1GB/stack limit, this means that AMD can only equip the R9 Fury X and its siblings with 4GB of VRAM when using 4 stacks."

Source: http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/7

The real question is how long 4GB of VRAM will be enough with current and near future console ports.

July 27, 2015 | 11:40 AM - Posted by Anonymous (not verified)

Right, but that's just in reference to the lifespan of the Fury X, not its performance limitations. It currently has the highest available memory bandwidth yet is still not the best-performing card. The other components on the card are holding that bandwidth back.

You're talking about the limitations of HBM, the guy before you was talking about the limiting components of the Fury X. It seems having even faster memory/higher bandwidth on the Fury X wouldn't make it perform better, since the fastest graphics card outperforms it with less memory bandwidth.

July 24, 2015 | 01:41 PM - Posted by Anonymous (not verified)

Will probably have cooling issues and who knows on the drivers.

July 24, 2015 | 02:06 PM - Posted by Anonymous (not verified)

Actually, wouldn't cooling be even easier and better, since you'd have one solid copper block covering both the VRAM and the GPU because of their close proximity?

Seems like it's just another added benefit of HBM unless I'm missing something like how that's perhaps more inefficient than individually cooling each VRAM module or using a separate block on the VRAM.

July 24, 2015 | 05:20 PM - Posted by heavy (not verified)

That's so true. Sorry, but I want to add something I'm kind of mad about: I've been looking for a full waterblock cover for the Fury X, thinking it was going to be cheaper since they could use less copper (everything is packed right there), but no, a full block for a simpler and smaller card is around 100 bucks, same as all the other larger and more complicated cards. Why?! Sorry about the rant, it's over.

July 24, 2015 | 01:51 PM - Posted by Anonymous (not verified)

Get.
In.
My.
Computer.

July 24, 2015 | 01:57 PM - Posted by Tooch (not verified)

I think the 17B transistor figure in this case includes the HBM, which in Fiji's 4GB was roughly 1.2B transistors. So with Pascal having 16GB of HBM, that is about 4.8B, which would put the actual Pascal chip at just over 12B, or 50% bigger than the Titan X. If the reports about 16nm being 2x as dense are right, this would put the Pascal chip at 75% the size of the Titan X, or 400 to 450mm², which is probably the highest you'd want to go on a new process node.
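That estimate can be reproduced in a few lines (these are the commenter's assumed figures, not official numbers):

```python
# Back-of-envelope die-size estimate from the rumored figures (unofficial).
rumored_total_b = 17.0          # rumored transistor count, in billions
hbm_b_per_4gb   = 1.2           # assumed transistors per 4GB of HBM
hbm_gb          = 16

hbm_b = hbm_b_per_4gb * (hbm_gb / 4)       # ~4.8B in the memory stacks
gpu_b = rumored_total_b - hbm_b            # ~12.2B for the GPU itself
titan_x_b, titan_x_mm2 = 8.0, 601          # GM200 (Titan X) for reference

# At 2x density, die area scales with transistor count / 2 relative to GM200.
die_mm2 = titan_x_mm2 * (gpu_b / titan_x_b) / 2
print(round(gpu_b, 1), round(die_mm2))     # 12.2 458
```

The ~458mm² result lands right at the top of the 400-450mm² range the comment gives.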

July 24, 2015 | 04:42 PM - Posted by Anonymous (not verified)

I don't think they will include the transistors on the HBM in the count; you never see Fiji include them either when AMD claims 8.9 billion transistors (instead of ~12 billion with HBM included).

Take note that 16FF has 2x the density of 28nm, while 16FF+ is even denser than 16FF, which effectively makes it more than a 2x shrink (if Pascal is really using 16FF+).

That would make it approximately 600mm², the same as the Titan X.

July 24, 2015 | 05:16 PM - Posted by Tooch (not verified)

First, during the initial announcement of Fiji they did state that the total interposer had over 10 billion transistors, 8.9 of which were the actual chip. Second, this is a rumor, and throwing out the higher number sounds better. I agree that scaling may be better than 2x, but I highly doubt that Nvidia will do another Fermi and try to make a large die on a new process node again. Thus somewhere in the lower part of the 400 to 500mm² range is where I see the first tape-out maxing out. Historically no chips go bigger than that on a new node (except Fermi). I could see hitting 18 to 20B once the node has matured. Max for Pascal is probably 12 to 14B; the rest is HBM2.

July 24, 2015 | 02:03 PM - Posted by Anonymous (not verified)

Rumors are saying that the 32GB HBM 2.0 models will be targeted at the professional market, not consumers, even at the $1,000 Titan range. 16GB is more than enough for 4K; multiple sites proved with the Titan X that even 6GB is currently more than enough for a single 4K screen.

By the time 16GB is even needed, I'm sure HBM 3.0 will be out, doubling capacities once more. Still, it's sexy to think of what HBM will do. It'll finally put a nail in the coffin regarding bus width, bandwidth, and memory capacity at the same time.

July 24, 2015 | 03:09 PM - Posted by BillDStrong

Keep in mind they do need to put up a card to compete with AMD's professional solution that already has 32GB. And games built for today don't currently use 8GB. They still use 1-2K textures, with a few 4K textures for terrain. A very few have 8K textures that are basically texture atlases, used to save a few draw calls.

AMD and Nvidia have both supported 8K textures for several generations, I have a laptop that supports 16K textures, and 32K textures will be supported very soon. Even Sandy Bridge graphics support 8K textures.

When game devs really start to take advantage of this, you will see a world of difference in games. They will either create 16K/32K atlas textures, reducing draw calls even further, improve the visual quality of their games, or do some combination of the two.

July 24, 2015 | 02:18 PM - Posted by Anonymous (not verified)

Specialized HPC accelerator parts, and maybe workstation parts, but those total transistor numbers need to be vetted, lest they include some non-logic, non-cache memory that is not integrated into a single die. I can see the potential of maybe getting some special-order parts that are 2 GPUs on one interposer, but not by 2016 for Nvidia.

Getting supplies of HBM is going to be a problem for Nvidia, as AMD has co-developer lock-in on the working HBM supply for the next few years. The HBM standard may be available for any and all to develop their own HBM on whatever process they can create, but the actual working HBM based on that standard is locked up by contract between AMD and its development partner, and every bit of HBM coming from that line has AMD's name on it.

HBM as a standard is just like PCI as a standard: anyone can invest the time and money to develop their own brand of controller chips based on the standard and sell it under exclusive arrangements. AMD has done just that with HBM, and now reaps the reward of being the first to build a manufacturing process around the HBM standard. That process is what AMD and its HBM partner have the lock-in on, since they invested not only in creating the standard but in making a working product based on it. Nvidia is going to have to find an HBM partner that can give it more than what AMD's development partner has the option of selling to Nvidia after AMD's needs are met.

Once there are a number of HBM suppliers on the market not tied to any process exclusivity agreements, things will be different. But look at how many years AMD and its partner spent designing the standard, getting it ratified, and making a working, marketable product based on it. Nvidia is behind the curve on HBM and may have to reinvent that process wheel to get sufficient supplies for all of its product lines. It's welcome to the standard, but standards are just written documents and IP usage rights; the actual working process to produce HBM compatible with those standards is another matter!

July 24, 2015 | 02:27 PM - Posted by JohnGR

From Wiki

GT200 (Tesla): 1,400,000,000 transistors, 65 nm, 576 mm², 2008

GF110 (Fermi): 3,000,000,000 transistors, 40 nm, 520 mm², Nov 2010

GM200 (Maxwell): 8,100,000,000 transistors, 28 nm, 601 mm², 2015

Going to 17 billion with 14-16nm tech is just simple maths. As for the 32GB, it's old news. It was mentioned in the past that Nvidia was going to implement that much VRAM, and of course we already have GDDR5 cards with that much memory.
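As a sketch of that simple maths, scaling GM200's transistor budget by the roughly 2x density gain expected from 28nm to 16nm FinFET, at a similar die size, already lands in the same ballpark (illustrative numbers only):

```python
# Rough scaling sketch using the wiki figures quoted above (illustrative only).
gm200_transistors = 8.1e9   # GM200 Maxwell, 28nm, 601 mm²
density_gain      = 2.0     # approximate 28nm -> 16nm FinFET density gain

# Same die area at roughly double the density doubles the transistor budget.
pascal_estimate = gm200_transistors * density_gain
print(round(pascal_estimate / 1e9, 1))   # 16.2 (billions), close to the rumored 17
```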

July 24, 2015 | 05:21 PM - Posted by heavy (not verified)

I would take less HBM if it means a cheaper card. I don't need all 32GB.

July 24, 2015 | 05:34 PM - Posted by Anonymous (not verified)

Making HBM and integrating it next to the GPU is no walk in the park; very complicated processes are involved:
http://www.3dincites.com/2015/07/at-amd-die-stacking-hits-the-big-time/
http://www.3dincites.com/2014/03/road-3d-ics-paved-3d-soc/
“This project required collaboration with 20 different companies and government organizations before delivering the final product,” said Black. “We worked with many partners along the way. Some disappeared, some delivered. Some we’re still working with.” The final product integrates graphics chips from TSMC, OSAT partners ASE and Amkor; HBM was procured from SK Hynix, and UMC provided the interposer.
“This technology will benefit the semiconductor industry for a long time to come.“ added Cheung.

July 24, 2015 | 06:23 PM - Posted by Anonymous (not verified)

NVDA will have to pay a lot of licensing fees for process IP to HBM partners (they have been working for the last 5 years to invent and enhance the processes). This Pascal GPU will be for the professional graphics and HPC markets first, in order to compete with AMD. Consumers may have to pay > $2000 when it becomes available.
The gaming market may not need this powerful beast, as DX12 can turn mid-range GPU cards into very high-end ones (we will see in the next few months).
If you agree that Windows 10 will kill PC upgrades, then DX12 will do the same to GPUs too.

July 25, 2015 | 03:19 AM - Posted by renz (not verified)

So was Mantle able to make the 270X about as fast as a 290X in DX11?

July 25, 2015 | 04:04 PM - Posted by fkr

Well, depending on how good the CPU is and the game, the gains are usually an 8-40% increase in performance. So a 270 will not be as good as a 290, as that is a huge jump, but it will outperform a 280X.

There is a lot more to this than just the GPU, as DirectX 12 should take the CPU bottleneck out of the equation.

http://www.gamersnexus.net/guides/1885-dx12-v-mantle-v-dx11-benchmark

July 26, 2015 | 11:44 PM - Posted by Anonymous (not verified)

HBM is not IP proprietary to AMD or SK Hynix; it's a JEDEC open standard, the same as DDR and GDDR. So Nvidia will get another company to make HBM. Samsung would be the best option, because AMD will have Hynix pretty busy, and no one wants Micron RAM.

July 26, 2015 | 11:56 PM - Posted by BlackDovef (not verified)

Intel is using Micron HMC in Knights Landing, so idk why you'd say that lol.

July 27, 2015 | 11:56 AM - Posted by Anonymous (not verified)

Micron developed Hybrid Memory Cube. I guess it's supposed to replace DDR4, but possibly also GDDR5? The Wikipedia page is a bit confusing, and other sources are written for people much more knowledgeable than I am.

July 28, 2015 | 02:04 AM - Posted by BlackDove (not verified)

Micron developed HMC. It is designed for CPUs as an extremely high-bandwidth form of RAM. I don't know if it can easily substitute for GDDR SGRAM.

It can be used on- or off-package, as in Knights Landing, which uses Intel's EMIB packaging (different from an interposer).

July 25, 2015 | 02:13 AM - Posted by Dallas (not verified)

32GB... what am I going to power, the first 12K screen in the world? Or the HUUGE screen in the Dallas Metrodome?

The F***... 32GB... Jesus, that's improbable.

July 25, 2015 | 02:34 AM - Posted by Anonymous (not verified)

Interesting. All previous information I've seen up to now stated that the max capacity of HBM2 would be 16GB, not 32GB.

July 25, 2015 | 02:47 AM - Posted by Anonymous (not verified)

Where is that info? HBM2 was always listed as being up to 32GB in everything I remember seeing.

Some of the comments here are silly, as of course the regular desktop cards will be either 8 or 16GB, with professional cards offering 32.

July 25, 2015 | 03:35 AM - Posted by Anonymous (not verified)

Only $1999, and 32GB is yours.

July 25, 2015 | 07:21 AM - Posted by JohnGR

If AMD is alive at that time, probably. If not, then those $1999 are for the Titan version with 16GB and 1/128 FP64

July 26, 2015 | 10:43 AM - Posted by BlackDove (not verified)

GP100 will have excellent FP64 since it's going into pre-exascale systems. It's also competing with Knights Landing, which is 3 TFLOPS DP and 6 TFLOPS SP with about 500GB/s of memory bandwidth.

July 25, 2015 | 06:20 PM - Posted by DaveSimonH

A single-GPU card with 32GB seems a little nonsensical, especially at current HBM pricing. But the next $1000 Titan sporting 16GB of HBM, and their next dual-GPU card with 32GB of HBM, seem plausible.
Then the 980/980 Ti successors with 8GB HBM to rival the Fury/Fury X replacements (or refinements, a la 290X to 390X).

July 25, 2015 | 08:57 PM - Posted by BlackDove (not verified)

32GB is the MAXIMUM POSSIBLE AMOUNT. With a 384-bit bus, GDDR5 could go UP TO 12GB.

When GDDR5 came out, no cards had 12GB. Now they do, because they need that much and there are users who need more.

This GPU is also competing with Knights Landing; both will be going into pre-exascale supercomputers, and Knights Landing has 16GB of HMC and up to 384GB of DDR4 PER CHIP.
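The 12GB ceiling on a 384-bit bus falls out of channel arithmetic (a sketch, assuming 1GB of GDDR5 per 32-bit channel, the Titan X-era configuration):

```python
# Why a 384-bit GDDR5 bus topped out around 12GB (sketch; assumes 1GB of
# GDDR5 per 32-bit channel, as on the Titan X).
bus_width_bits   = 384
bits_per_channel = 32
gb_per_channel   = 1

channels = bus_width_bits // bits_per_channel   # 12 channels
print(channels * gb_per_channel)                # 12 (GB)
```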

July 27, 2015 | 12:01 PM - Posted by Anonymous (not verified)

Isn't Knights Landing a server CPU? How is a GPU... oh wait, now I get it.

July 28, 2015 | 01:24 AM - Posted by BlackDove (not verified)

GPUs and CPUs are both used in supercomputers and have been for years. Knights Landing is hardly a conventional server CPU. 72 Atom cores with AVX-512 units, 16GB of on-package HMC, and FP64 performance similar to a GP100 GPU make Knights Landing very different from a normal E5 or E7 Xeon.

July 29, 2015 | 08:25 AM - Posted by Edmond (not verified)

Don't worry, kids! You'll still get super-overpriced marginal upgrades throughout the entire lifetime of 16nm.

July 29, 2015 | 08:27 AM - Posted by Edmond (not verified)

Fuck, I can't wait for some SMALL, VESA-mountable zboxes with some of this.
