NVIDIA Pascal Architecture Details, Tesla P100, GP100 GPU

Manufacturer: NVIDIA

93% of a GP100 at least...

NVIDIA has announced the Tesla P100, the company's newest (and most powerful) accelerator for HPC. Based on the Pascal GP100 GPU, the Tesla P100 is built on 16nm FinFET and uses HBM2.

NVIDIA provided a comparison table, to which we added what we know about a full GP100:

| | Tesla K40 | Tesla M40 | Tesla P100 | Full GP100 |
|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) |
| SMs | 15 | 24 | 56 | 60 |
| TPCs | 15 | 24 | 28 | (30?) |
| FP32 CUDA Cores / SM | 192 | 128 | 64 | 64 |
| FP32 CUDA Cores / GPU | 2880 | 3072 | 3584 | 3840 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1920 |
| Base Clock | 745 MHz | 948 MHz | 1328 MHz | TBD |
| GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz | TBD |
| FP64 GFLOPS | 1680 | 213 | 5304 | TBD |
| Texture Units | 240 | 192 | 224 | 240 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size | Up to 12 GB | Up to 24 GB | 16 GB | TBD |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | TBD |
| Register File Size / SM | 256 KB | 256 KB | 256 KB | 256 KB |
| Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB | 15360 KB |
| TDP | 235 W | 250 W | 300 W | TBD |
| Transistors | 7.1 billion | 8 billion | 15.3 billion | 15.3 billion |
| GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 |
| Manufacturing Process | 28 nm | 28 nm | 16 nm | 16 nm |

This table is designed for developers who are interested in GPU compute, so a few variables (like ROPs) are still unknown, but it still gives us huge insight into the “big Pascal” architecture. The jump to 16 nm allows for about twice the number of transistors, 15.3 billion versus GM200's 8 billion, in roughly the same die area (610 mm2 versus 601 mm2).

A full GP100 processor will have 60 streaming multiprocessors (SMs), compared to GM200's 24, although each Pascal SM contains half as many shaders. The GP100 part listed in the table above is actually partially disabled, with four of the sixty SMs cut off. This leaves 3584 single-precision (32-bit) CUDA cores, up from 3072 in GM200. (The full GP100 will have 3840 of these FP32 CUDA cores -- but we don't know when or where we'll see that.) The base clock is also significantly higher than Maxwell's, 1328 MHz versus ~1000 MHz for the Titan X and 980 Ti, although Ryan has overclocked those GPUs to ~1390 MHz with relative ease. This is interesting, because even though 10.6 TeraFLOPs is amazing, it's only about 20% more than what GM200 could pull off with an overclock.
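
If you want to check that 10.6 TeraFLOPs figure yourself, the peak rate is just cores, times two FLOPs per fused multiply-add, times the boost clock. Here's a quick back-of-envelope sketch using the numbers from the table above (my own arithmetic, not anything NVIDIA published):

```cuda
// Back-of-envelope peak FP32 throughput for Tesla P100 (host-only CUDA C++).
// Assumes 2 FLOPs per core per clock (one fused multiply-add).
#include <cstdio>

int main() {
    const double fp32_cores = 3584.0;  // 56 SMs x 64 FP32 cores each
    const double boost_ghz  = 1.480;   // 1480 MHz boost clock
    const double tflops = fp32_cores * 2.0 * boost_ghz / 1000.0;
    printf("Peak FP32: %.1f TFLOPS\n", tflops);  // prints ~10.6
    return 0;
}
```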

Continue reading our preview of the NVIDIA Pascal architecture!!

Pascal's advantage is that these shaders are significantly more complex. First, double-precision performance is finally at a 1:2 ratio with single-precision, the highest ratio possible when both are first-class citizens. (You can compute two 32-bit values for every 64-bit one, given enough parallelism in your calculations.) This yields 5.3 TeraFLOPs of double-precision performance for GP100 at stock clocks, even with just 56 operational SMs. Compare this to GK110's 1.7 TeraFLOPs, or Maxwell's 0.2 (yes, 0.2) TeraFLOPs, and you'll see what a huge upgrade this is for calculations that need extra precision (or range).
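
The same back-of-envelope formula reproduces the table's FP64 number, using the 1792 double-precision cores that remain with 56 SMs enabled (again, just my own sanity check):

```cuda
// Same formula for double precision on Tesla P100 (host-only CUDA C++).
#include <cstdio>

int main() {
    const double fp64_cores = 1792.0;  // 56 SMs x 32 FP64 cores each
    const double boost_ghz  = 1.480;   // 1480 MHz boost clock
    const double gflops = fp64_cores * 2.0 * boost_ghz;  // 2 FLOPs per FMA
    printf("Peak FP64: %.0f GFLOPS\n", gflops);  // prints ~5304, the table's figure
    return 0;
}
```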

Second, NVIDIA has also added FP16 values as a first-class citizen, yielding a 2:1 performance ratio with FP32. This means that, in situations where 16-bit values are sufficient, you can get a full 2x speed-up by dropping to 16-bit. GP100, with 56 SMs enabled, will have a peak FP16 performance of 21.2 TeraFLOPs.
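
For a sense of how that 2:1 rate is exposed, CUDA packs two FP16 values into a single 32-bit "half2" register and operates on both halves with one instruction (the __hmul2 intrinsic from cuda_fp16.h). The kernel below is a minimal sketch of my own, not NVIDIA sample code; the names are illustrative, and packed-half arithmetic needs a GPU of compute capability 5.3 or newer:

```cuda
// Minimal sketch of packed FP16 math in CUDA: each __half2 holds two 16-bit
// values, and __hmul2 multiplies both halves in a single instruction.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale_half2(__half2 *data, int n_pairs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pairs) {
        __half2 factor = __float2half2_rn(2.0f);  // (2.0, 2.0) as packed halves
        data[i] = __hmul2(data[i], factor);       // two FP16 multiplies at once
    }
}

int main() {
    const int n_pairs = 1 << 20;                  // 1M pairs = 2M FP16 values
    __half2 *data = nullptr;
    cudaMalloc(&data, n_pairs * sizeof(__half2));
    // ...fill data with real values here...
    scale_half2<<<(n_pairs + 255) / 256, 256>>>(data, n_pairs);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```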

You can multiply by 60/56 to see what the full GP100 processor could be capable of, but we're not going to do that here. The reason why: FLOP rating is also dependent upon the clock rate. If GP100's 1328 MHz (1480 MHz boost) is conservative, as we found on GM200, then this rate could get much higher. Alternatively, if NVIDIA is cherry-picking the heck out of GP100 for Tesla P100, the full chip might be slower. That said, enterprise components are usually clocked lower than gaming ones, for consistency in performance and heat management, so I'd guess that the number might actually go up.

Third (yes, this list is continuing), there is a whole lot more memory performance. GP100 increases the L2 cache from 3 MB on GM200 to 4 MB. Since Maxwell, NVIDIA has been able to disable L2 cache blocks (remember the 970?), so we're not sure if this is the final amount, but I expect that it will be. 4 MB is a nice, round number, and I doubt they would mess with the memory access patterns of a professional GPU aimed at scientific applications.

They also introduced this little thing called "HBM2" that seems to be making waves. While it will not achieve the 1TB/s bandwidth that was rumored, at least not in the 16GB variant announced today, 720 GB/s is nothing to sneer at. This is a little more than double what the Titan X can do, and it should be lower latency as well. While NVIDIA hasn't mentioned this, lower latency means that a global memory access should take fewer cycles to complete, reducing the stall in large tasks, like drawing complex 3D materials. That said, GPUs already have clever ways of overcoming this issue, such as parking shaders mid-execution when they hit a global memory access, letting another shader do its thing, then returning to the original task when the needed data is available. HBM2 also supports ECC natively, which allows error correction to be enabled without losing capacity or bandwidth. It's unclear whether consumer products will have ECC, too.
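
If you're wondering where 720 GB/s comes from, it falls out of the 4096-bit bus width and the per-pin data rate. The ~1.4 Gbps per-pin figure below is inferred by working backwards from NVIDIA's number, not an official spec:

```cuda
// Rough check of the quoted HBM2 bandwidth (host-only CUDA C++).
// The ~1.4 Gbps per-pin rate is inferred from NVIDIA's 720 GB/s figure.
#include <cstdio>

int main() {
    const double bus_width_bits = 4096.0;  // four HBM2 stacks, 1024 bits each
    const double per_pin_gbps   = 1.4;     // inferred effective data rate per pin
    const double gbytes_per_s = bus_width_bits * per_pin_gbps / 8.0;
    printf("Bandwidth: ~%.0f GB/s\n", gbytes_per_s);  // ~717, close to the quoted 720
    return 0;
}
```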

Pascal also introduces two new features: NVLink and Unified Memory. NVLink is useful for multiple GPUs in an HPC cluster, allowing them to communicate at a much higher bandwidth. NVIDIA claims that Tesla P100 will support four "links", yielding 160 GB/s in both directions. For comparison, that is about half of the bandwidth of the Titan X's GDDR5, which is right there on the card beside the GPU. This also plays in with Unified Memory, which allows the CPU to share a memory space with the GPU. Developers could write serial code whose data, without performing a copy, can be modified by the GPU for a burst of highly-parallel acceleration.
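
To give a rough idea of what that looks like for a developer, here's a minimal Unified Memory sketch using the CUDA runtime's cudaMallocManaged. The kernel and variable names are just illustrative; this is a simplified example of mine rather than anything NVIDIA showed:

```cuda
// Minimal Unified Memory sketch: cudaMallocManaged returns one pointer that
// both the CPU and GPU can dereference, so no explicit cudaMemcpy is needed.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));     // one allocation, shared address space
    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // CPU writes directly
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // GPU works on the same memory
    cudaDeviceSynchronize();                         // wait before the CPU reads again
    printf("data[0] = %.1f\n", data[0]);             // prints 2.0
    cudaFree(data);
    return 0;
}
```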

Where can you find this GPU? Well, let's hear what Josh has to say about it on the next page.


April 6, 2016 | 02:35 AM - Posted by Anonymous (not verified)

I am not sure how they are fitting that on an interposer. I thought that HBM2 die stacks are specified to be 92 mm2 compared to 49 mm2 for HBM1. Four of them should be close to 400 mm2. With a 610 mm2 GPU die, that would require a 1000 mm2 interposer. Is that a possibility? I guess they could be using only two 8 GB stacks, but they would need to increase the clock speed significantly to go from 512 (spec) to 720 GB/s.

April 6, 2016 | 04:32 AM - Posted by Pixy Misa (not verified)

As I understand it, silicon interposers aren't restricted by the reticle size. The features are large enough that there isn't an alignment problem; they can take up the entire wafer if need be.

April 8, 2016 | 03:16 AM - Posted by Anonymous (not verified)

That is probably a lot more expensive if they have to resort to making them larger than the reticle size.

April 6, 2016 | 08:55 AM - Posted by Josh Walrath

I'm not entirely sure either.  Yes, you can have multiple exposures to make the interposer larger, but I think the way they are getting around it is that the 4 GB HBM2 dies are much smaller than the 8 GB HBM2 units.

April 6, 2016 | 02:20 PM - Posted by Anonymous (not verified)

What about asynchronous compute for Pascal? Has there been any improvement over Maxwell in Nvidia's new Pascal micro-arch to better schedule processor threads and better utilize the GPU's core execution resources? Is there still the need to wait until the end of a draw call to schedule graphics or compute threads in Nvidia's new Pascal-based GPUs, or have they improved their thread scheduling granularity to the point that few execution resources are left idle while there is work backed up in the execution queues?

April 6, 2016 | 04:39 PM - Posted by Scott Michaud

These issues aren't going to be discussed at a CUDA/OpenCL summit. No idea.

April 6, 2016 | 06:33 PM - Posted by Anonymous (not verified)

And yet at the Register, a non gaming website!!!

"Software running on the P100 can be preempted on instruction boundaries, rather than at the end of a draw call. This means a thread can immediately give way to a higher priority thread, rather than waiting to the end of a potentially lengthy draw operation. This extra latency – the waiting for a call to end – can really mess up very time-sensitive applications, such as virtual reality headsets. A 5ms delay could lead to a missed Vsync and a visible glitch in the real-time rendering, which drives some people nuts.

By getting down to the instruction level, this latency penalty should evaporate, which is good news for VR gamers. Per-instruction preemption means programmers can also single step through GPU code to iron out bugs."(1)

(1)
http://www.theregister.co.uk/2016/04/06/nvidia_gtc_2016/

It looks like the websites that cover mostly professional server news have a better handle on the technical aspects of new GPU hardware than the gaming websites! I cannot wait for the Zen server SKUs to be reviewed at the Register for some more information on that front! It looks like Nvidia fixed that on their server/HPC GPU variants, so hopefully the same can be said for the consumer variants!
Having unallocated compute resources when the queue is backed up is very bad for processor utilization, so it's good that Nvidia fixed that! The HPC/workstation market is not going to tolerate any hardware gimping from Nvidia in this matter, so keep up the improvements, Nvidia; asynchronous compute is very important!

April 14, 2016 | 05:19 PM - Posted by Allyn Malventano

The fact that they think a single draw call can stall the pipeline for 5ms and cause a missed VR sync pretty much means they are only going off of one line of information and trying to write much smarter than they are. 50ns is a far cry from 5ms.

April 8, 2016 | 04:01 AM - Posted by Anonymous (not verified)

Looking at the Anandtech article (again):

http://www.anandtech.com/show/9969/jedec-publishes-hbm2-specification

The slide titled "Mechanical Outline : Molded KGSD" indicates that the 40 mm2 and the 92 mm2 are the sizes of the packages in the specification. The chips would need to be this size for the micro-bumps to line up with the pads or micro-bumps on the interposer. The micro-bump array is part of the JEDEC specification. Unless they are making non-standard HBM2, it doesn't look like it changes anything if they are going with 4 GB x 4 packages. It says 16 GB and a 4096-bit interface, which does imply a 4x4 system. It looks like they will be ~92 mm2 each and the interposer will need to be around 1000 mm2. Is the picture in the article supposed to be an actual device or just a mock-up? We already know these things will not be cheap, so they may just be making interposers larger than the reticle. Don't expect such a device in the consumer market anytime soon, if ever, though.

April 6, 2016 | 04:43 PM - Posted by Scott Michaud

It's definitely 4x4GB.

April 8, 2016 | 03:18 AM - Posted by Anonymous (not verified)

It is actually about 40 mm2 for HBM1, not 49.

April 6, 2016 | 02:44 AM - Posted by Anonymous (not verified)

It is surprising that they are making such a large die at 16 nm. I suspect that when, or if, any of these make it to the consumer market, they will be significantly more expensive than the Titan X was. There is no way that they are going to be able to make a 600+ mm2 die on 16 nm with yields anywhere close to what was possible with 28 nm. They may have a huge number of defective parts that can be salvaged, though, so perhaps they can do a very cut-down consumer version.

April 6, 2016 | 09:41 AM - Posted by Spunjji

I will not be surprised if this ends up as another Fermi - late, initially low-yielding and brute forcing its way to performance dominance. If so then it will probably really shine in its second iteration when they bring down the defect rate and/or with the smaller die variants.

Funnily enough it looks like AMD are taking the same approach they did back then, with a smaller and more area-efficient product.

April 6, 2016 | 11:24 AM - Posted by Josh Walrath

It is interesting to see that NV might be taking that route again.  But at least with this implementation, there is no other competition at this extremely high end. It does not matter how hot it runs or how well it yields; if they sell each card for $15K then they are more than covering their expenses getting these parts out.

April 6, 2016 | 02:10 PM - Posted by Anonymous (not verified)

If they are clocking their GPUs higher, then maybe the 16nm process is engineered to have a larger pitch (distance between circuits/gates) than would normally be used for a GPU's layout. CPUs have less dense layouts to run at higher clocks, but that does not negate the circuit gains from having 14nm/16nm gate sizes; it just means that the circuits are spaced further apart for better heat dissipation. Circuit pitch (spacing) plays an important part in a processor's ability to handle higher clock speeds, at the cost of space savings, even at 14nm/16nm gate sizes with their inherent advantages. So maybe the larger die size is a trade-off for higher clock speeds, at the cost of some space savings, for these high-end server/pro SKUs.

April 6, 2016 | 04:41 PM - Posted by Scott Michaud

Yeah, it's interesting that NVIDIA's enterprise clock is around 400 MHz higher than it used to be. I'm curious to see how high consumer will go.

April 6, 2016 | 04:11 AM - Posted by [CoFR]Prodeous (not verified)

Seeing Fury X single precision being at 8.6 Gigaflops, Pascals 10.6 Gigaflops doesn't sound such a major improvement.

Will admit that dual precision is where Pascal shines, exceeding 5 Gigaflops. Nicely done.

I am highly confused about the 1/2 precision. It feels like such a marketing play.. after all 20 Gigaflops of 1/2 precision is a big number. My question is what would this be used for?

April 6, 2016 | 04:12 AM - Posted by [CoFR]Prodeous (not verified)

Correction. Teraflops, not Gigaflops on all numbers ...

April 6, 2016 | 04:41 AM - Posted by Vicen (not verified)

Usually raw teraflops is not a very good measure of how fast you can compute in practice; memory bandwidth tends to be the bottleneck. In this case I would expect the performance of the P100 to exceed the Fury X by a larger margin than 20%. For the deep learning and CFD workloads that we use GPUs for, the P100 could easily run twice as fast as the Fury.

April 6, 2016 | 09:04 AM - Posted by [CoFR]Prodeous (not verified)

Is FP16 or half-precision used for deep learning?

With regards to 20% margin. For me it is just about their claim. I agree that actual performance might be bigger. HBM2 will in itself play a big role.

So i'm not even going to guess the actual performance.

Will admit I am really, really happy that they are back to 1/2-speed double precision compared to 1/32? in the previous ones.

April 6, 2016 | 11:25 AM - Posted by Josh Walrath

Yeah, there are a lot of workloads where FP16 is more than adequate for their needs.  Deep learning is one of them.

April 6, 2016 | 10:59 AM - Posted by renz (not verified)

Deep learning. SP might be underwhelming, but then again you have to look at how well Nvidia's GK110/GK210 defended their position vs the much superior AMD Hawaii FirePro.

April 6, 2016 | 02:31 PM - Posted by Anonymous (not verified)

That is done to compete with the Xeon Phi, and also to meet the requirements of the server/HPC market for which this SKU is the intended target. That better DP-to-SP FP ratio gives the DP compute performance that the server/HPC market requires, so this SKU is ahead of the Xeon Phi by a larger margin.

April 6, 2016 | 09:09 AM - Posted by Anonymous (not verified)

"AMD's Fury may be the first GPU to feature an interposer and HBM1 memory, but the P100 will quickly outclass this product. If high end enthusiasts can even get a hold of it..."

And therein lies the rub. AMD was able to do it and get it into the hands of high-end gamers. NVIDIA can't even do that.

April 6, 2016 | 09:39 AM - Posted by Anonymous (not verified)

Well, it hardly matters if the memory isn't bottlenecking, and considering how well the Ti line performs compared to Fury, it has to be said that in gaming it really doesn't, at least not yet. AMD failed by relying too much on HBM technology too soon, because it seems evident that they can't bring it to consumer products in any meaningful (monetizing) way before Nvidia does so too (with HBM2). I doubt Fury brought them enough market share and profits compared to the costs of making the product in the first place.

April 6, 2016 | 09:42 AM - Posted by Spunjji

Agreed, but it does mean they're on their second-gen product with the tech. They have already worked out the complex inter-company relationships needed to get these products running and out the door.

Whether that gives them any benefit /in practice/ remains to be seen.

April 6, 2016 | 06:57 PM - Posted by Anonymous (not verified)

Really, AMD developed the HBM technology/standard with SK Hynix, and AMD will also be using HBM2. I do not see how that can be a fail for AMD when AMD is already demonstrating consumer-based Polaris SKUs. AMD is the co-creator of HBM; I do not see Nvidia being on the front lines of developing any open standards/JEDEC standards for ALL to use like AMD did with HBM. Hopefully AMD will make some inroads into the HPC workstation market with their server/HPC APUs on an interposer; AMD needs that business more than it needs only the consumer side of things.

Google will be using Power9-based servers, so AMD maybe needs to get a Power9 license from OpenPOWER and do x86, ARM (K12), and some Power-based GPU acceleration products. Nvidia has the Power GPU acceleration market all to itself currently.(1) PCIe 3.0 is not fast enough for the HPC/exascale market, so both AMD and Nvidia will have to compete, with Nvidia currently leading in the server/HPC market. And x86-based SKUs are no longer the only game in town across all markets except the PC/laptop market, but that will change for the PC/laptop market too if some Power8/Power licensee builds a PC variant using Power ISA based CPUs.

(1)

http://www.theregister.co.uk/2016/04/06/google_power9/

April 6, 2016 | 07:26 PM - Posted by Anonymous (not verified)

P.S. Another article on Google and Rackspace using Power9s:

http://www.theregister.co.uk/2016/04/06/google_rackspace_power9/

LOOK out, makers of x86-only based products (Intel), as at least AMD has its K12 custom ARM cores (Jim Keller designed). OpenPOWER is licensing Power8 and newer designs, and Nvidia has a lead supplying GPU accelerators for Power8/9-based systems! Better look into that market also, AMD, and get a Power8/9 license and integrate your GCN-based Polaris/Vega IP into that Power ISA based marketplace.

April 9, 2016 | 07:19 PM - Posted by Anonymous (not verified)

Well, maybe AMD will be working with Intel on a server-based option since they've been getting so chummy lately.

April 6, 2016 | 12:36 PM - Posted by jcaf77 (not verified)

so... when are the consumer enthusiast cards coming out????

April 6, 2016 | 01:17 PM - Posted by Josh Walrath

My guess would be 2017?

April 6, 2016 | 02:30 PM - Posted by jcaf77 (not verified)

so i should go ahead and buy the 980 ti then... btw josh, you are hilarious. my type of humor

April 7, 2016 | 12:19 AM - Posted by Josh Walrath

Thanks for watching. Don't invest in a 980 Ti yet... give it a couple of weeks for more rumors and leaks to come out before spending $600 on a card that could be overshadowed in 3 months.

April 8, 2016 | 02:39 PM - Posted by svnowviwvn

Wrong.

July 2016.

http://techreport.com/news/29961/rumor-nvidia-to-launch-gtx-1080-and-gtx...

Nvidia will unveil a consumer Pascal chip to the public at Computex 2016, in the form of its GeForce GTX 1080 and 1070 cards. Digitimes says that card makers will fire up mass production of Pascal-based GeForces during July. Asus, Gigabyte, and MSI are among the players expected to show cards at Computex.

April 15, 2016 | 11:22 AM - Posted by John H (not verified)

The 1070/1080 are considered 'high end' (GP104); the 1080 Ti/Titan are 'enthusiast' (GP100 core). The 1080 will likely be faster than a 980 Ti, but later a 1080 Ti / Titan will come out that smacks that down pretty strongly. That card is likely to be either a 'holiday shopping special' or a 2017 card, based on the timing of everything here.

I bought a 980 Ti a month ago to replace my 970 SLI and am very happy, although I bought it because I wanted to play Elite Dangerous at a reasonable setting on my Oculus. If you can wait, the new 1070 or 1080 should be a bargain compared to 980 Ti pricing and offer equal or better performance. Not to mention AMD has a whole new generation coming soon too.

April 6, 2016 | 12:38 PM - Posted by skysaberx8x

shut up and take my money :)

April 6, 2016 | 01:06 PM - Posted by Idiot (not verified)

Didn't read the article. Does Pascal fully support Async Compute? Thanks.

April 6, 2016 | 01:18 PM - Posted by Josh Walrath

They didn't go into that level of granularity or address that exact question.

April 6, 2016 | 03:17 PM - Posted by funandjam

I get how async compute works in gaming. But their keynote was about server, development, and professional technologies and applications. Would any of the technologies presented benefit from async compute in any significant way?

April 6, 2016 | 04:36 PM - Posted by Scott Michaud

Not sure, but I doubt it affects OpenCL or CUDA. It's designed to independently load the 3D and compute engines, but the former isn't used there.

April 6, 2016 | 04:46 PM - Posted by funandjam

If what you say is true, then that makes sense of why it wasn't brought up in their keynote.

Now we just wait and see when they announce about the consumer desktop GPU side of things.

April 6, 2016 | 01:22 PM - Posted by Danny (not verified)

If Pascal performs like crap in DirectX 12, then I will buy AMD Polaris instead. Plus, I don't want to spend a ridiculous amount of money on a G-Sync monitor, since there's less benefit with a high-refresh-rate monitor. Moreover, not all games work with G-Sync, and I don't want to cope with extra input latency, which is why the majority of PC gamers strongly prefer Vertical Sync off.

I don't know about NVIDIA, but it seems like NVIDIA is stepping into a monopoly position based on their business standpoint, with things like GameWorks, G-Sync, PhysX, etc. But in the meantime, I'd wait until there's a real benchmark between Polaris and Pascal GPUs. And hopefully, next-gen GPUs will be announced at Computex 2016 in Taipei, Taiwan at either the end of May or in June.

April 6, 2016 | 04:38 PM - Posted by Scott Michaud

Generally speaking, it makes sense to choose the best results for your budget. If Pascal under-performs, then it makes sense to use Polaris. Likewise, vice-versa. We'll see.

April 9, 2016 | 07:20 PM - Posted by Anonymous (not verified)

Like the guy said, if it's the same he does not want a G-Sync monitor, and Nvidia won't support FreeSync like Intel will.

April 6, 2016 | 06:00 PM - Posted by Allyn Malventano

>> Moreover, not all games work with G-Sync,

I've played a lot of games on FreeSync / G-Sync and the only real requirement appears to be full screen. G-Sync also has a mode that works on desktop / windowed games, but not as well as the full-screen experience.

>> I don't want to cope with extra input latency, which is why the majority of PC gamers strongly prefer Vertical Sync off.

VRR displays continue to draw at the 'max' speed even when in varying frame rates, which means that the speed of the scan at 40 FPS is just as fast as it is at 144/165. The end result is that the advantage to running VSYNC-off is nearly negligible.

April 6, 2016 | 05:50 PM - Posted by john vitz (not verified)

No TDP improvement. One day a jillion-watt GPU will be the norm.

April 11, 2016 | 11:32 AM - Posted by JesusBaltan (not verified)

no improvement?

look at the clock frequencies of the GPU and then we'll talk

April 6, 2016 | 10:08 PM - Posted by Anonymous (not verified)

And again, we get a nice little 3D rendering of "Pascal." That's pretty dandy.
But where the hell is a physical Pascal chip? Seriously, something must have gone WAY wrong if they haven't even shown a chip publicly, let alone in a demo...

Meanwhile RTG has been setting up demos for what, three months now?

Something has definitely gone wrong.

April 7, 2016 | 02:36 AM - Posted by Mandrake

Nice write up! For those who didn't see it, Nvidia did indeed have a P100 demo at GTC.

http://a.disquscdn.com/uploads/mediaembed/images/3460/3367/original.jpg

It appears to be eight GP100 GPUs running in parallel.

April 7, 2016 | 09:02 AM - Posted by PL (not verified)

Am I reading this right? "GPU Die Size 551 mm2 601 mm2 610 mm2 610mm2"

61cm² dies? That's almost as big as my entire case... did someone get their metric system wrong?

April 7, 2016 | 02:44 PM - Posted by Tim Verry

6.1cm^2 dies

April 8, 2016 | 12:26 AM - Posted by Scott Michaud

100mm2 = 1cm2

1cm2 = 1cm * 1cm = 10mm * 10mm = 100mm2

April 15, 2016 | 11:22 AM - Posted by John H (not verified)

How many inches is that?

;)