What GP102 Could Mean for NVIDIA

Manufacturer: NVIDIA

Is Enterprise Ascending Outside of Consumer Viability?

So a couple of weeks have gone by since the Quadro P6000 was announced (update: announced, not launched) and the new Titan X launched. With them, we received a new chip: GP102. Since Fermi, NVIDIA has labeled its GPU designs with a G, followed by a single letter for the architecture (F, K, M, or P for Fermi, Kepler, Maxwell, and Pascal, respectively), followed by a three-digit number. The last digit is the most relevant one, however, as it separates designs by their intended size.
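That naming scheme can be sketched as a simple lookup table. This is an illustrative decoder based only on the convention described above, not on anything NVIDIA publishes:

```python
# Illustrative decoder for the codename convention described above
# (based on this article's reading of it, not anything NVIDIA publishes).

ARCH = {"F": "Fermi", "K": "Kepler", "M": "Maxwell", "P": "Pascal"}

# Rough die-area classes implied by the last digit.
SIZE_CLASS = {
    0: "~550-600 mm^2 (flagship)",
    2: "~470 mm^2 (the new GP102 tier)",
    4: "~300 mm^2 (high-end)",
    6: "lower-end SKU",
    7: "lower-end SKU",
}

def decode(codename):
    """Split a codename like 'GP104' into architecture and size class."""
    arch_letter, last_digit = codename[1], int(codename[-1])
    return {
        "architecture": ARCH.get(arch_letter, "unknown"),
        "size_class": SIZE_CLASS.get(last_digit, "unknown"),
    }

print(decode("GP104"))  # Pascal, ~300 mm^2 (high-end)
```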


Typically, 0 corresponds to a ~550-600mm2 design, which is about as large a design as fabrication labs can create without error-prone techniques, like multiple exposures (update for clarity: trying to precisely overlap multiple exposures to form a larger integrated circuit). 4 corresponds to ~300mm2, although GM204 was pretty large at 398mm2, likely to increase the core count while remaining on a 28nm process. Higher numbers, like 6 or 7, fill out the lower-end SKUs until NVIDIA essentially stops caring for that generation. So when we moved to Pascal, jumping two whole process nodes, NVIDIA looked at their wristwatches and said “about time to make another 300mm2 part, I guess?”

The GTX 1080 and the GTX 1070 (GP104, 314mm2) were born.


NVIDIA already announced a 600mm2 part, though. The GP100 had 3840 CUDA cores, HBM2 memory, and an ideal 1:2:4 ratio of FP64:FP32:FP16 performance. (A 64-bit register can store one 64-bit value, two 32-bit values, or four 16-bit values, but only if it is attached to logic circuits that know how to operate on data at those widths; narrower logic is smaller, but it cannot process the wider formats.) This increased ratio, even over Kepler's 1:3 FP64:FP32, is great for GPU compute, but wasted die area for today's (and tomorrow's) games. I predict this takes some wind out of Intel's sails, as Xeon Phi's 1:2 FP64:FP32 performance ratio is one of its major selling points, leading to its inclusion in many supercomputers.
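The packing arithmetic in that parenthetical is easy to verify with Python's struct module: the same eight bytes hold one double, two singles, or four halves.

```python
import struct

# The same 64-bit (8-byte) chunk of memory, filled three different ways.
as_fp64 = struct.pack("<d", 1.0)                  # one 64-bit value
as_fp32 = struct.pack("<2f", 1.0, 2.0)            # two 32-bit values
as_fp16 = struct.pack("<4e", 1.0, 2.0, 3.0, 4.0)  # four 16-bit values

assert len(as_fp64) == len(as_fp32) == len(as_fp16) == 8

# Round-trip the four half-precision values back out.
print(struct.unpack("<4e", as_fp16))  # (1.0, 2.0, 3.0, 4.0)
```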

Despite the HBM2 memory controller supposedly being smaller than GDDR5(X)'s, NVIDIA could still save die space while providing 3840 CUDA cores (a few of which are disabled on Titan X). The trade-off is that FP64 and FP16 performance had to decrease dramatically, from 1:2 and 2:1 relative to FP32 all the way down to 1:32 and 1:64. This new design comes in at 471mm2, although it launched $200 more expensive than the 600mm2 products, GK110 and GM200, did. Smaller dies provide more chips per wafer and, better, since the number of defective chips per wafer should be relatively constant, a larger share of those chips survive.
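The wafer economics can be sketched with a back-of-the-envelope model. The defect density below is a made-up figure and the die-per-wafer formula is the usual rough approximation, so treat the outputs as illustrative only:

```python
import math

def dies_per_wafer(die_mm2, wafer_diameter_mm=300.0):
    """Classic rough estimate: wafer area over die area, minus edge loss."""
    r = wafer_diameter_mm / 2.0
    return int(math.pi * r ** 2 / die_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_mm2))

def zero_defect_yield(die_mm2, defects_per_mm2=0.001):
    """Poisson model: chance a die catches no defects (made-up defect density)."""
    return math.exp(-die_mm2 * defects_per_mm2)

for area in (314, 471, 610):  # roughly GP104, GP102, and GP100-class areas
    good = dies_per_wafer(area) * zero_defect_yield(area)
    print(f"{area} mm^2: {dies_per_wafer(area)} candidates, ~{good:.0f} good")
```

Even with made-up numbers, the shape of the trade-off holds: the 471mm2 die gets meaningfully more good chips per wafer than a 600mm2-class part.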

Anyway, that aside, it puts NVIDIA in an interesting position. Splitting the xx0-class chip into xx0 and xx2 designs allows NVIDIA to lower the cost of its high-end gaming parts, although it cuts out hobbyists who buy a Titan for double-precision compute. More interestingly, it leaves around 150mm2 of headroom for AMD to sneak in an FP32-centric design and claim a potential performance crown.


Image Credit: ExtremeTech

On the other hand, as fabrication node changes become less frequent, it's possible that NVIDIA is leaving itself room for Volta, too. Last month, it was rumored that NVIDIA would release two architectures at 16nm, in the same way that Maxwell shared 28nm with Kepler. In that case, Volta, on top of whatever other architectural advancements NVIDIA rolls into the design, could also grow a little in size. By then, TSMC's yields should have improved, making a 600mm2 design less costly in terms of wasted dies.

If this is the case, we could see the GPGPU folks receiving a new architecture once every second gaming (and professional graphics) architecture. That is, unless you are a hobbyist. If you are? I would need to be wrong, or NVIDIA would need to somehow bring its enterprise SKU down to an affordable price point. The xx0 class seems to have been pushed up and out of viability for consumers.

Or, again, I could just be wrong.


August 17, 2016 | 08:40 PM - Posted by Kjella (not verified)

Well, the Titan X is already a 250W part and 11% lower clocked than the GTX1080, if they released an OC version with frequency parity it'd already be a 250*1.11^2 = 300W+ part. If they increase size to 610mm^2 like the GP100 then it'd be another 30% increase or around 400W. Then you start to need exotic cooling solutions like the Radeon Fury X and the price would probably be 2xGTX1080 + worse yields like $1500 minimum. That's pretty harsh for a consumer card.

August 17, 2016 | 08:54 PM - Posted by SetiroN

Yeah, power consumption doesn't work like that. At all. You pulled those percentages right out of your ass.

And $1200 is already "pretty harsh".

August 17, 2016 | 10:20 PM - Posted by Anonymous (not verified)

$1200 will get you 5 RX 470s, or 4 RX 480s and 1 RX 470, so add up all the extra FP compute that AMD's Polaris SKUs offer for $1200 and that's a lot for some render-farm workloads using all of that FP compute for ray-tracing acceleration on the GPUs. It only takes 2 RX 480s, at around $500, to get close to the FP performance of one Titan X (Pascal) for some number-crunching workloads. It's no wonder the bitcoin miners like AMD, and there are still some mining uses for AMD's Polaris-based GPUs with new algorithms not yet implemented in ASICs! Nvidia sure has to bump up the clocks to get the flops up there compared to Polaris.
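For what it's worth, the peak-FP32 arithmetic behind that comparison is simple: cores times two FLOPs per clock (fused multiply-add) times clock speed. The boost clocks below are approximate reference figures, not numbers taken from the comment:

```python
def peak_fp32_tflops(cuda_cores, boost_clock_ghz):
    """Peak FP32 throughput: cores * 2 FLOPs per cycle (FMA) * clock."""
    return cuda_cores * 2 * boost_clock_ghz / 1000.0

rx480 = peak_fp32_tflops(2304, 1.266)    # Radeon RX 480, reference boost
titanx = peak_fp32_tflops(3584, 1.531)   # Titan X (Pascal), reference boost

print(f"2x RX 480: {2 * rx480:.1f} TFLOPS, Titan X: {titanx:.1f} TFLOPS")
```

Two RX 480s land at roughly 11.7 TFLOPS against the Titan X's roughly 11.0, which is where the "2 RX 480s get close to one Titan X" claim comes from, at least on paper.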

August 18, 2016 | 07:36 AM - Posted by Kjella (not verified)

So it's a lot more complicated, explained in detail here:

It's certainly beyond linear, usually x^2 has been a reasonable approximation around the normal frequency processors operate in - if you go for ultra-low power or extreme overclocking it's not.

Alternatively, it'd be roughly the size of two 180W TDP GTX1080s on one chip, so ~360W though bigger chips tend to run hotter which lowers efficiency, still thinking 350-400W as a ballpark figure.

We're talking about a card that'd beat GTX 1080 SLI here, not sure why you think those numbers would be unreasonable. A 15 billion transistor GPU will be very expensive and very power hungry.
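The two scaling rules being debated in this thread can be written out explicitly. Both are crude approximations (dynamic power roughly tracking frequency squared near normal operating points, and power scaling linearly with die area at fixed clocks), reproduced here with the commenter's own numbers purely for illustration:

```python
def power_after_overclock(base_watts, clock_ratio):
    """Rough P ~ f^2 approximation near normal operating points."""
    return base_watts * clock_ratio ** 2

def power_after_scale_up(base_watts, area_ratio):
    """Naive linear scaling of power with die area at fixed clocks."""
    return base_watts * area_ratio

titan_oc = power_after_overclock(250, 1.11)          # Titan X at 1080 clocks
big_die = power_after_scale_up(titan_oc, 610 / 471)  # grown to GP100's area
print(round(titan_oc), round(big_die))  # 308 399
```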

August 17, 2016 | 09:38 PM - Posted by John H (not verified)

Scott -

"Typically, 0 corresponds to a ~550-600mm2 design, which is about as larger of a design that fabrication labs can create without error-prone techniques, like multiple exposures. "

Chip fabs are already using multiple exposures to produce chips below 20nm; I would recommend rewording this to indicate the real limit is the "reticle limit," which prevents them from creating bigger dies on the wafers (and the "0" is usually pretty close to that limit). Note that this doesn't mean they can't stack logic chips in the future to effectively have more than 600mm2 worth of die space, like flash :).

August 17, 2016 | 10:10 PM - Posted by Scott Michaud

Yeah, I'll be a bit more elaborate. Thanks.

August 18, 2016 | 12:39 PM - Posted by Matt1685 (not verified)

"If this is the case, we could see the GPGPU folks receiving a new architecture once every second gaming (and professional graphics) architecture."

Volta will be used in the Summit and Sierra supercomputers. This has been known for years, and no changes have been announced. Relatively recently, performance estimates and the node count were stated for Summit, and the numbers are not possible with Pascal's capabilities. They would be possible with NVLink 2.0 and a significant increase in FP64 performance over Pascal. Volta, therefore, will include FP64-oriented parts. So GPGPU folks will not be missing out on an upgrade in the next generation (Volta).

Volta will probably introduce more significant changes to the basic compute architecture than Pascal introduced. Pascal added a bunch of new features and optimization of the pipelines, but Maxwell introduced more radical architectural changes.

The reason Maxwell didn't include an FP64-oriented part could be due to a couple of factors. Firstly, I think there was uncertainty over whether TSMC's 20nm would be a viable process for GPUs until relatively late in the design cycle. It's possible a Maxwell FP64 part was contingent on Maxwell being a 20nm part. Secondly, NVIDIA made a strategic shift towards deep learning, leading to the introduction of Pascal, and its deep-learning-oriented features, into their road map. It's possible that FP64 in Maxwell was scrapped in order to focus resources on the development of Pascal, which, with its FP16, NVLink, and greatly increased memory bandwidth, provides significant advantages over a theoretical Maxwell-based GPGPU card. (In fact, there are Maxwell-based GPGPU cards, but they don't have significant FP64 capabilities.) NVIDIA clearly sees deep learning as being very important long term, more important than HPC. In fact, they see HPC embracing deep learning as well.
