Rumor: 16nm for NVIDIA's Volta Architecture

Subject: Graphics Cards | July 16, 2016 - 06:37 PM |
Tagged: Volta, pascal, nvidia, maxwell, 16nm

For the past few generations, NVIDIA has roughly aimed to release a new architecture on a new process node, then release a refresh the following year. This ran into a hitch when Maxwell was delayed a year, apart from the GTX 750 Ti, and then landed on the same 28nm process that Kepler utilized. Pascal caught up with 16nm, although we know that some hard, physical limitations are right around the corner. The lattice constant for silicon at room temperature is about 0.54nm, so we're talking about features on the order of thirty atoms wide.
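As a quick back-of-the-envelope check on that atom count (a sketch in Python, using silicon's published room-temperature lattice constant):

```python
# How many silicon lattice constants fit across a nominal process feature?
SI_LATTICE_CONSTANT_NM = 0.543  # silicon at room temperature, in nanometers

def lattice_constants_across(feature_nm: float) -> float:
    """Width of a feature, measured in silicon lattice constants."""
    return feature_nm / SI_LATTICE_CONSTANT_NM

print(round(lattice_constants_across(16.0)))  # ~29, i.e. roughly thirty atoms wide
```

Keep in mind that node names like "16nm" are marketing labels that don't map directly to any single physical dimension, so this is only an order-of-magnitude illustration.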


This rumor claims that NVIDIA is not trying to go with 10nm for Volta. Instead, it will remain on the same 16nm node that Pascal currently occupies. This is quite interesting, because GPUs scale quite well with added complexity: they consist of many parallel units running at relatively low clock rates. Staying on the same node leaves only two real ways to increase performance: make the existing architecture more efficient, or make a larger chip.

That said, GP100 leaves a lot of room on the table for an FP32-optimized, ~600mm2 part to crush its performance at the high end, similar to how GM200 replaced GK110. The rumored GP102, expected in the ~450mm2 range for Titan or GTX 1080 Ti-style parts, has some room to grow. Like GM200, however, it would also be unappealing to GPU compute users who need FP64. If this is what is going on, and we're totally just speculating at the moment, it would signal that enterprise customers should expect a new GPGPU card every second gaming generation.
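For a sense of the trade-off, peak throughput is just cores × 2 (one fused multiply-add per clock) × clock speed. Here is a minimal sketch using Tesla P100's published figures (3584 FP32 cores, 1792 FP64 cores, ~1.48 GHz boost); the numbers are illustrative, not a statement about unannounced parts:

```python
def peak_tflops(cores: int, clock_ghz: float) -> float:
    """Peak throughput assuming one FMA (2 FLOPs) per core per clock."""
    return cores * 2 * clock_ghz / 1000.0  # GFLOPS -> TFLOPS

# Tesla P100 (GP100), published specifications
fp32 = peak_tflops(3584, 1.48)  # ~10.6 TFLOPS single precision
fp64 = peak_tflops(1792, 1.48)  # ~5.3 TFLOPS double precision (1:2 ratio)
print(f"FP32: {fp32:.1f} TFLOPS, FP64: {fp64:.1f} TFLOPS")
```

An FP32-optimized die could spend the area now devoted to FP64 units on more FP32 cores instead, which is the GM200-versus-GK110 pattern described above.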

That is, of course, unless NVIDIA found ways to make the Maxwell-based architecture significantly more die-space efficient in Volta. Clocks could get higher, or the circuits themselves could get simpler. You would think that, especially in the latter case, they would have integrated those ideas into Maxwell and Pascal already; but, as with HBM2 memory, there might have been a reason why they couldn't.

We'll need to wait and see. The entire rumor could be crap, who knows?

Source: Fudzilla

July 16, 2016 | 08:24 PM - Posted by Anonymous (not verified)

Volta was already pushed back. It was supposed to come after Maxwell, so it's no surprise it won't see a die shrink.

http://images.anandtech.com/doci/7900/OldRoadmap.jpg

Volta was moved back and Pascal took its place

July 16, 2016 | 08:44 PM - Posted by Scott Michaud

Yup! I remember that. It was over two years ago, though. I don't think NVIDIA had even announced that Maxwell would stay on 28nm at that point.

July 16, 2016 | 10:16 PM - Posted by patrickjp93 (not verified)

It would be more accurate to say Pascal was pushed back, because Pascal is the original Maxwell design for 20nm. When TSMC didn't produce a high-power 20nm node, Nvidia stripped a lot of what would make Pascal great and produced Maxwell. Thus, Nvidia got 2 product lines for a single R&D cost. Volta was not pushed back so much as Nvidia got a buffer product and wanted to make the most back on its R&D investments.

July 17, 2016 | 09:57 AM - Posted by Matt (not verified)

"It would be more accurate to say Pascal was pushed back, because Pascal is the original Maxwell design for 20nm."

No, I don't believe that Pascal in its current form could have been released on 20nm in 2014. NVIDIA is not that far ahead of AMD and Intel. NVIDIA knew very early on that they wouldn't, or at least very likely wouldn't, use the 20nm node. Here's a story from early 2012 with NVIDIA talking about TSMC's 20nm node: http://www.extremetech.com/computing/123529-nvidia-deeply-unhappy-with-t...

These things are not complete surprises. The customers are testing the processes as they are being developed. Notice the date on some of those slides.

Back at GTC 2010, NVIDIA listed Maxwell as having about 15 GFLOPS per Watt in double precision. Obviously this was changed, as Maxwell didn't have significant DP units. But even in 2013, before Pascal was unveiled, the only feature listed for Maxwell was Unified Virtual Memory, which it does have. No mention was ever made of 3D RAM, NVLINK, Unified memory (segmentation faults issued), or other features of Pascal.

Obviously features were shuffled around, with some getting delayed and perhaps others getting moved up. But it's not accurate to say that "Volta was pushed back" or "Pascal was pushed back" or "NVIDIA got 2 product lines for a single R&D cost". These architectures are not monoliths. Things changed in the roadmaps, NVIDIA continued to steadily pump in R&D spending as a function of time, and the architectures took shape differently than what was presented in earlier years.

July 17, 2016 | 09:31 AM - Posted by Matt (not verified)

You have to remember that 'Volta', 'Pascal', etc. are just internal company names. The 'Volta' being released in 2018 is certainly not the same architecture as the 'Volta' that was planned to be released in 2016 before 'Pascal' was inserted. One also has no idea of the primary reasoning for the change. What we can say is that a re-assessment of features and focus happened, leading to the insertion of 'Pascal'. It is unknown whether that was because 'Volta' technologies weren't ready and needed to be pushed back, or because a strategic shift occurred and other technologies not included in that 'Volta' were deemed more important. The comments Jen-Hsun Huang made regarding NVIDIA deciding to go "all-in" on deep learning with Pascal suggest the latter, but who really knows? You don't know and I don't know, but you're wording it in a way that strongly suggests the former of the two possibilities.

Regardless, the new Volta will build on top of Pascal, and presumably incorporate features planned for the old 'Volta' as well as features that were never planned for the old 'Volta'.

July 16, 2016 | 09:12 PM - Posted by Anonymous (not verified)

14nm and 16nm are going to be around for a while, maybe not as long as 28nm, but there is a lot of process tweaking to do to get more performance out of the newer nodes. Nvidia needs to work on hardware-based async-compute management in its Volta micro-arch and work towards better performance with the latest graphics APIs. I just wonder what Nvidia is going to do to stop the graphics APIs' GPU multi-adapter from working on its GTX 1060 SKUs that don't have SLI available, because the gaming industry looks to be moving towards managing multi-GPU load balancing via the new graphics APIs.

July 16, 2016 | 09:40 PM - Posted by Scott Michaud

I strongly doubt that NVIDIA would deliberately enumerate graphics devices incorrectly to the APIs. That's a much different issue than not providing SLI.

July 17, 2016 | 12:03 AM - Posted by Anonymous (not verified)

That is good, so no more worries about the GPU makers' lackluster support for multi-GPU in their respective drivers/middleware. There will be more R&D funding from the games/gaming-engine developers and OS/graphics-API developers going forward, with the GPU makers mostly out of the picture for multi-GPU support. GPUs should have a minimum of driver complexity and be more under the control of the graphics APIs and games/gaming engines.

GPUs, like any processor connected to the system, should be just another graphics/compute resource to the OS and graphics APIs, with the GPU makers supplying at the driver level only what is absolutely necessary to properly abstract the GPU's hardware and allow the most efficient access to its feature sets. Let other features like multi-GPU load balancing reside in the OS/graphics APIs, developed under the larger R&D assets that the entire gaming industry and OS/graphics-API contributors can offer.

July 19, 2016 | 07:36 PM - Posted by Anonymous (not verified)

The multi-GPU issue in the short term is that game devs have no incentive to enable it. The people who sold GPUs did.

I suppose the game ENGINE developers have incentive since they compete with each other.

We need SFR (Split Frame Rendering) to be natively at the core of the game engine and tools so that little extra effort is required.

We'll get there, but it takes YEARS to do these things well. When it really starts to take off there will be a shift towards multi-GPU on the same card.

I do expect the approx 2020 consoles to be dual-GPU.

July 17, 2016 | 04:24 AM - Posted by Anonymous (not verified)

"I just wonder what Nvidia is going to do to stop the graphics APIs' GPU multi-adapter from working on its GTX 1060 SKUs that don't have SLI available."

Nothing at all. The reviewer copy provided with the cards even explicitly states that 1060 will only work with Explicit Multiadapter and MDA mode, and NOT implicit multiadapter (i.e. traditional SLI).

July 16, 2016 | 11:28 PM - Posted by Shadowarez

Hope Nvidia learned their lesson and put asynchronous compute in hardware in Volta instead of just brute forcing it like they're doing with Pascal.

It's going to be interesting to see what excuse they come up with if it's not in Volta, or if they try to reuse the "we'll put it in through drivers" line like they did with Maxwell.

July 17, 2016 | 09:20 AM - Posted by Matt (not verified)

"Hope Nvidia learned their lesson and put asynchronous compute in hardware in Volta instead of just brute forcing it like they're doing with Pascal."

Benchmarks I have seen suggest Pascal makes use of asynchronous compute very well.

July 19, 2016 | 07:43 PM - Posted by Anonymous (not verified)

Yep. So much misinformation. In the same AotS benchmark NVidia gained 6.8% whereas the RX-480 gained 8.5%. It's really not the 20% difference some cherry-picked benchmarks imply.

That's also not fully optimized for either one. In fact, we probably don't have ANYTHING that's really doing DX12 justice including the Time Spy demo.

So many "experts" crawl out of the woodwork it can get a bit frustrating but the rule of thumb is often to WAIT until we've got a lot more data to go by.

Pascal has dynamic load balancing which works quite well. I had some concerns several months ago, but not now. Once DX12 matures we'll see games optimized more for NVidia and some more for AMD just like normal.

Heck, I'm just happy AMD is staying in the picture to keep things competitive.

July 17, 2016 | 09:18 AM - Posted by Matt (not verified)

"That said, GP100 leaves a lot of room on the table for an FP32-optimized, ~600mm2 part to crush its performance at the high end, similar to how GM200 replaced GK110. The rumored GP102, expected in the ~450mm2 range for Titan or GTX 1080 Ti-style parts, has some room to grow. Like GM200, however, it would also be unappealing to GPU compute users who need FP64. If this is what is going on, and we're totally just speculating at the moment, it would signal that enterprise customers should expect a new GPGPU card every second gaming generation."

Volta is a key component in the Summit and Sierra supercomputers due to be operational in early 2018. Therefore, your speculation about FP64 not being present in Volta is clearly wrong. NVIDIA will not be skipping a generation for GPGPU purposes where Volta is concerned.

July 17, 2016 | 11:57 AM - Posted by jabbadap (not verified)

Agreed, Nvidia has to hurry with the Volta HPC chip to make the timeline. Just wondering how in the hell they can get more fp64 out of 16nm FF+. GP100 is a huge 610mm² chip with a 1:2 fp64:fp32 ratio. Will they make GV100 a 1:1 ratio chip, with some fat 32-CUDA-core SMs?!

Maxwell should have been a 20nm chip, but that process node failed hard, so Maxwell was designed for 28nm instead and Nvidia stripped fp64 functionality to make it fit into 600mm². They can't do that now; big Volta _must_ have strong fp64 per the Summit and Sierra contracts.

July 17, 2016 | 12:10 PM - Posted by svnowviwvn

Quote: Agreed, Nvidia has to hurry with the Volta HPC chip to make the timeline. Just wondering how in the hell they can get more fp64 out of 16nm FF+. GP100 is a huge 610mm² chip with a 1:2 fp64:fp32 ratio.

Right now fp64 and fp32 are individual units, so both take up space on the die. Maybe they can find a way to combine them into a single fp64/fp32 unit that is only a few percent larger than a single fp64 unit, and then get rid of all those individual fp32 units.

July 17, 2016 | 01:32 PM - Posted by Anonymous Nvidia User (not verified)

Yes, fewer units would be more efficient. But without raw core counts, Nvidia wouldn't be able to take much advantage of asynchronous compute. They have far fewer cores than AMD now, and that's why AMD can show better async gains: they have more idle cores.

Gasp. Who would have thought an Nvidia card is more efficient than a Radeon. Everyone but AMD fanboys, raise their hands. A few stammer out "our async performance beats yours; we get a few frames more in our Gaming Evolved titles than you." True, but when you have to consume up to double the wattage of a comparable Nvidia card to get there, it's called a Pyrrhic victory.

July 17, 2016 | 03:22 PM - Posted by Matt (not verified)

"Right now fp64 and fp32 are individual units, so both take up space on the die. Maybe they can find a way to combine them into a single fp64/fp32 unit that is only a few percent larger than a single fp64 unit, and then get rid of all those individual fp32 units."

I believe NVIDIA already used the arrangement you are suggesting in Fermi. Starting with Kepler they switched to dedicated FP64 circuitry, perhaps for power efficiency reasons (I'm guessing), and maybe because it allows them to easily build GPUs with varying amounts of double precision performance.

July 17, 2016 | 04:19 PM - Posted by Scott Michaud

That seems likely to me. Being able to separate FP32 and FP64 performance takes die complexity. More logic is required to separate tasks, etc.

If games aren't going to use it, though, then the added complexity is less than what 1:2 FP64 requires, thus it's a net win.

July 17, 2016 | 04:08 PM - Posted by Matt (not verified)

"Agreed, Nvidia has to hurry with the Volta HPC chip to make the timeline. Just wondering how in the hell they can get more fp64 out of 16nm FF+. GP100 is a huge 610mm² chip with a 1:2 fp64:fp32 ratio. Will they make GV100 a 1:1 ratio chip, with some fat 32-CUDA-core SMs?!

Maxwell should have been a 20nm chip, but that process node failed hard, so Maxwell was designed for 28nm instead and Nvidia stripped fp64 functionality to make it fit into 600mm². They can't do that now; big Volta _must_ have strong fp64 per the Summit and Sierra contracts."

Yes I am very curious how they plan to get the performance and efficiency improvements they are targeting with Volta out of the 16nm node. Certainly some can come from improvement of and experience with the node itself. But they seem to be claiming a greater than 40% gain in double precision FLOPS performance for Volta compared with Pascal. In fact for single precision general matrix multiply they look to be claiming almost 2x the efficiency.
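That gain is easy to reproduce from the FP64 peaks involved. In the sketch below, the Pascal number is Tesla P100's published peak; the Volta figure is an assumption read off NVIDIA's 2016 roadmap slides, not a confirmed spec:

```python
PASCAL_FP64_TFLOPS = 5.3  # Tesla P100, published peak double precision
VOLTA_FP64_TFLOPS = 7.5   # roadmap target (assumption, not a confirmed spec)

gain = VOLTA_FP64_TFLOPS / PASCAL_FP64_TFLOPS - 1.0
print(f"claimed generational FP64 gain: {gain:.1%}")  # ~41.5%
```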

July 17, 2016 | 01:41 PM - Posted by Anonymous Nvidia User (not verified)

Seriously though, I couldn't care less about async performance. Nvidia needs to keep doing what they are doing and make the most efficient cards they can. Even if they only get 1% from async, it means their cards are already operating near peak efficiency, which is loads better.

If Nvidia ever made a gaming card that consumes the same watts as a same-generation AMD card, they would curb stomp AMD. Believe it or not, they still need the competition (if you can call AMD that) or they become a monopoly. Wait, no they don't: Intel is the leading graphics producer in the world. LOL

July 17, 2016 | 06:20 PM - Posted by Anonymous (not verified)

Nvidia seems to be pushing clock speed a lot harder. This makes some sense to me; if you can't go bigger (more units), then going for higher clocks is an alternative way to increase performance. It can't be pushed very far, though, because deeper pipelining can start to take more die area itself, and power consumption can increase significantly with higher clocks. Pushing extreme clock speed also takes a lot more development dollars. Nvidia tries to make it sound like what they achieved is somehow magical (or a miracle), but it is just standard design and path optimization. There is always a critical path, and once you fix that path, a bunch more paths become the limiters. Repeat until you run out of R&D money or time. I don't know if the deeper pipeline is a latency issue for GPU graphics the way it is for CPUs. Higher latency for increased throughput is probably better for most HPC, though.

Pushing clock speed with a single monolithic die has the same effect that Intel has counted on for CPUs. With high single-GPU (or single-CPU-thread) performance available, no one does multi-GPU (or multi-CPU) optimization, which ensures high profit margins for the single monolithic GPU or CPU core. This also prevents competition from other companies that perhaps can't afford all of the custom design needed to reach such high clock speeds. In my opinion, AMD has done more to push the market forward (Mantle, specifically) while Intel and Nvidia have just held it back, trying to protect their profit margins. At least we get a lot of well-threaded CPU code these days, since Intel's clock speed push ran out of steam. This isn't actually saying anything good about AMD, though. They had to push the market forward with Mantle; what other choice did they have? We probably got DX12 mostly because the Xbox One, with 8 low-power CPU cores and AMD graphics, was at a disadvantage with DX11. If it had been Nvidia graphics instead, and an inefficient (power consumption) "fat" CPU core, we probably would have been stuck with DX11 (mostly single threaded) for a long time.

Being a fanboy to support Intel or Nvidia profits is very stupid, unless you are a shareholder. If you are a shareholder, then I would say go back to the stock market forums where you belong so enthusiasts can actually have a discussion about the technology. At this point, as an enthusiast, supporting Nvidia is just holding the technology back. Nvidia is obviously trying to downplay multi-GPU because it may help AMD. It reminds me of the launch of the Pentium 4, when Intel tried to kill DDR memory because they had chosen Rambus memory. The Pentium 4 didn't do that badly with Rambus, but it was absolutely terrible with SDR, yet Intel still released a terrible product with SDR rather than DDR to try to kill off DDR memory, because ramping up DDR would help AMD's K8. Multi-GPU is where we are headed eventually, just like with CPUs; pushing clock speed is not a long-term solution, especially if we are stuck on the same process technology for a long time. The quicker we can move to multi-GPU the better.

July 17, 2016 | 07:20 PM - Posted by Theo (not verified)

With node shrinks coming fewer and farther between Nvidia will probably adopt a tick-tock strategy just like Intel. Pascal isn't a big architectural jump; it basically carries over the same shader design to 16nmFF with scheduling and geometry processing improvements. Volta will likely be a big architectural jump just like Maxwell was on 28nm.

July 18, 2016 | 05:50 PM - Posted by Ophelos

If you watch this video, the guy says that they're working on 10nm GPUs.

https://www.youtube.com/watch?v=pRz_CG3DZb4

July 23, 2016 | 04:07 PM - Posted by Anonymous (not verified)

If you are talking about the 8:08 mark, the guy in the video is talking about 10nm test chips; it was never a secret that Nvidia is doing test chips on 7nm and 10nm. For example:
http://www.fool.com/investing/2016/07/04/nvidia-corporation-and-the-volt...
