Subject: Systems | March 9, 2017 - 07:01 AM | Scott Michaud
Tagged: nvidia, microsoft, hgx-1, GP100, dgx-1
When NVIDIA announced the Pascal architecture at last year’s GTC, they started with the GP100 architecture that was to be initially available in their $129,000 DGX-1 PC. In fact, this device contained eight of those “Big Pascal” GPUs that are connected together by their NVLink interconnection.
Now, almost a full year later, Microsoft, NVIDIA, and Ingrasys have announced the HGX-1 system. It, too, will contain eight GP100 GPUs through eight Tesla P100 accelerators. On the CPU side of things, Microsoft is planning on utilizing the next generation of x86 processors, Intel Skylake (which we assume means Skylake-X) and AMD Naples in these "Project Olympus" servers. Future versions could also integrate Intel FPGAs for an extra level of acceleration. ARM64 is another goal of theirs, but in the more distant future.
At the same time, NVIDIA has also announced, through a single-paragraph statement, that they are joining the Open Compute Project. This organization contains several massive players in the data center market, spanning from Facebook to Rackspace to Bank of America.
Whenever it arrives, the HGX-1 will be intended for cloud-based AI computations. Four of these machines are designed to be clustered together at high bandwidth, which I estimate would have north of 160 TeraFLOPs of double-precision (FP64) or 670 TeraFLOPs of half-precision (FP16) performance in the GPUs alone, depending on final clocks.
NVIDIA P100 comes to Quadro
At the start of the SOLIDWORKS World conference this week, NVIDIA took the cover off of a handful of new Quadro cards targeting professional graphics workloads. Though the bulk of NVIDIA’s discussion covered lower cost options like the Quadro P4000, P2000, and below, the most interesting product sits at the high end, the Quadro GP100.
As you might guess from the name alone, the Quadro GP100 is based on the GP100 GPU, the same silicon used on the Tesla P100 announced back in April of 2016. At the time, the GP100 GPU was specifically billed as an HPC accelerator for servers. It had a unique form factor with a passive cooler that required additional chassis fans. Just a couple of months later, a PCIe version of the GP100 was released under the Tesla GP100 brand with the same specifications.
Today that GPU hardware gets a third iteration as the Quadro GP100. Let’s take a look at the Quadro GP100 specifications and how it compares to some recent Quadro offerings.
|Quadro GP100||Quadro P6000||Quadro M6000||Full GP100|
|FP32 CUDA Cores / SM||64||64||64||64|
|FP32 CUDA Cores / GPU||3584||3840||3072||3840|
|FP64 CUDA Cores / SM||32||2||2||32|
|FP64 CUDA Cores / GPU||1792||120||96||1920|
|Base Clock||1303 MHz||1417 MHz||1026 MHz||TBD|
|GPU Boost Clock||1442 MHz||1530 MHz||1152 MHz||TBD|
|FP32 TFLOPS (SP)||10.3||12.0||7.0||TBD|
|FP64 TFLOPS (DP)||5.15||0.375||0.221||TBD|
|Memory Interface||1.4 Gbps
|Memory Bandwidth||716 GB/s||432 GB/s||316.8 GB/s||?|
|Memory Size||16GB||24 GB||12GB||16GB|
|TDP||235 W||250 W||250 W||TBD|
|Transistors||15.3 billion||12 billion||8 billion||15.3 billion|
|GPU Die Size||610mm2||471 mm2||601 mm2||610mm2|
There are some interesting stats here that may not be obvious at first glance. Most interesting is that despite the pricing and segmentation, the GP100 is not the de facto fastest Quadro card from NVIDIA depending on your workload. With 3584 CUDA cores running at somewhere around 1400 MHz at Boost speeds, the single precision (32-bit) rating for GP100 is 10.3 TFLOPS, less than the recently released P6000 card. Based on GP102, the P6000 has 3840 CUDA cores running at something around 1500 MHz for a total of 12 TFLOPS.
GP100 (full) Block Diagram
Clearly the placement for Quadro GP100 is based around its 64-bit, double precision performance, and its ability to offer real-time simulations on more complex workloads than other Pascal-based Quadro cards can offer. The Quadro GP100 offers 1/2 DP compute rate, totaling 5.2 TFLOPS. The P6000 on the other hand is only capable of 0.375 TLOPS with the standard, consumer level 1/32 DP rate. Inclusion of ECC memory support on GP100 is also something no other recent Quadro card has.
Raw graphics performance and throughput is going to be questionable until someone does some testing, but it seems likely that the Quadro P6000 will still be the best solution for that by at least a slim margin. With a higher CUDA core count, higher clock speeds and equivalent architecture, the P6000 should run games, graphics rendering and design applications very well.
There are other important differences offered by the GP100. The memory system is built around a 16GB HBM2 implementation which means more total memory bandwidth but at a lower capacity than the 24GB Quadro P6000. Offering 66% more memory bandwidth does mean that the GP100 offers applications that are pixel throughput bound an advantage, as long as the compute capability keeps up on the backend.
Is Enterprise Ascending Outside of Consumer Viability?
So a couple of weeks have gone by since the Quadro P6000 (update: was announced) and the new Titan X launched. With them, we received a new chip: GP102. Since Fermi, NVIDIA has labeled their GPU designs with a G, followed by a single letter for the architecture (F, K, M, or P for Fermi, Kepler, Maxwell, and Pascal, respectively), which is then followed by a three digit number. The last digit is the most relevant one, however, as it separates designs by their intended size.
Typically, 0 corresponds to a ~550-600mm2 design, which is about as larger of a design that fabrication labs can create without error-prone techniques, like
multiple exposures (update for clarity: trying to precisely overlap multiple designs to form a larger integrated circuit). 4 corresponds to ~300mm2, although GM204 was pretty large at 398mm2, which was likely to increase the core count while remaining on a 28nm process. Higher numbers, like 6 or 7, fill back the lower-end SKUs until NVIDIA essentially stops caring for that generation. So when we moved to Pascal, jumping two whole process nodes, NVIDIA looked at their wristwatches and said “about time to make another 300mm2 part, I guess?”
The GTX 1080 and the GTX 1070 (GP104, 314mm2) were born.
NVIDIA already announced a 600mm2 part, though. The GP100 had 3840 CUDA cores, HBM2 memory, and an ideal ratio of 1:2:4 between FP64:FP32:FP16 performance. (A 64-bit chunk of memory can store one 64-bit value, two 32-bit values, or four 16-bit values, unless the register is attached to logic circuits that, while smaller, don't know how to operate on the data.) This increased ratio, even over Kepler's 1:6 FP64:FP32, is great for GPU compute, but wasted die area for today's (and tomorrow's) games. I'm predicting that it takes the wind out of Intel's sales, as Xeon Phi's 1:2 FP64:FP32 performance ratio is one of its major selling points, leading to its inclusion in many supercomputers.
Despite the HBM2 memory controller supposedly being actually smaller than GDDR5(X), NVIDIA could still save die space while still providing 3840 CUDA cores (despite disabling a few on Titan X). The trade-off is that FP64 and FP16 performance had to decrease dramatically, from 1:2 and 2:1 relative to FP32, all the way down to 1:32 and 1:64. This new design comes in at 471mm2, although it's $200 more expensive than what the 600mm2 products, GK110 and GM200, launched at. Smaller dies provide more products per wafer, and, better, the number of defective chips should be relatively constant.
Anyway, that aside, it puts NVIDIA in an interesting position. Splitting the xx0-class chip into xx0 and xx2 designs allows NVIDIA to lower the cost of their high-end gaming parts, although it cuts out hobbyists who buy a Titan for double-precision compute. More interestingly, it leaves around 150mm2 for AMD to sneak in a design that's FP32-centric, leaving them a potential performance crown.
Image Credit: ExtremeTech
On the other hand, as fabrication node changes are becoming less frequent, it's possible that NVIDIA could be leaving itself room for Volta, too. Last month, it was rumored that NVIDIA would release two architectures at 16nm, in the same way that Maxwell shared 28nm with Kepler. In this case, Volta, on top of whatever other architectural advancements NVIDIA rolls into that design, can also grow a little in size. At that time, TSMC would have better yields, making a 600mm2 design less costly in terms of waste and recovery.
If this is the case, we could see the GPGPU folks receiving a new architecture once every second gaming (and professional graphics) architecture. That is, unless you are a hobbyist. If you are? I would need to be wrong, or NVIDIA would need to somehow bring their enterprise SKU into an affordable price point. The xx0 class seems to have been pushed up and out of viability for consumers.
Or, again, I could just be wrong.
Subject: Graphics Cards | July 6, 2016 - 11:56 PM | Scott Michaud
Tagged: titan, pascal, nvidia, gtx 1080 ti, gp102, GP100
Normally, I pose these sorts of rumors as “Well, here you go, and here's a grain of salt.” This one I'm fairly sure is bogus, at least to some extent. I could be wrong, but especially the GP100 aspects of it just doesn't make sense.
Before I get to that, the rumor is that NVIDIA will announce a GeForce GTX Titan P at Gamescom in Germany. The event occurs mid-August (17th - 21st) and it has been basically Europe's E3 in terms of gaming announcements. It also overlaps with Europe's Game Developers Conference (GDC), which occurs in March for us. The rumor says that it will use GP100 (!?!) with either 12GB of VRAM, 16GB of VRAM, or two variants as we've seen with the Tesla P100 accelerator.
The rumor also acknowledges the previously rumored GP102 die, claims that it will be for the GTX 1080 Ti, and suggests that it will have up to 3840 CUDA cores. This is the same number of CUDA cores as the GP100, which is where I get confused. This would mean that NVIDIA made a special die, which other rumors claim is ~450mm2, for just the GeForce GTX 1080 Ti.
I mean, it's possible that NVIDIA would split the GTX 1080 Ti and the next Titan by similar gaming performance, just with better half- and double-precision performance and faster memory for GPGPU developers. That would be a very weird to me, though, developing two different GPU dies for the consumer market with probably the same gaming performance.
And they would be announcing the Titan P first???
The harder to yield one???
When the Tesla version isn't even expected until Q4???
I can see it happening, but I seriously doubt it. Something may be announced, but I'd have to believe it will be at least slightly different from the rumors that we are hearing now.
Subject: Graphics Cards | June 20, 2016 - 01:57 PM | Scott Michaud
Tagged: tesla, pascal, nvidia, GP100
GP100, the “Big Pascal” chip that was announced at GTC, will be coming to PCIe for enterprise and supercomputer customers in Q4 2016. Previously, it was only announced using NVIDIA's proprietary connection. In fact, they also gave themselves some lead time with their first-party DGX-1 system, which retails for $129,000 USD, although we expect that was more for yield reasons. Josh calculated that each GPU in that system is worth more than the full wafer that its die was manufactured on.
This brings us to the PCIe versions. Interestingly, they have been down-binned from the NVLink version. The boost clock has been dropped to 1300 MHz, from 1480 MHz, although that is matched with a slightly lower TDP (250W versus the NVLink's 300W). This lowers the FP16 performance to 18.7 TFLOPs, down from 21.2, FP32 performance to 9.3 TFLOPs, down from 10.6, and FP64 performance to 4.7 TFLOPs, down from 5.3. This is where we get to the question: did NVIDIA reduce the clocks to hit a 250W TDP and be compatible with the passive cooling technology that previous Tesla cards utilize, or were the clocks dropped to increase yield?
They are also providing a 12GB version of the PCIe Tesla P100. I didn't realize that GPU vendors could selectively disable HBM2 stacks, but NVIDIA disabled 4GB of memory, which also dropped the bus width to 3072-bit. You would think that the simplicity of the circuit would want to divide work in a power-of-two fashion, but, knowing that they can, it makes me wonder why they did. Again, my first reaction is to question GP100 yield, but you wouldn't think that HBM, being such a small part of the die, is something that they can reclaim a lot of chips by disabling a chunk, right? That is, unless the HBM2 stacks themselves have yield issues -- which would be interesting.
There is also still no word on a 32GB version. Samsung claimed the memory technology, 8GB stacks of HBM2, would be ready for products in Q4 2016 or early 2017. We'll need to wait and see where, when, and why it will appear.
First, Some Background
NVIDIA's Rumored GP102
When GP100 was announced, Josh and I were discussing, internally, how it would make sense in the gaming industry. Recently, an article on WCCFTech cited anonymous sources, which should always be taken with a dash of salt, that claimed NVIDIA was planning a second architecture, GP102, between GP104 and GP100. As I was writing this editorial about it, relating it to our own speculation about the physics of Pascal, VideoCardz claims to have been contacted by the developers of AIDA64, seemingly on-the-record, also citing a GP102 design.
I will retell chunks of the rumor, but also add my opinion to it.
In the last few generations, each architecture had a flagship chip that was released in both gaming and professional SKUs. Neither audience had access to a chip that was larger than the other's largest of that generation. Clock rates and disabled portions varied by specific product, with gaming usually getting the more aggressive performance for slightly better benchmarks. Fermi had GF100/GF110, Kepler had GK110/GK210, and Maxwell had GM200. Each of these were available in Tesla, Quadro, and GeForce cards, especially Titans.
Maxwell was interesting, though. NVIDIA was unable to leave 28nm, which Kepler launched on, so they created a second architecture at that node. To increase performance without having access to more feature density, you need to make your designs bigger, more optimized, or more simple. GM200 was giant and optimized, but, to get the performance levels it achieved, also needed to be more simple. Something needed to go, and double-precision (FP64) performance was the big omission. NVIDIA was upfront about it at the Titan X launch, and told their GPU compute customers to keep purchasing Kepler if they valued FP64.
Subject: General Tech | April 28, 2016 - 01:47 PM | Jeremy Hellstrom
Tagged: nvidia, GP100, pascal
The Tech Report takes you on a walk through NVIDIA's HPC products to show you just what is interesting about the Tesla P100 HPC which Jen-Hsun Huang introduced us to. The background gives you an idea of how much has changed from their first forays into HPC to this new 16nm process, 610mm² chip with 56 SMs. If you missed out on the presentation or wanted some more information about how they pulled off FP16 on natively FP32 hardware or how the cache of this chip was set up then click on over and read it for yourself.
"Nvidia's GP100 "Pascal" GPU launched on the Tesla P100 HPC accelerator a couple weeks ago. Join us as we take an in-depth look at what we know about this next-generation graphics processor so far, and what it might mean for the consumer GeForces of the future."
Here is some more Tech News from around the web:
- Microsoft delivers new previews of Windows Server 2016 and System Centre 2016 @ The Inquirer
- Time for a patch: six vulns fixed in NTP daemon @ The Register
- Searching for USB Power Supplies that Won’t Explode @ Hack a Day
- Hackers so far ahead of defenders it's not even a game @ The Register
- Trouble at t'spinning rust mill: Disk drive production is about to head south @ The Register
- Tech ARP 2016 Power Bank Giveaway
Subject: General Tech | April 7, 2016 - 02:47 PM | Ken Addison
Tagged: VR, vive, video, tesla p100, steamvr, Spectre 13.3, rift, podcast, perfmon, pascal, Oculus, nvidia, htc, hp, GP100, Bristol Ridge, APU, amd
PC Perspective Podcast #394 - 04/07/2016
Join us this week as we discuss measuring VR Performance, NVIDIA's Pascal GP100, Bristol Ridge APUs and more!
The URL for the podcast is: http://pcper.com/podcast - Share with your friends!
- iTunes - Subscribe to the podcast directly through the Store (audio only)
- RSS - Subscribe through your regular RSS reader (audio only)
- MP3 - Direct download link to the MP3 file
This episode of the PC Perspective Podcast is sponsored by Lenovo!
Hosts: Ryan Shrout, Jeremy Hellstrom, Josh Walrath and Allyn Malventano
Program length: 1:32:19
Week in Review:
0:46:25 This week’s podcast is brought to you by Casper. Use code PCPER at checkout for $50 towards your order!
News items of interest:
Hardware/Software Picks of the Week
93% of a GP100 at least...
NVIDIA has announced the Tesla P100, the company's newest (and most powerful) accelerator for HPC. Based on the Pascal GP100 GPU, the Tesla P100 is built on 16nm FinFET and uses HBM2.
NVIDIA provided a comparison table, which we added what we know about a full GP100 to:
|Tesla K40||Tesla M40||Tesla P100||Full GP100|
|GPU||GK110 (Kepler)||GM200 (Maxwell)||GP100 (Pascal)||GP100 (Pascal)|
|FP32 CUDA Cores / SM||192||128||64||64|
|FP32 CUDA Cores / GPU||2880||3072||3584||3840|
|FP64 CUDA Cores / SM||64||4||32||32|
|FP64 CUDA Cores / GPU||960||96||1792||1920|
|Base Clock||745 MHz||948 MHz||1328 MHz||TBD|
|GPU Boost Clock||810/875 MHz||1114 MHz||1480 MHz||TBD|
|Memory Interface||384-bit GDDR5||384-bit GDDR5||4096-bit HBM2||4096-bit HBM2|
|Memory Size||Up to 12 GB||Up to 24 GB||16 GB||TBD|
|L2 Cache Size||1536 KB||3072 KB||4096 KB||TBD|
|Register File Size / SM||256 KB||256 KB||256 KB||256 KB|
|Register File Size / GPU||3840 KB||6144 KB||14336 KB||15360 KB|
|TDP||235 W||250 W||300 W||TBD|
|Transistors||7.1 billion||8 billion||15.3 billion||15.3 billion|
|GPU Die Size||551 mm2||601 mm2||610 mm2||610mm2|
|Manufacturing Process||28 nm||28 nm||16 nm||16nm|
This table is designed for developers that are interested in GPU compute, so a few variables (like ROPs) are still unknown, but it still gives us a huge insight into the “big Pascal” architecture. The jump to 16nm allows for about twice the number of transistors, 15.3 billion, up from 8 billion with GM200, with roughly the same die area, 610 mm2, up from 601 mm2.
A full GP100 processor will have 60 shader modules, compared to GM200's 24, although Pascal stores half of the shaders per SM. The GP100 part that is listed in the table above is actually partially disabled, cutting off four of the sixty total. This leads to 3584 single-precision (32-bit) CUDA cores, which is up from 3072 in GM200. (The full GP100 architecture will have 3840 of these FP32 CUDA cores -- but we don't know when or where we'll see that.) The base clock is also significantly higher than Maxwell, 1328 MHz versus ~1000 MHz for the Titan X and 980 Ti, although Ryan has overclocked those GPUs to ~1390 MHz with relative ease. This is interesting, because even though 10.6 TeraFLOPs is amazing, it's only about 20% more than what GM200 could pull off with an overclock.