During the opening keynote to NVIDIA’s GPU Technology Conference, CEO Jen-Hsun Huang formally unveiled the latest GPU architecture and the first product based on it. The Tesla V100 accelerator is based on the Volta GPU architecture and features some amazingly impressive specifications. Let’s take a look.

  Tesla V100 GTX 1080 Ti Titan X (Pascal) GTX 1080 GTX 980 Ti TITAN X GTX 980 R9 Fury X R9 Fury
GPU GV100 GP102 GP102 GP104 GM200 GM200 GM204 Fiji XT Fiji Pro
GPU Cores 5120 3584 3584 2560 2816 3072 2048 4096 3584
Base Clock 1480 MHz 1417 MHz 1607 MHz 1000 MHz 1000 MHz 1126 MHz 1050 MHz 1000 MHz
Boost Clock 1455 MHz 1582 MHz 1480 MHz 1733 MHz 1076 MHz 1089 MHz 1216 MHz
Texture Units 320 224 224 160 176 192 128 256 224
ROP Units 128 (?) 88 96 64 96 96 64 64 64
Memory 16GB 11GB 12GB 8GB 6GB 12GB 4GB 4GB 4GB
Memory Clock 878 MHz (?) 11000 MHz 10000 MHz 10000 MHz 7000 MHz 7000 MHz 7000 MHz 500 MHz 500 MHz
Memory Interface 4096-bit (HBM2) 352-bit 384-bit G5X 256-bit G5X 384-bit 384-bit 256-bit 4096-bit (HBM) 4096-bit (HBM)
Memory Bandwidth 900 GB/s 484 GB/s 480 GB/s 320 GB/s 336 GB/s 336 GB/s 224 GB/s 512 GB/s 512 GB/s
TDP 300 watts 250 watts 250 watts 180 watts 250 watts 250 watts 165 watts 275 watts 275 watts
Peak Compute 15 TFLOPS 10.6 TFLOPS 10.1 TFLOPS 8.2 TFLOPS 5.63 TFLOPS 6.14 TFLOPS 4.61 TFLOPS 8.60 TFLOPS 7.20 TFLOPS
Transistor Count 21.1B 12.0B 12.0B 7.2B 8.0B 8.0B 5.2B 8.9B 8.9B
Process Tech 12nm 16nm 16nm 16nm 28nm 28nm 28nm 28nm 28nm
MSRP (current) lol $699 $1,200 $599 $649 $999 $499 $649 $549

While we are low on details today, it appears that the fundamental compute units of Volta are similar to that of Pascal. The GV100 has 80 SMs with 40 TPCs and 5120 total CUDA cores, a 42% increase over the GP100 GPU used on the Tesla P100 and 42% more than the GP102 GPU used on the GeForce GTX 1080 Ti. The structure of the GPU remains the same GP100 with the CUDA cores organized as 64 single precision (FP32) per SM and 32 double precision (FP64) per SM.

Click to Enlarge

Interestingly, NVIDIA has already told us the clock speed of this new product as well, coming in at 1455 MHz Boost, more than 100 MHz lower than the GeForce GTX 1080 Ti and 25 MHz lower than the Tesla P100.

Click to Enlarge

Volta adds in support for a brand new compute unit though, known as Tensor Cores. With 640 of these on the GPU die, NVIDIA directly targets the neural network and deep learning fields. If this is your first time hearing about Tensor, you should read up on its influence on the hardware markets, bringing forth an open-source software library for machine learning. Google has invested in a Tensor-specific processor already, and now NVIDIA throws its hat in the ring.

Adding Tensor Cores to Volta allows the GPU to do mass processing for deep learning, on the order of a 12x improvement over Pascal’s capabilities using CUDA cores only.

For users interested in standard usage models, including gaming, the GV100 GPU offers 1.5x improvement in FP32 computing, up to 15 TFLOPS of theoretical performance and 7.5 TFLOPS of FP64. Other relevant specifications include 320 texture units, a 4096-bit HBM2 memory interface and 16GB of memory on-module. NVIDIA claims a memory bandwidth of 900 GB/s which works out to 878 MHz per stack.

Maybe more impressive is the transistor count: 21.1 BILLION! NVIDIA claims that this is the largest chip you can make physically with today’s technology. Considering it is being built on TSMC's 12nm FinFET technology and has an 815 mm2 die size, I see no reason to doubt them.

Shipping is scheduled for Q3 for Tesla V100 – at least that is when NVIDIA is promising the DXG-1 system using the chip is promised to developers.

I know many of you are interested in the gaming implications and timelines – sorry, I don’t have an answer for you yet. I will say that the bump from 10.6 TFLOPS to 15 TFLOPS is an impressive boost! But if the server variant of Volta isn’t due until Q3 of this year, I find it hard to think NVIDIA would bring the consumer version out faster than that. And whether or not NVIDIA offers gamers the chip with non-HBM2 memory is still a question mark for me and could directly impact performance and timing.

More soon!!