NVIDIA TITAN V Review Part 2: Compute Performance

Author: Ryan Shrout
Manufacturer: NVIDIA

Rendering and Compute Performance

Luxmark 3.1 

OpenCL rendering performance is important for workstation-level graphics card hardware. Luxmark, one of the most widely used OpenCL performance tests, provides a good look at how different GPUs perform in typical OpenCL rendering workloads. For this test, we are using the "Hotel" scene in Luxmark, the most compute-intensive scene, consisting of almost 5 million triangles.

Going from the NVIDIA Titan Xp to the Titan V in Luxmark, we see a massive 78% increase in performance. Similarly, the Titan V manages to double the score of the AMD Radeon Vega 64.

Cinebench R15

The performance depends on various factors, such as the GPU on your hardware and the drivers used. The graphics card has to display a huge amount of geometry (nearly 1 million polygons) and textures, as well as a variety of effects, such as environments, bump maps, transparency, lighting and more, to evaluate the performance across different disciplines and give a good average overview of the capabilities of your graphics hardware. The result is measured in frames per second (fps). The higher the number, the faster your graphics card is.

While the CPU portion of Cinebench is a tool we often use to evaluate the performance of new processors, it also offers an OpenGL test for assessing GPU performance.

Although both the NVIDIA offerings are about 50% faster than the AMD Vega 64, there's virtually no difference between the Titan V and the previous generation Titan Xp in this test.

V-Ray Benchmark

V-Ray is a popular third-party renderer that plugs into the most powerful CAD and 3D modeling applications. With plugins for 3ds Max, Maya, Revit, Rhino, and more, V-Ray is widely used for high-quality renderings in commercial applications such as architecture and product design.

V-Ray Benchmark is a free standalone application which allows users to evaluate hardware without having to install a full suite of software or provide a software license. For AMD GPUs, V-Ray Benchmark uses an OpenCL renderer, while for NVIDIA GPUs a CUDA-enabled renderer is used.

In V-Ray rendering, the Titan V is over 3.3x the speed of Vega 64, and over 25% faster than the previous generation Titan Xp.

SiSoft Sandra 2017 GPGPU Compute

SiSoft Sandra is a suite of benchmarks covering a wide array of system hardware and functionality, including an extensive range of GPGPU tests, which we are looking at today. 

The first GPGPU test in SiSoft Sandra evaluates the shader performance of a given GPU at different precision levels. 

While both the Titan Xp and AMD Vega 64 are within 25% of the performance of the Titan V in single precision workloads, the GV100 silicon in the Titan V obliterates every other GPU when it comes to double precision workloads. The Titan V is 14.7x faster than the Titan Xp and 6.6x faster than the Vega 64!

SiSoft Sandra 2017 GPGPU Financial Analysis

With the GPGPU Financial Analysis test in Sandra 2017, we see a more real-world application of double precision floating point performance: the Titan V delivers 8x the performance of the Vega 64 and over 13x the compute performance of the Titan Xp.

SiSoft Sandra 2017 GPGPU Scientific Analysis

In the Scientific Analysis test, we can see the increased double precision performance of the Titan V in the N-Body simulation and GMM workloads, but the playing field is more level in the single precision FFT workload.

SiSoft Sandra 2017 GPGPU Image Processing

The image processing test shows a different outcome than our previous benchmarks. AMD's Vega 64 GPU takes the lead here, almost 25% ahead of the Titan V. This is due to the benchmark being able to take advantage of half-precision (FP16) shaders.

The Vega architecture has the ability to run half-precision workloads at exactly double the rate of FP32 operations, which is a huge advantage in tasks that don't need high precision, as we see here. This is a good sign for the potential of developer optimization for Vega to change the competitive landscape.
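As a rough illustration of why image processing is the kind of task that tolerates half precision, the sketch below rounds values to FP16 using Python's standard `struct` module. The helper name `to_fp16` is our own, and none of this reflects how the Sandra benchmark is actually implemented; it only shows that 8-bit pixel values fit in FP16 without loss, while larger integers start to lose precision.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Every 8-bit pixel value (0..255) survives a round trip through FP16 unchanged.
pixels_exact = all(to_fp16(float(v)) == float(v) for v in range(256))

# FP16 carries an 11-bit significand, so consecutive integers are only
# exactly representable up to 2048; search for the first one that is not.
first_inexact = next(n for n in range(1, 4096) if to_fp16(float(n)) != float(n))

print(pixels_exact)   # True
print(first_inexact)  # 2049
```

Intermediate results in a real filter can of course exceed the 8-bit range, so whether FP16 is sufficient still depends on the specific operation.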

December 15, 2017 | 01:04 PM - Posted by MoreTensorCoreBenchmarksSVP (not verified)

Are the current benchmarks using much of the Tensor Cores yet? I'm seeing one sort of AI usage for graphics software(1) so maybe those Tensor Cores can have graphics application usage. I'd like to know if there are any graphics filter plugins that may make use of the Tensor Cores' matrix math functionality in a non-AI-related way, the same way that a GPU's shader cores can be used for other compute-related usage. Those Tensor Cores have uses other than AI if they are good at matrix math, as lots of current graphics application filtering makes use of matrix math, and Tensor Cores are just dedicated hardware for accelerating Matrix/Tensor (2D, 3D matrix structures) calculations.


"Adobe previews new Photoshop feature that uses AI to select subjects"

December 15, 2017 | 01:56 PM - Posted by Jabbadap

What is the Direct3d 12 feature set for Volta, is it still the same as Pascal or full?

Could you run feature set checker for it:

December 15, 2017 | 04:28 PM - Posted by asH95 (not verified)

I don't know where you got your Vega Frontier Viewperf scores but these are mine, and they aren't even my best.

3Dsmax-05 - 151.01
catia-04 - 141.85
creo-01 - 90.74
energy-01 - 21.24
maya-04 - 108.80
medical-01 - 101.08
showcase-01 - 127
snx-02 - 164.98
sw-03 - 118.05

They should be in the Viewperf database

December 17, 2017 | 09:06 PM - Posted by RGB LED fan (not verified)

They might not have a Vega Frontier card anymore, so their results are from 6 months ago. I imagine drivers would have improved the numbers a bit since then.

Ryan's original Vega FE review, with viewperf numbers that exactly match the ones quoted in this article:

December 15, 2017 | 04:50 PM - Posted by asH95 (not verified)

When/where do you mention this is a 12nm part???

The foundry claims that 12FDX can deliver “15 percent more performance over current FinFET technologies” with “50 percent lower power consumption,” at a cost lower than existing 16nm FinFET devices.

December 15, 2017 | 04:54 PM - Posted by asH95 (not verified)

ignore last comment that's GloFo 12nm spec not TSMC

Although TSMC’s 12nm process was originally planned to be introduced as a fourth-generation 16nm optimization, it will now be introduced as an independent process technology instead. Three of the company’s partners have already received tape-outs on 10nm designs and the process is expected to start generating revenues by early 2017. Apple and MediaTek are likely to be the first with 10nm TSMC-based products, while the 12nm node should become a useful enhancement to fill the competition gap before more partners are capable of building 10nm chips.

December 16, 2017 | 08:58 AM - Posted by Aparsh335i (not verified)

I assume you meant early 2018, not 2017 - right?

December 16, 2017 | 10:36 AM - Posted by asH95 (not verified)

Titan V is a 12nm part

December 16, 2017 | 12:54 PM - Posted by IfYasDontHaveTheROPsYaDontsGetTheFPS (not verified)

But Volta has, on that 12nm process, the same number of ROPs(1) as the GP102 based top end variants: 96 ROPs. So most of Volta's performance gains in gaming can be attributed to process node shrink/tweaks and Volta/GV100's larger L2 cache etc. And the GPU makers need to stop treating their ROP/TMU units as magic black boxes and start providing more information.

I'm suspecting that both AMD's and Nvidia's ROP technology does not change much "Generation" to "Generation" and that it's mostly the shader to ROP to TMU unit ratios that are giving the better gaming performance, and that AMD just needs to take the Vega Micro-Arch base die design and refactor the Shader to TMU to ROP ratios more towards gaming workloads. And really it's that higher ROP count that is most responsible for Nvidia's better FPS metrics relative to AMD's current Vega 10 base die design/blueprints.

Nvidia has all of its many base GPU designs/blueprints with GP100/GV100 being more compute heavy and GP102/GV102 likewise but with a little less shader resources for professional graphics workloads. GP102/GV100 dies have the most ROPs available with the GP104/GV104 dies starting out the gaming focused SKUs with their Shader counts really trimmed back on down to the GP106-108/GV106-108 SKUs that have even less resources and much narrower busses.

The GPU's micro-arch does not play much of any role in the matter as it's the execution resources on the Shaders and ROPs, TMUs and any other hardware functionality tuned for graphics, raster and geometry and trigonometry workloads. The ROPs are where it all comes together to be put together and rendered and that's Nvidia's strong point in that Nvidia uses more ROPs and keeps some base GPU die designs in the hold to bring out for gaming usage if AMD's offerings start getting a little bit too close in gaming performance.

AMD has only one base die design for Desktop and Professional compute with the Vega 10 die/blueprints so AMD has one and Nvidia has many and AMD's Vega 10 has no extra ROPs to speak of. Nvidia has loads of die designs with way more ROPs to be made available like when Nvidia took the GP102 die/blueprints and those 96 available ROPs and spun out a GP102 based GTX 1080Ti(88 ROPs) because the Vega 10 die variants(Vega 64/56, both have 64 ROPs) were getting very good in competing with the GTX 1080/GTX 1070 that likewise have 64 ROPs. AMD has no extra base Vega die variants with different ratios of Shaders to TMUs to ROPs and even though Vega has way more TMU resources compared to Pascal it's those extra ROPs that Nvidia can bring out that keep Nvidia in the FPS lead and Nvidia does have higher clocks also.

AMD's high power usage has less to do with its GPU micro-archs and more to do with having only one compute heavy base GPU design with loads of power hungry shaders to work with that has to be tuned for compute more than gaming. So Vega 10's ROPs available are minimal and more shaders are substituted because of compute requirements of the professional markets.

According to Wikipedia:

"The render output unit, often abbreviated as "ROP", and sometimes called (perhaps more properly) raster operations pipeline, is a hardware component in modern graphics processing units (GPUs) and one of the final steps in the rendering process of modern graphics cards. The pixel pipelines take pixel (each pixel is a dimensionless point), and texel information and process it, via specific matrix and vector operations, into a final pixel or depth value. This process is called rasterization. So ROPs control antialiasing, when more than one sample is merged into one pixel. The ROPs perform the transactions between the relevant buffers in the local memory – this includes writing or reading values, as well as blending them together. Dedicated antialiasing hardware used to perform hardware-based antialiasing methods like MSAA is contained in ROPs.

" All data rendered has to travel through the ROP in order to be written to the framebuffer, from there it can be transmitted to the display.

Therefore, the ROP is where the GPU's output is assembled into a bitmapped image ready for display.

Historically the number of ROPs, TMUs, and shader processing units/stream processors have been equal. However, from 2004, several GPUs have decoupled these areas to allow optimum transistor allocation for application workload and available memory performance. As the trend continues, it is expected that graphics processors will continue to decouple the various parts of their architectures to enhance their adaptability to future graphics applications. This design also allows chip makers to build a modular line-up, where the top-end GPUs are essentially using the same logic as the low-end products." (1)


"Render output unit"(raster operations pipeline)

December 16, 2017 | 11:56 AM - Posted by asH95 (not verified)

NVIDIA Volta Allegedly Launching In 2017 On 12nm FinFET Technology

by Ryan Smith May 10, 2017
NVIDIA Volta Unveiled: GV100 GPU and Tesla V100 Accelerator Announced

'But starting with the raw specifications, the GV100 is something I can honestly say is an audacious GPU, an adjective I’ve never had a need to attach to any other GPU in the last 10 years. In terms of die size and transistor count, NVIDIA is genuinely building the biggest GPU they can get away with: 21.1 billion transistors, at a massive 815mm2, built on TSMC’s still green 12nm “FFN” process (the ‘n’ stands for NVIDIA; it’s a customized higher perf version of 12nm for NVIDIA).'

December 16, 2017 | 01:16 PM - Posted by IfYasDontHaveTheROPsYaDontsGetTheFPS (not verified)

Yes it's big but what percentage of 815mm2 is usable for gaming focused workloads. And Nvidia has no extra ROP counts above 96 on Volta GV100/GV102 relative to GP100/GP102.

So the Volta GV104 variants(GTX ##80/GTX ##70) better start out with more than 64 ROPs(GTX 1080/1070) and up to 88 ROPs or Nvidia's previous generation SKUs with 88 ROPs(GTX 1080Ti/GP102 based) will compete very well with Volta/GV104. GV104 is not going to have that large L2 cache that GV100/102 may have available and GV104 has to get the shader counts trimmed or power usage is going to be higher. GV100 actually has some TMU resources in its design to top AMD's TMU heavy Vega designs. But Volta's top allotment of ROPs has not changed so maybe AMD can get its ROP counts up to 96 on some new base die variant that makes use of the Vega GPU micro-arch with just the right amount of shaders for gaming focused workloads.

December 17, 2017 | 07:58 AM - Posted by Jabbadap

ROPs count is tied to L2 cache and memory busses, Titan V is neutered by one channel and it has less L2 than full GV100(4.5M vs 6M) so it's safe to assume full gv100 has rop count of 128 or in other words 32 per hbm2 chip. In that way it sounds similar to pascal gp100.

ROP count for their Gddr cards has been 8 Rops per 32bit channel, which I don't believe will change. That would make 32 for 128bit card(~gv107), 48 for 192bit card(~gv106) and 64 for 256bit card(~gv104) and 96 for 384bit card(~gv102).
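The rule of thumb in the comment above can be sketched as a few lines of Python. The constant of 8 ROPs per 32-bit channel is the commenter's assumption rather than a published specification, and the function name is made up for illustration. The 352-bit entry is the GTX 1080 Ti's bus width, added here as a cross-check against the 88 ROPs mentioned earlier in the thread.

```python
# Commenter's rule of thumb for GDDR cards, not an official NVIDIA figure.
ROPS_PER_32BIT_CHANNEL = 8


def rops_for_bus_width(bus_width_bits):
    """Estimate ROP count from memory bus width under the 8-per-channel rule."""
    channels, remainder = divmod(bus_width_bits, 32)
    if remainder:
        raise ValueError("bus width must be a multiple of 32 bits")
    return channels * ROPS_PER_32BIT_CHANNEL


for width in (128, 192, 256, 352, 384):
    print(f"{width}-bit bus -> {rops_for_bus_width(width)} ROPs")
```

This reproduces the 32, 48, 64 and 96 figures quoted for 128-bit, 192-bit, 256-bit and 384-bit cards.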

December 17, 2017 | 04:40 PM - Posted by VoltaNeedsMoreROPsThanPascal (not verified)

Yes but Titan V(at 4.5 MB L2) has 1.5MB more L2 cache than Titan Xp with both having 96 ROPs available and even missing that one HBM2 stack there is still plenty of extra bandwidth coming from the 3 remaining HBM2 stacks.

Titan V's Gaming Performance is not that much higher than Titan Xp's and the Volta based GTX 1180/2080(Whatever numbering they use) is going to need to be as fast at gaming as the GTX 1080Ti to give Nvidia users a reason to upgrade. Nvidia is like Intel in that Nvidia is competing with itself and AMD, and Nvidia's Volta GPUs need to outperform Nvidia's Pascal offerings and outperform AMD Vega offerings or Nvidia's customer base will not want to update to the latest. So if any Vega refresh gets Vega 64 closer to the GTX 1080Ti then the Volta GTX 1180/2080(Whatever) will have to compete with Vega and Pascal(GTX 1080Ti), so expect 88(GTX 1080Ti's ROP count) or more ROPs on the Volta GTX 1180/2080(Whatever) GV104 variants.

From Nvidia development Blog:

"Similar to the previous generation Pascal GP100 GPU, the GV100 GPU is composed of multiple Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and memory controllers. A full GV100 GPU consists of six GPCs, 84 Volta SMs, 42 TPCs (each including two SMs), and eight 512-bit memory controllers (4096 bits total). Each SM has 64 FP32 Cores, 64 INT32 Cores, 32 FP64 Cores, and 8 new Tensor Cores. Each SM also includes four texture units.

Figure 4: Volta GV100 Full GPU with 84 SM Units.

With 84 SMs, a full GV100 GPU has a total of 5376 FP32 cores, 5376 INT32 cores, 2688 FP64 cores, 672 Tensor Cores, and 336 texture units. Each memory controller is attached to 768 KB of L2 cache, and each HBM2 DRAM stack is controlled by a pair of memory controllers. The full GV100 GPU includes a total of 6144 KB of L2 cache. Figure 4 shows a full GV100 GPU with 84 SMs (different products can use different configurations of GV100). The Tesla V100 accelerator uses 80 SMs." (1)
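The totals in the quoted passage follow directly from the per-SM counts, and a short Python check makes the arithmetic explicit. All figures come from the quote; the function names are illustrative, and the three-stack case is an assumption based on the Titan V description earlier in this thread (one HBM2 stack disabled).

```python
# Per-SM resources and memory layout as quoted for GV100.
PER_SM = {
    "FP32 cores": 64,
    "INT32 cores": 64,
    "FP64 cores": 32,
    "Tensor Cores": 8,
    "texture units": 4,
}
L2_KB_PER_CONTROLLER = 768
CONTROLLERS_PER_HBM2_STACK = 2


def gpu_totals(sm_count):
    """Multiply each per-SM resource by the number of enabled SMs."""
    return {name: per_sm * sm_count for name, per_sm in PER_SM.items()}


def l2_cache_kb(hbm2_stacks):
    """L2 size implied by the number of active HBM2 stacks."""
    return hbm2_stacks * CONTROLLERS_PER_HBM2_STACK * L2_KB_PER_CONTROLLER


full_gv100 = gpu_totals(84)   # full die
tesla_v100 = gpu_totals(80)   # Tesla V100 configuration

print(full_gv100)
print(tesla_v100)
print(l2_cache_kb(4))  # four stacks: 6144 KB
print(l2_cache_kb(3))  # three stacks: 4608 KB, i.e. 4.5 MB
```

The three-stack result matches the 4.5 MB of L2 cited for the Titan V in the comments above.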

So for Tesla V100:

GPU__________________________________GV100 (Volta)
FP32 Cores / SM______________________64
FP32 Cores / GPU_____________________5120
FP64 Cores / SM______________________32
FP64 Cores / GPU_____________________2560
Tensor Cores / SM____________________8
Tensor Cores / GPU___________________640
GPU Boost Clock______________________1530 MHz
Peak FP32 TFLOP/s*___________________15.7
Peak FP64 TFLOP/s*___________________7.8
Peak Tensor Core TFLOP/s*____________125
Texture Units________________________320
Memory Interface_____________________4096-bit HBM2
Memory Size__________________________16 GB
L2 Cache Size________________________6144 KB
Shared Memory Size / SM______________Configurable up to 96 KB
Register File Size / SM______________256KB
Register File Size / GPU_____________20480 KB
TDP__________________________________300 Watts
Transistors__________________________21.1 billion
GPU Die Size_________________________815 mm²
Manufacturing Process________________12 nm FFN

* boost clock
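The peak throughput rows in the table can be reproduced from the core counts and the boost clock. The sketch below assumes one fused multiply-add (counted as two floating point operations) per core per clock, and 64 FMAs per Tensor Core per clock; that last figure is not in the table above and is taken from NVIDIA's public Volta description, so treat it as an assumption here.

```python
BOOST_CLOCK_HZ = 1530e6             # 1530 MHz, from the table
FLOPS_PER_FMA = 2                   # one multiply plus one add
FMA_PER_TENSOR_CORE_PER_CLOCK = 64  # assumption: 4x4x4 matrix FMA per clock


def peak_tflops(units, fma_per_unit_per_clock=1):
    """Theoretical peak in TFLOP/s at the boost clock."""
    ops_per_second = units * fma_per_unit_per_clock * FLOPS_PER_FMA * BOOST_CLOCK_HZ
    return ops_per_second / 1e12


fp32 = peak_tflops(5120)
fp64 = peak_tflops(2560)
tensor = peak_tflops(640, FMA_PER_TENSOR_CORE_PER_CLOCK)

print(f"FP32:   {fp32:.1f} TFLOP/s")
print(f"FP64:   {fp64:.1f} TFLOP/s")
print(f"Tensor: {tensor:.1f} TFLOP/s")
```

These are theoretical ceilings at boost clock, which is why the benchmark results earlier in the review do not scale by the same ratios.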


"Inside Volta: The World’s Most Advanced Data Center GPU"

December 18, 2017 | 07:18 AM - Posted by Martin (not verified)

Is there a good overview somewhere about what these compute tests rely on or which strengths-weaknesses of GPUs they expose?

There are some results that do not sound very right. Just as an example:
- Some of the ties with Titan Xp are suspicious.
- SiSoft Sandra 2017 GPGPU Image Processing test is said to do well on Vega due to FP16. Volta is definitely claimed to have the same double rate of FP16 compared to FP32 so is that not exposed in API or does the benchmark not implement that?

December 18, 2017 | 01:35 PM - Posted by BenchmarksAreNeverCompletelyUpToDate (not verified)

Volta also has those tensor cores that are just matrix math units so that's a whole lot of 16 bit math via matrix operations that can be made to do direct image processing workloads and via the TensorFlow libraries do AI accelerated image processing workloads.

Vega can make use of the same TensorFlow libraries but Vega has no Tensor Cores to accelerate matrix math operations like Volta currently has.

So Adobe/other software can use Volta's tensor cores for AI and identify people in an image/video and automatically mask out the background via AI TensorFlow library functions, or graphics software can make use of the Tensor Cores for old-fashioned matrix math that is used by graphics applications for effects/filtering.

I'd expect that now that Volta/Titan V is available, plenty of developers are using Titan V and also the Vega Founders edition SKUs which cost much less than the Titan V. The Vega FE can be purchased for around $730 to $850 on sale and 3 FEs pack plenty of FP16 and FP32 performance for less than the price of the Titan V. And 4 Vega FEs at the $730 price point give plenty more FP16 and FP32 for a bit less than $3000(Titan V).

The TensorFlow libraries are what make for the AI and not the specific hardware and you could run AI/TensorFlow workloads on CPUs but that's not very efficient compared to GPUs with their massively parallel shader cores or with Volta its Tensor Cores(matrix math cores). Hell those tensor cores on Volta probably would do great for accelerating spreadsheet workloads if the numbers are not too large.

The benchmarks are always behind the hardware anyways, especially any just released hardware. The graphics software ecosystem takes months to catch up and the tweaking never ceases even after the next generation arrives if any more efficient way of doing the calculations is discovered for older hardware.

January 1, 2018 | 07:20 PM - Posted by Anonymous90250-A (not verified)

The key high-cost operation of statistical learning algorithms (aka "deep learning" if you like buzzwords) is optimization - i.e. finding the point in a high-dimensional state space where a let's-just-call-it-positive-valued "loss function" obtains its minimum value.

Long story short, this means either inverting high-dimensional matrices or performing clever and/or unnatural acts to avoid directly inverting those matrices (look up Newton's method from Calc 101 to get an idea why).

These methods are quite sensitive to round-off error and are generally recommended to be performed in double.

The publicly available Nvidia documentation shows that the tensor cores' principal function is an affine operation of multiplying a pair of 4-dimensional 16-bit vector-matrix elements, optionally adding a 4-D vector-matrix offset (sound familiar graphics people?) and storing the result in a 32-bit result (because multiplication).
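The operation described above amounts to D = A x B + C on 4x4 matrices with 16-bit inputs and a 32-bit result. A minimal software emulation in plain Python might look like the following; it uses the `struct` module to round to FP16 and FP32, the function names are hypothetical, and the rounding after every accumulation step is an assumption, since the hardware's internal rounding order is not described in this thread.

```python
import struct


def to_fp16(x):
    """Round to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def tensor_core_mma(a, b, c):
    """Emulate D = A x B + C for 4x4 matrices: FP16 inputs, FP32 accumulation."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values fits exactly in FP32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zero = [[0.0] * 4 for _ in range(4)]
a = [[0.1 * (i + j) for j in range(4)] for i in range(4)]

d = tensor_core_mma(a, identity, zero)
# Multiplying by the identity exposes the FP16 input rounding on its own.
print(a[0][1], d[0][1])
```

Even this trivial case shows the input round-off the comment is worried about: the value 0.1 does not survive the conversion to FP16 unchanged.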

This may be useful in the so-called "inference phase" of "deep learning", however, "inference phase" is likewise a fairly recent buzzword in the field which very explicitly means the part of the process where no learning is taking place. Doing inference quickly is important for real-time controls such as driving a car, but in real time systems where an error can kill people, I would personally opt for avoiding roundoff error.

Furthermore, Deep Learning frameworks - particularly those pushed by hardware vendors - have had their evolution shaped by the hardware that runs them fastest, or, in the somewhat more sobering case, their evolution has been shaped by the hardware that is promoted by online courses in Deep Learning that treat learning as a grey-box to be implemented with frameworks.

What should be astronomically improved is operations that require double - i.e. learning phase. However that is not what you hear if you go poking around tests being done by the deep learning community. Google it for yourself.

Finally, there's the question of why it is that the Xp can only do single floats while only the many-times more expensive V100, GP100, and Titan V can do doubles. Quadro users in the CG community have been complaining about similar discrepancies for years. They seem to have some well-substantiated theories regarding the cause of the inability of lower-priced cards to do 64 bit.

Bottom line, none of these benchmarks really have much to say about the performance of the Titan V relative to the bold and profound claims of its marketing materials.

I am fairly confident we will soon see benchmarks that show the value of FP64 in the Titan V, but the Tensor Cores will be more of a stretch. But the hardware vendors have recently been quite bullish on Deep Learning frameworks, and how they are a game changer that allows users to work at a high level without concerning themselves with what's under the hood.

And neither hardware vendors nor cloud services providers are particularly motivated to steer users away from practices that cause them to buy the much-more-expensive hardware nor from practices that cause them to use twenty times the cloud compute time that they otherwise might.

Food for thought: as our friends The Lawyers like to rhetorically ask, "Cui bono?"

December 21, 2017 | 12:00 AM - Posted by Jonathan (not verified)

Why was TITAN V's driver prohibited from using in the data center?

December 30, 2017 | 10:17 AM - Posted by AnonymousLocust (not verified)

Because a data center is not a PC Bang...
(hint is in the driver name "Nvidia Geforce Software")
