The Architecture of NVIDIA's RTX GPUs - Turing Explored
A Look Back and Forward
Although NVIDIA's new GPU architecture, Turing, has been the subject of speculation for what seems like an eternity at this point, we finally have our first look at exactly what NVIDIA is positioning as the future of gaming.
Unfortunately, we can't talk about this card just yet, but we can talk about what powers it.
First though, let's take a look at the journey to get here over the past 30 months or so.
Unveiled in early 2016 and marked by the launch of the GTX 1070 and 1080, Pascal was NVIDIA's long-awaited 16nm successor to Maxwell. Constrained by the oft-delayed 16nm process node, Pascal refined the shader unit design originally found in Maxwell, while lowering power consumption and increasing performance.
Next, in May 2017 came Volta, the next (and last) GPU architecture outlined in NVIDIA's public roadmaps since 2013. However, instead of the traditional launch with a new GeForce gaming card, Volta saw a different approach.
Launching with the Tesla V100, and later expanding to the TITAN and Quadro lines, it eventually became clear that at least this initial iteration of Volta was targeted at high-performance computing. A record-breaking CUDA core count of 5120, HBM 2.0 memory, and deep learning acceleration in the form of fixed function hardware called Tensor cores resulted in a large silicon die, unoptimized for gaming.
This left gamers looking for the next generation of GeForce products in a state of confusion as to whether we would ever see Volta GPUs for consumers.
In reality, Volta seemingly was never intended for gamers in any form; instead, it marks the departure of NVIDIA's high-end compute-focused GPUs from its gaming offerings.
Instead, NVIDIA now has the ability to run two different GPU microarchitectures, targeted at two vastly different markets, in parallel: Volta for high-end compute and deep learning applications, and Turing for their bread-and-butter industry, gaming.
This means instead of tailoring a single architecture for the best compromise between these different workloads, NVIDIA will be able to adapt their GPU designs to best suit each application.
A distinct lack of performance or technical details at the RTX 2080 and 2080 Ti announcement has led to rampant internet speculation that Turing is merely Pascal 2.0 on 12nm, with some extra dedicated hardware for ray tracing and deep learning. However, this couldn't be further from the truth.
At the heart of Turing is the all-new Turing Streaming Multiprocessor (SM). Split into four distinct processing blocks, the Turing SM is a departure from the SM design seen previously in Maxwell and Pascal.
In each of these blocks, you'll find 16 FP32 cores, 16 INT32 cores, two Tensor cores, one warp scheduler, and one dispatch unit. Additionally, there is one RT core in every SM, for ray tracing acceleration. Notable here is the addition of dedicated INT32 cores, able to execute INT32 and FP32 instructions simultaneously, as first seen in Volta.
Simultaneous execution of these different workloads allows for more efficient use of the execution units, requiring fewer clock cycles to achieve the same amount of work. This ability is enabled by the redesigned memory interface of the Turing SM.
Previously, as seen in Pascal, shared memory and the L1 cache were separate structures. In Turing, NVIDIA has moved to a new unified memory architecture, allowing a larger L1 cache that is configurable on the fly between a 64KB L1 (+32KB shared memory) and a 32KB L1 (+64KB shared memory) split. Additionally, the L2 cache doubles from 3MB to 6MB.
The increase (up to 2.7x) in L1 size and addressability results in what NVIDIA claims is a 2x increase in L1 hit bandwidth, as well as lower L1 hit latency.
Overall, NVIDIA credits the combination of simultaneous INT32 and FP32 execution and the reworked memory architecture with a 50% per-SM performance improvement when compared to Pascal.
In addition to the layout changes of the memory subsystem, Turing also brings a new memory interface: GDDR6.
Operating at 14 Gbps, GDDR6 provides an almost 30% memory speedup when compared to the GTX 1080 Ti, which runs its GDDR5X memory at 11 Gbps, while remaining 20% more power efficient than Pascal GPUs.
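The arithmetic behind that figure is straightforward; here is a quick sketch, assuming the 352-bit memory bus used by both the GTX 1080 Ti and the RTX 2080 Ti:

```python
# Peak memory bandwidth from per-pin data rate and bus width.
# 352-bit bus is an assumption taken from the two cards compared above.

def bandwidth_gb_s(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s: data rate per pin times bus width in bytes."""
    return data_rate_gbps * bus_width_bits / 8

gddr5x = bandwidth_gb_s(11, 352)  # GTX 1080 Ti, GDDR5X at 11 Gbps
gddr6 = bandwidth_gb_s(14, 352)   # RTX 2080 Ti, GDDR6 at 14 Gbps

print(f"GDDR5X: {gddr5x:.0f} GB/s")          # 484 GB/s
print(f"GDDR6:  {gddr6:.0f} GB/s")           # 616 GB/s
print(f"speedup: {gddr6 / gddr5x - 1:.0%}")  # 27%
```

The roughly 27% gain comes entirely from the higher per-pin data rate, since the bus width is unchanged.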
Through careful board layout considerations, NVIDIA claims a 40% reduction in signal crosstalk compared to GDDR5X implementations, which is part of how they achieve these higher transfer rates.
Building upon Pascal’s memory compression techniques, NVIDIA also claims a substantial improvement in memory compression compared with Pascal, which, combined with the faster GDDR6 memory, results in an overall 50% higher effective memory bandwidth for Turing.
Having seen their introduction in NVIDIA’s Volta architecture, Tensor cores also see some significant changes in Turing.
In addition to supporting FP16, Turing’s Tensor cores add INT8 and INT4 precision modes. While at the moment there aren’t any real-world uses for these lower-precision modes, NVIDIA is hoping the smaller data types will find applications in gaming, where lower accuracy may be more acceptable than in scientific research.
Since the throughput of these modes scales linearly with precision, the 110 TFLOPS of FP16 performance translates into 220 TOPS of INT8 and 440 TOPS of INT4, offering opportunities for massive speedups in workloads that don’t need the precision afforded by FP16.
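The linear scaling means the quoted numbers follow directly from the element width; a quick sketch:

```python
# Tensor core throughput scales inversely with element width: halving
# the bits per element doubles ops per cycle. The 110 TFLOPS FP16
# baseline is the figure quoted above for the RTX 2080 Ti.

FP16_TFLOPS = 110

def tensor_throughput(bits: int, fp16_tflops: float = FP16_TFLOPS) -> float:
    """Throughput (in tera-ops/s) for a given element width, relative to FP16."""
    return fp16_tflops * 16 / bits

print(tensor_throughput(16))  # 110.0 (TFLOPS, FP16)
print(tensor_throughput(8))   # 220.0 (TOPS, INT8)
print(tensor_throughput(4))   # 440.0 (TOPS, INT4)
```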
The truly all-new hardware present in Turing is the RT core. Meant to accelerate ray tracing operations, the RT cores are the key to the RTX 2080 and 2080 Ti’s namesake, the NVIDIA RTX real-time ray tracing API.
From a data structures perspective, one of the most common ways to accomplish ray tracing at the moment is through the use of something called a Bounding Volume Hierarchy (BVH).
While a more brute-force approach to ray tracing would calculate whether every individual cast ray intersects with every triangle in the scene, BVHs are a more efficient way to solve this problem.
At a high level, a BVH is a data structure made up of groups of triangles in a given object or scene. Triangles are grouped into a hierarchy of bounding volumes so that fewer intersection tests are needed to determine which triangles any given ray hits.
While BVHs speed up ray tracing on any given hardware compared to classical brute-force methods, NVIDIA has built hardware in Turing to specifically accelerate BVH traversal, in the form of what they are calling RT cores.
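To make the pruning concrete, here is a minimal, hypothetical BVH traversal sketch. It is not how the RT cores work internally; it only illustrates how descending a hierarchy of bounding boxes lets a ray skip whole groups of triangles:

```python
# Simplified BVH: internal nodes hold axis-aligned bounding boxes,
# leaves hold triangle indices. Traversal only descends into boxes
# the ray actually intersects, pruning the rest of the scene.
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class BVHNode:
    lo: Vec3                              # min corner of bounding box
    hi: Vec3                              # max corner of bounding box
    left: Optional["BVHNode"] = None
    right: Optional["BVHNode"] = None
    triangles: Optional[List[int]] = None  # triangle indices (leaves only)

def ray_hits_box(origin: Vec3, inv_dir: Vec3, lo: Vec3, hi: Vec3) -> bool:
    """Slab test: does the ray intersect the axis-aligned box?"""
    tmin, tmax = 0.0, float("inf")
    for axis in range(3):
        t1 = (lo[axis] - origin[axis]) * inv_dir[axis]
        t2 = (hi[axis] - origin[axis]) * inv_dir[axis]
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmin <= tmax

def traverse(node: Optional[BVHNode], origin: Vec3, inv_dir: Vec3,
             hits: List[int]) -> None:
    """Collect triangles whose enclosing boxes the ray touches."""
    if node is None or not ray_hits_box(origin, inv_dir, node.lo, node.hi):
        return  # whole subtree pruned with a single box test
    if node.triangles is not None:
        hits.extend(node.triangles)  # leaf: run per-triangle tests here
    traverse(node.left, origin, inv_dir, hits)
    traverse(node.right, origin, inv_dir, hits)

# Two leaves far apart on the x axis, under one root box.
left = BVHNode(lo=(0, 0, 0), hi=(1, 1, 1), triangles=[0, 1])
right = BVHNode(lo=(10, 0, 0), hi=(11, 1, 1), triangles=[2, 3])
root = BVHNode(lo=(0, 0, 0), hi=(11, 1, 1), left=left, right=right)

# Ray travelling mostly +y through the left box only.
origin = (0.5, -1.0, 0.5)
direction = (0.001, 1.0, 0.001)  # tiny x/z components avoid divide-by-zero
inv_dir = tuple(1.0 / d for d in direction)

hits: List[int] = []
traverse(root, origin, inv_dir, hits)
print(hits)  # [0, 1] - the right leaf's triangles were never tested
```

With many more triangles per leaf and a deeper tree, each failed box test prunes an entire subtree, which is the work the RT cores perform in fixed-function hardware.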
Able to run in parallel with the GPU's other operations, the RT cores can perform this BVH traversal while the shaders are rendering the rest of the scene, providing a massive speedup compared to traditional ray tracing methods.
The metric that NVIDIA has come up with to quantify ray tracing performance is the “Giga Ray,” a billion rays cast per second. Through the use of RT cores, NVIDIA is claiming a 10x speedup from the GTX 1080 Ti to the RTX 2080 Ti, from just over 1 Giga Ray per second to 10 Giga Rays per second.
The GPU instructions are abstracted through the RTX API, which is compatible with both Microsoft’s DirectX Raytracing (DXR) API, and Vulkan’s Ray Tracing API (soon to come). As long as the developer can generate an appropriate BVH for the given scene, the Turing-based GPU will handle the calculation of ray intersections, which the game engine can then act upon in rendering the scene.
However, while NVIDIA has shown several impressive demos of RTX Ray tracing technology implemented in games such as Shadow of the Tomb Raider and Battlefield V, whether or not developers will implement DXR/RTX into their games remains a big question.