
What you need to know about ray tracing and NVIDIA's Turing architecture

By Koh Wanzi - 14 Sep 2018

An introduction to Turing

What's new in Turing?

NVIDIA GeForce RTX 2080 Ti. (Image Source: NVIDIA)

Ray tracing is only being talked about now, though, because of NVIDIA's new Turing architecture. The Turing GPUs represent quite the departure from NVIDIA's traditional graphics architecture, featuring dedicated hardware capable of performing real-time ray tracing and deep learning operations that can help improve visual quality and performance.

Other additions include shading advancements like mesh shading, variable rate shading, and texture-space shading.

The Turing GPUs are also the first to utilize GDDR6 memory, which provides higher bandwidth and better power efficiency.

NVIDIA added hardware support for USB-C and VirtualLink as well. VirtualLink is a new open industry standard being developed to meet the power, display, and bandwidth demands of next-generation VR headsets through a single USB-C connector. If it gains widespread adoption, you could end up with a single connector that works across multiple VR headsets.

 

The TU102 GPU

A look at the TU102 GPU is helpful in understanding exactly what has changed in Turing compared to previous generations. That's the GPU used by the flagship GeForce RTX 2080 Ti, while the TU104 and TU106 GPUs share the same basic architecture, scaled down to suit their respective models and market segments.

TU102. (Image Source: NVIDIA)

For starters, the TU102 GPU contains six Graphics Processing Clusters (GPCs). Each GPC in turn comprises six Texture Processing Clusters (TPCs) for a total of 36, and each TPC also includes two Streaming Multiprocessors (SMs) for a grand total of 72 SMs. Each SM then contains 64 CUDA cores and four texture units. 

However, all this is pretty much par for the course. Yes, the TU102 is a beast with 4,608 CUDA cores, 288 texture units, 96 render output units (ROPs) and a 384-bit memory bus width. But what's really new is its implementation of dedicated RT cores and Tensor cores, a first for a consumer GPU.

The TU102 GPU is outfitted with 72 RT cores and 576 Tensor cores, forming the fundamental underpinning of the real-time ray tracing acceleration and deep learning neural graphics that you've been hearing so much about.
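If you want to check the math, here's a quick back-of-envelope tally of how all those unit counts fall out of the cluster hierarchy, using the per-cluster and per-SM figures above (the RT and Tensor core counts work out to one and eight per SM respectively):

```python
# Tally of TU102's functional units from its cluster hierarchy,
# using the per-cluster and per-SM figures NVIDIA has published.
gpcs = 6                        # Graphics Processing Clusters
tpcs = gpcs * 6                 # 6 TPCs per GPC   -> 36 TPCs
sms = tpcs * 2                  # 2 SMs per TPC    -> 72 SMs

cuda_cores = sms * 64           # 64 CUDA cores per SM   -> 4,608
texture_units = sms * 4         # 4 texture units per SM -> 288
rt_cores = sms * 1              # 1 RT core per SM       -> 72
tensor_cores = sms * 8          # 8 Tensor cores per SM  -> 576

print(tpcs, sms, cuda_cores, texture_units, rt_cores, tensor_cores)
# 36 72 4608 288 72 576
```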

 

A new Streaming Multiprocessor

Turing features a new SM design that incorporates many of the features introduced in NVIDIA's Volta GV100 architecture. I'm not going to bore you with the nitty-gritty details, but the end result is a major revamp of the core execution data paths: the Turing SM now supports concurrent execution of FP32 and INT32 operations.

The Turing SMs have undergone quite a major redesign. (Image Source: NVIDIA)

Modern shader workloads typically have a mix of FP arithmetic instructions and simpler instructions such as integer additions for addressing and fetching data or floating point comparisons. 

In previous shader architectures, the floating point math data path would sit idle whenever a non-FP math instruction was running. Turing changes this with the addition of a second parallel execution unit next to every CUDA core that can execute these instructions in tandem with floating point math, a more efficient approach. 

Turing SMs allow for concurrent execution of floating point and integer instructions. (Image Source: NVIDIA)

According to NVIDIA, shader workloads average about 36 integer pipe instructions for every 100 floating point instructions, so being able to execute these data paths concurrently should translate into roughly 36 per cent extra throughput for floating point instructions, as the FP data path no longer has to wait for non-FP instructions to complete.
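To put rough numbers on that claim, here's a minimal sketch of the arithmetic, assuming a simplified one-instruction-per-issue-slot model (the 36-per-100 mix is NVIDIA's figure; everything else is illustrative):

```python
# Toy issue-slot model of NVIDIA's "36 INT per 100 FP" workload mix.
fp_instructions = 100
int_instructions = 36

# Single shared pipe (pre-Turing): the FP path sits idle while the
# integer instructions issue.
serial_slots = fp_instructions + int_instructions          # 136 slots

# Concurrent pipes (Turing): integer work overlaps the FP stream,
# so the FP pipe stays busy the whole time.
concurrent_slots = max(fp_instructions, int_instructions)  # 100 slots

speedup = serial_slots / concurrent_slots
print(f"{speedup - 1:.0%} more floating point throughput")  # 36%
```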

The Turing SM also features a new unified architecture for shared memory, L1, and texture caching. Each SM has 96KB of L1/shared memory that can be configured for various capacities depending on the compute or graphics workloads. In comparison, Pascal had 96KB of shared memory and two separate 24KB blocks of L1 cache. Turing's L2 cache size has also been increased to 6MB, double Pascal's 3MB.
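As a concrete picture of what that flexibility means, the sketch below lays out the per-SM numbers; the two Turing splits are the example configurations NVIDIA's Turing whitepaper describes, whereas Pascal's partition was fixed:

```python
# Per-SM on-chip memory layouts, in KB.
pascal_sm = {"shared": 96, "l1": 2 * 24}    # fixed 96KB + 48KB split

# Turing pools 96KB of L1/shared memory and picks the split to suit
# the workload (splits per NVIDIA's Turing whitepaper).
turing_sm_configs = [
    {"shared": 64, "l1": 32},   # shared-memory-heavy compute work
    {"shared": 32, "l1": 64},   # cache-heavy graphics work
]

# Chip-wide L2 cache, in MB: doubled from Pascal to Turing.
l2_cache = {"GP102": 3, "TU102": 6}
```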

Turing features a new unified architecture for shared memory, L1, and texture caching. (Image Source: NVIDIA)

NVIDIA says the combined L1 data cache and shared memory subsystem should significantly improve performance and simplify the programming and tuning required to achieve optimal application performance. 

All told, NVIDIA is claiming around a 50 per cent improvement in performance per CUDA core thanks to these changes. 

 

New shading advancements

Building on better CUDA core performance, NVIDIA is also introducing new shading techniques to better utilize the available resources. The RTX cards aren't just about boosting raw horsepower, and NVIDIA has made plenty of improvements to make things more efficient. 

The company singled out four techniques: variable rate shading (VRS), texture-space shading, multi-view rendering (MVR), and mesh shading.

To simplify things, the crux of PC graphics rendering is calculating a color value for each pixel on the screen, a process called shading. VRS lets developers control shading rates dynamically, so you can shade as little as once per 16 pixels or as often as eight times per pixel. This is a lot more efficient, as it reduces work in regions of the screen where full resolution shading wouldn't give any visible image quality benefit.
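To see why that matters, here's a rough count of pixel-shader invocations at those rates, using a hypothetical 2560 x 1440 frame and an illustrative 60/40 split between full-rate and coarse-shaded regions:

```python
# Shader invocations per frame at different VRS shading rates.
pixels = 2560 * 1440                  # hypothetical 1440p frame

full_rate = pixels                    # 1 invocation per pixel
coarse = pixels / 16                  # 1 per 16 pixels (4x4 blocks)
supersampled = pixels * 8             # up to 8 per pixel where needed

# A real frame mixes rates; say 40% of the screen (sky, fast-moving
# or peripheral regions) can take the coarse rate without visible loss.
mixed = 0.6 * full_rate + 0.4 * coarse
print(f"invocations saved: {1 - mixed / full_rate:.0%}")   # 38%
```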

VRS is a more efficient way of utilizing shader resources. (Image Source: NVIDIA)

In other words, developers can cut back on areas where you won't notice drops in quality, and improve frame rates in the process. 

There are already several classes of VRS-based algorithms, among them Content Adaptive Shading (where shading work varies with the content's level of detail), Motion Adaptive Shading (shading based on the rate of content motion), and Foveated Rendering (used in VR applications and based on eye position).

Then there's texture-space shading, where objects are shaded in something called a texture space that is saved to memory, and pixel shaders sample from that space rather than having to calculate values directly. 

This caching of shader results in memory also lets you reuse and resample them over multiple frames, so developers can avoid duplicate shading work or use different sampling approaches to improve quality.
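Here's a minimal sketch of that shade-once, sample-many idea, with a dictionary standing in for the texture-space cache and a deliberately fake shade() function:

```python
# Texture-space shading in miniature: texels are shaded lazily into
# a cache, and pixels (even in later frames) sample the cached value.
shading_cache = {}

def shade(texel):
    # Stand-in for an expensive lighting/material evaluation.
    return hash(texel) & 0xFFFFFF            # fake 24-bit color

def sample(texel):
    if texel not in shading_cache:           # shade only on first use
        shading_cache[texel] = shade(texel)
    return shading_cache[texel]              # reused thereafter

# Two frames with overlapping visibility: frame 2 re-shades only the
# texels frame 1 never touched.
frame1 = [sample((u, v)) for u in range(64) for v in range(64)]
frame2 = [sample((u, v)) for u in range(32, 96) for v in range(64)]
print(len(shading_cache))   # 6144 shades for 8192 samples
```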

Texture-space shading and MVR give developers more tools to play with. (Image Source: NVIDIA)

MVR builds on Pascal's single-pass stereo capabilities, which allowed rendering of two views in a single pass. MVR extends this to more than two views per pass, and it can do so even if the views are based on totally different origin positions or view directions.
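As a rough cost model (illustrative numbers, not benchmarks), the win is that the geometry is submitted and its view-independent work processed once, instead of once per view:

```python
# Toy model: geometry submissions with and without multi-view rendering.
vertices = 1_000_000          # hypothetical scene
views = 4                     # e.g. a wide-FOV VR headset with
                              # canted displays needing four views

per_view_passes = vertices * views    # classic approach: 4,000,000
single_pass_mvr = vertices            # MVR: geometry goes down once

print(f"{per_view_passes // single_pass_mvr}x fewer geometry submissions")
```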

Finally, mesh shading reduces the burden on your CPU during visually complex scenes with hundreds of thousands of unique objects by adding two new shader stages, Task Shaders and Mesh Shaders. This model is more flexible and allows developers to eliminate CPU draw call bottlenecks.

Mesh shading takes some of the load off your CPU in visually complex scenes. (Image Source: NVIDIA)

The Task Shader stage performs object culling to decide which elements of a scene need to be rendered. The Mesh Shader stage then determines the level of detail at which to render the visible objects, so a closer object gets a sharper, denser mesh while a farther one can afford to be less detailed.
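Here's a conceptual sketch of those two stages, written in plain Python purely for clarity; the culling test, distance thresholds, and LOD names are all hypothetical stand-ins for what a real task/mesh shader pipeline would do on the GPU:

```python
import math

def task_stage(objects, camera_pos, max_draw_distance=200.0):
    # Stand-in for the Task Shader: cull objects that won't be seen.
    # (Real task shaders would test bounding volumes, occlusion, etc.)
    return [o for o in objects
            if math.dist(o["pos"], camera_pos) <= max_draw_distance]

def mesh_stage(obj, camera_pos):
    # Stand-in for the Mesh Shader: pick a detail level by distance.
    d = math.dist(obj["pos"], camera_pos)
    return "high" if d < 10 else "medium" if d < 50 else "low"

camera = (0.0, 0.0, 0.0)
scene = [{"pos": (5.0, 0.0, 0.0)},      # close: high detail
         {"pos": (80.0, 0.0, 0.0)},     # mid-range: low detail
         {"pos": (500.0, 0.0, 0.0)}]    # beyond draw distance: culled

for obj in task_stage(scene, camera):
    print(obj["pos"], "->", mesh_stage(obj, camera))
```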

 

GDDR6 memory

As the gaming industry pushes ahead with higher resolution displays and more complex rendering techniques, memory bandwidth and size have come to play an increasingly important role in graphics performance. Not only must the GPU have sufficient memory bandwidth, it also needs a generous pool of memory to draw from to sustain high frame rates.

The TU102 GPU uses GDDR6 memory, which lays claim to faster speeds, better power efficiency, and improved noise reduction. For example, extensive clock gating is used to minimize power consumption during periods of lower utilization. 

According to NVIDIA, Turing's GDDR6 memory subsystem is capable of data rates of up to 14Gbps per pin, with 20 per cent better power efficiency than Pascal's GDDR5X memory.
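Those two figures are enough to work out the peak raw bandwidth of the full chip, and to compare it against a Pascal card:

```python
# Peak memory bandwidth = per-pin data rate x bus width / 8 bits.
gddr6_gbps, tu102_bus = 14, 384           # figures from the article
bandwidth = gddr6_gbps * tu102_bus / 8
print(f"TU102: {bandwidth:.0f} GB/s")     # 672 GB/s

# The GDDR5X-equipped GTX 1080 Ti, for comparison: 11Gbps on a
# 352-bit bus.
print(f"1080 Ti: {11 * 352 / 8:.0f} GB/s")   # 484 GB/s
```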

The combination of raw bandwidth increases and memory compression techniques reportedly gives Turing a 50 per cent increase in effective bandwidth over Pascal. (Image Source: NVIDIA)

In addition, Turing utilizes improved memory compression techniques to further increase effective bandwidth (on top of the raw bandwidth increases offered by GDDR6). 
