What you need to know about ray tracing and NVIDIA's Turing architecture
NVIDIA says its Turing architecture represents the next great leap in graphics technology. Here's what you need to know.
How does ray tracing compare to rasterization?
NVIDIA’s new Turing cards are supposed to represent some sort of paradigm shift. They feature dedicated RT cores for real-time ray tracing, a first for consumer graphics, and the company is claiming massive performance and visual improvements if developers code their games for the new hardware.
But what is ray tracing really? It sounds like some esoteric graphics rendering technique, but you’ve probably already enjoyed the results of it on the big screen. Modern movies use ray tracing to generate or bolster special effects, which is how you get photorealistic shadows, reflections, and refractions. It’s how filmmakers create blazing conflagrations and blend them with the real world.
Ray tracing isn't new, but doing it in real time on a consumer GPU is. (Image Source: NVIDIA)
In a nutshell, ray tracing involves following the path of light backward from your eye to the objects that the light interacts with. However, this requires immense computational power, which is why it's been restricted mostly to the post-production stages of a movie, where filmmakers can take their time and lean on render farms to render each scene.
Before Turing, real-time ray tracing was a pipe dream for games. But if NVIDIA has its way, Turing could pave the way for a new generation of games that can generate photorealistic scenes in real-time.
How do games today do it?
Video games only have a fraction of a second to render a scene, so games today generally rely on something called rasterization instead. This is a method of translating 3D objects onto a 2D screen, and technological advancements mean that it’s actually pretty good.
In rasterization, 3D models of objects are built from a mesh of virtual triangles or polygons. Each corner of a triangle is referred to as a vertex, and each vertex connects to the vertices of other triangles of different sizes and shapes.
Each vertex stores a wealth of information, including details on color, texture, and its position in space. The triangles that form these 3D models are then converted into pixels on a 2D screen, with each pixel taking on an initial color value from the data in the corresponding vertex.
Next, further processing or “shading” is performed to change the pixel color depending on how lights in the scene hit it. Textures are also applied to the pixel, which factors into the final color applied.
Ray tracing is demanding, but rasterization is no walk in the park either. Millions of polygons could be used for all the object models in just one scene, and the screen may need to be redrawn up to 90 times per second, which is a lot of pixels to process in a very short time.
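To make the rasterization pipeline concrete, here is a minimal Python sketch of the idea: perspective-project a triangle's vertices onto the screen, then give every covered pixel a color interpolated from the vertex data. The function names and coordinate conventions are invented for illustration and don't correspond to any real graphics API.

```python
def project(vertex, focal_length=1.0):
    """Perspective-project a 3D point (x, y, z) onto a 2D image plane."""
    x, y, z = vertex
    return (focal_length * x / z, focal_length * y / z)

def edge(a, b, p):
    """Signed-area test: which side of edge a->b does point p fall on?"""
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])

def rasterize_triangle(v0, v1, v2, c0, c1, c2, width, height):
    """Yield (x, y, color) for every pixel covered by the projected triangle."""
    p0, p1, p2 = project(v0), project(v1), project(v2)
    area = edge(p0, p1, p2)
    if area == 0:
        return  # degenerate (edge-on) triangle covers no pixels
    for y in range(height):
        for x in range(width):
            # Map the pixel center into the same [-1, 1] space as the points.
            p = (2 * (x + 0.5) / width - 1, 2 * (y + 0.5) / height - 1)
            w0, w1, w2 = edge(p1, p2, p), edge(p2, p0, p), edge(p0, p1, p)
            if (w0 >= 0) == (w1 >= 0) == (w2 >= 0):  # inside all three edges
                w0, w1, w2 = w0 / area, w1 / area, w2 / area
                # Barycentric interpolation of per-vertex colors: the seed
                # value that later shading and texturing steps refine.
                color = tuple(w0 * a + w1 * b + w2 * c
                              for a, b, c in zip(c0, c1, c2))
                yield x, y, color
```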
Why is ray tracing different?
Ray tracing hews more closely to how light works in the real world. Light rays hit 3D objects and bounce from one object to another before reaching our eyes. Light may also be blocked by some objects, creating shadows. Then there are reflection and refraction, both of which can be challenging to simulate in computer graphics.
Ray tracing works backward from your eye, retracing the path of light from your eye to the object. In effect, it traces a light ray’s path through each pixel on a 2D screen back into a 3D model of the scene.
A look at ray tracing versus rasterization. (Image Source: NVIDIA)
It can also capture reflections, shadows, and refraction by gathering color and lighting information at the point where a light ray intersects an object. This data contributes to determining pixel color and the level of illumination. If a ray bounces off or passes through the surfaces of different objects before reaching the light source, color and lighting information from all these objects contributes to the final pixel color as well.
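Here is a toy, self-contained rendition of that process in Python, using a sphere-only scene and names of my own invention. A production ray tracer is vastly more sophisticated, but the control flow is the same: one ray per pixel traced backward from the eye, shadow rays toward the lights, and recursive rays for reflections.

```python
import math

def dot(a, b): return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
def sub(a, b): return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def add(a, b): return (a[0]+b[0], a[1]+b[1], a[2]+b[2])
def mul(v, s): return (v[0]*s, v[1]*s, v[2]*s)
def norm(v):   return mul(v, 1.0 / math.sqrt(dot(v, v)))

def intersect(origin, d, sphere):
    """Distance along the ray to the sphere, or None on a miss."""
    center, radius = sphere[0], sphere[1]
    oc = sub(origin, center)
    b = 2.0 * dot(oc, d)
    disc = b * b - 4.0 * (dot(oc, oc) - radius * radius)
    if disc < 0.0:
        return None
    t = (-b - math.sqrt(disc)) / 2.0
    return t if t > 1e-4 else None  # epsilon avoids self-intersection

def trace(origin, d, spheres, light, depth=0):
    """Trace one ray backward from the eye; returns an RGB color.
    Each sphere is (center, radius, color, reflectivity)."""
    hits = [(t, s) for s in spheres if (t := intersect(origin, d, s)) is not None]
    if not hits or depth > 2:
        return (0.05, 0.05, 0.10)  # background color
    t, (center, _, color, refl) = min(hits)
    point = add(origin, mul(d, t))
    normal = norm(sub(point, center))

    # Shadow ray: only apply diffuse lighting if the light is visible.
    to_light = norm(sub(light, point))
    lit = not any(intersect(point, to_light, s) for s in spheres)
    diffuse = max(dot(normal, to_light), 0.0) if lit else 0.0
    shaded = mul(color, 0.1 + 0.9 * diffuse)

    # Reflection ray: bounce and recurse, blending by reflectivity.
    if refl > 0.0:
        bounce = sub(d, mul(normal, 2.0 * dot(d, normal)))
        shaded = add(mul(shaded, 1.0 - refl),
                     mul(trace(point, bounce, spheres, light, depth + 1), refl))
    return shaded
```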
These techniques, along with further refinements over the years, are why ray tracing is the technique of choice for movie-making today. It effectively captures the way light works, which is why you get scenes that are truer to life, replete with details like softer shadows and light that interacts more realistically with the environment.
What does this mean for games?
NVIDIA has so far announced 21 games that take advantage of some portion of the new Turing architecture, but only 11 actually utilize the RT cores for real-time ray tracing. Metro Exodus provides an illuminating example: with ray tracing enabled, rooms are far darker than with it turned off. In practice, this could give developers more freedom to tweak the atmospheric lighting to create their desired mood, in addition to concealing enemies in the shadows.

Similarly, in Shadow of the Tomb Raider, the shadows blend together far more realistically, and you can distinguish between different gradations of shadow, where the words penumbra and umbra take on real meaning.

Put simply, your games stand to look far better with NVIDIA's RTX technology. Unfortunately, the Turing cards are pricey, and it'll be some years before a significant portion of consumers even own a card capable of taking advantage of ray tracing, which is exactly what needs to happen before more developers get on board.
What's new in Turing?
NVIDIA GeForce RTX 2080 Ti. (Image Source: NVIDIA)
But ray tracing is only being talked about now because of NVIDIA's new Turing architecture. The Turing GPUs represent quite the departure from NVIDIA's traditional graphics architecture, featuring dedicated hardware capable of performing real-time ray tracing and deep learning operations that can help improve visual quality and performance.
Other new features include new shading advancements like mesh shading, variable rate shading, and texture-space shading.
The Turing GPUs are also the first to utilize GDDR6 memory, which provides higher bandwidth and better power efficiency.
NVIDIA added hardware support for USB-C and VirtualLink as well. VirtualLink is a new open industry standard being developed to meet the power, display, and bandwidth demands of next-generation VR headsets through a single USB-C connector. If this ends up gaining widespread adoption, you could end up with a single connector that works across multiple VR headsets.
The TU102 GPU
A look at the TU102 GPU is helpful in understanding exactly what has changed in Turing compared to previous generations. That's the GPU utilized by the flagship GeForce RTX 2080 Ti, and the TU104 and TU106 GPUs use the same basic architecture, but are scaled down to suit their respective models and market segments.
TU102. (Image Source: NVIDIA)
For starters, the TU102 GPU contains six Graphics Processing Clusters (GPCs). Each GPC in turn comprises six Texture Processing Clusters (TPCs) for a total of 36, and each TPC also includes two Streaming Multiprocessors (SMs) for a grand total of 72 SMs. Each SM then contains 64 CUDA cores and four texture units.
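The headline numbers in the next paragraph fall straight out of that hierarchy:

```python
# Multiplying out the TU102 hierarchy NVIDIA describes:
gpcs         = 6
tpcs_per_gpc = 6
sms_per_tpc  = 2

sms        = gpcs * tpcs_per_gpc * sms_per_tpc  # 6 * 6 * 2 = 72 SMs
cuda_cores = sms * 64                           # 72 * 64   = 4,608 CUDA cores
tex_units  = sms * 4                            # 72 * 4    = 288 texture units
```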
However, all this is pretty much par for the course. Yes, the TU102 is a beast with 4,608 CUDA cores, 288 texture units, 96 render output units (ROPs) and a 384-bit memory bus width. But what's really new is its implementation of dedicated RT cores and Tensor cores, a first for a consumer GPU.
The TU102 GPU is outfitted with 72 RT cores and 576 Tensor cores, forming the fundamental underpinning of the real-time ray tracing acceleration and deep learning neural graphics that you've been hearing so much about.
A new Streaming Multiprocessor
Turing features a new SM design that incorporates many of the features introduced in NVIDIA's Volta GV100 architecture. I'm not going to bore you with the nitty-gritty details, but the end result is a major revamp of the core execution data paths, where the Turing SM now supports concurrent execution of FP32 and INT32 operations.
The Turing SMs have undergone quite a major redesign. (Image Source: NVIDIA)
Modern shader workloads typically have a mix of FP arithmetic instructions and simpler instructions such as integer additions for addressing and fetching data or floating point comparisons.
In previous shader architectures, the floating point math data path would sit idle whenever a non-FP math instruction was running. Turing changes this with the addition of a second parallel execution unit next to every CUDA core that can execute these instructions in tandem with floating point math, a more efficient approach.
Turing SMs allow for concurrent execution of floating point and integer instructions. (Image Source: NVIDIA)
According to NVIDIA, shader workloads execute about 36 integer pipe instructions for every 100 floating point instructions, so executing the two data paths concurrently should translate into roughly 36 per cent more throughput for floating point instructions, since the FP data path no longer has to wait for non-FP instructions to complete.
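The back-of-the-envelope math behind that figure, assuming for simplicity that each pipe can issue one instruction per cycle:

```python
fp_instructions  = 100
int_instructions = 36   # NVIDIA's stated mix per 100 FP instructions

# One shared pipe must run them back to back; split pipes overlap them.
serial_cycles     = fp_instructions + int_instructions       # 136
concurrent_cycles = max(fp_instructions, int_instructions)   # 100

speedup = serial_cycles / concurrent_cycles  # 1.36x, i.e. ~36% more FP throughput
```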
The Turing SM also features a new unified architecture for shared memory, L1, and texture caching. Each SM has 96KB of L1/shared memory that can be configured for various capacities depending on the compute or graphics workloads. In comparison, Pascal had 96KB of shared memory and two separate 24KB blocks of L1 cache. Turing's L2 cache size has also been increased to 6MB, double Pascal's 3MB.
Turing features a new unified architecture for shared memory, L1, and texture caching. (Image Source: NVIDIA)
NVIDIA says the combined L1 data cache and shared memory subsystem should significantly improve performance and simplify the programming and tuning required to achieve optimal application performance.
All told, NVIDIA is claiming around a 50 per cent improvement in performance per CUDA core thanks to these changes.
New shading advancements
Building on better CUDA core performance, NVIDIA is also introducing new shading techniques to better utilize the available resources. The RTX cards aren't just about boosting raw horsepower, and NVIDIA has made plenty of improvements to make things more efficient.
The company singled out four techniques: variable rate shading (VRS), texture-space shading, multi-view rendering (MVR), and mesh shading.
To simplify things: the crux of PC graphics rendering is calculating a color value for each pixel on a screen, a process called shading. VRS allows developers to control shading rates dynamically, so you can shade as little as once per 16 pixels or as often as eight times per pixel. This is a lot more efficient, as it reduces work in regions of the screen where full resolution shading would not give any visible image quality benefit.
VRS is a more efficient way of utilizing shader resources. (Image Source: NVIDIA)
In other words, developers can cut back on areas where you won't notice drops in quality, and improve frame rates in the process.
There are already several classes of VRS-based algorithms, among them Content Adaptive Shading (where shading work varies on content level of detail), Motion Adaptive Shading (shading based on rate of content motion), and Foveated Rendering (used in VR applications and based on eye position).
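As a rough illustration of the content-adaptive case, a renderer might pick a shading rate per screen tile along these lines. The thresholds and inputs are made up; the rates mirror the "once per 16 pixels" to "eight times per pixel" range described above.

```python
def pick_shading_rate(tile_detail, tile_motion):
    """Return shades per pixel for one screen tile (fractions mean coarser)."""
    if tile_motion > 0.8 or tile_detail < 0.1:
        return 1 / 16  # one shade shared by a 4x4 block of pixels
    if tile_motion > 0.4 or tile_detail < 0.3:
        return 1 / 4   # one shade per 2x2 block
    if tile_detail > 0.9:
        return 8       # supersample regions of very fine detail
    return 1           # the standard one shade per pixel
```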
Then there's texture-space shading, where objects are shaded in something called a texture space that is saved to memory, and pixel shaders sample from that space rather than having to calculate values directly.
This caching of shader results in memory also lets you reuse and resample them over multiple frames, so developers can avoid duplicate shading work or use different sampling approaches to improve quality.
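In spirit, that reuse amounts to a cache keyed on texture-space locations. This sketch is my own simplification, with a stand-in shading function and a naive age-based reuse policy:

```python
shading_cache = {}  # (object_id, texel) -> (color, frame_last_shaded)

def expensive_shade(object_id, texel):
    """Stand-in for a costly material and lighting evaluation."""
    u, v = texel
    return ((u * 37 + v * 17 + object_id) % 256 / 255.0,) * 3

def sample_shaded_texel(object_id, texel, frame, max_age=4):
    """Reuse a cached shading result if it is recent enough."""
    key = (object_id, texel)
    cached = shading_cache.get(key)
    if cached is not None and frame - cached[1] <= max_age:
        return cached[0]  # reuse: zero shading work this frame
    color = expensive_shade(object_id, texel)
    shading_cache[key] = (color, frame)
    return color
```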
Texture-space shading and MVR give developers more tools to play with. (Image Source: NVIDIA)
MVR builds on Pascal's single-pass stereo capability, which allowed two views to be rendered in a single pass. MVR extends this to more than two views per pass, and it can do this even if the views are based on totally different origin positions or view directions.
Finally, mesh shading reduces the burden on your CPU during visually complex scenes with hundreds of thousands of unique objects by adding two new shader stages, Task Shaders and Mesh Shaders. This model is more flexible and allows developers to eliminate CPU draw call bottlenecks.
Mesh shading takes some of the load off your CPU in visually complex scenes. (Image Source: NVIDIA)
The Task Shader stage performs object culling to decide which elements of a scene need to be rendered. The Mesh Shader then determines the level of detail at which to render visible objects. The chosen level depends on a number of factors; closer objects are rendered in greater detail, for instance, while farther ones can afford to be less detailed.
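A rough sketch of how the two stages divide the work, with the object layout, draw distance, and LOD thresholds all invented for illustration:

```python
import math

def task_stage(objects, camera_pos, draw_distance=100.0):
    """Cull: drop objects too far away to matter. (Real task shaders also
    test against the view frustum and finer-grained visibility.)"""
    return [o for o in objects
            if math.dist(camera_pos, o["center"]) - o["radius"] < draw_distance]

def mesh_stage(visible, camera_pos):
    """Pick a mesh LOD per surviving object: nearer means denser geometry."""
    draws = []
    for o in visible:
        d = math.dist(camera_pos, o["center"])
        lod = 0 if d < 10 else 1 if d < 50 else 2
        draws.append((o["name"], o["lods"][lod]))
    return draws
```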
GDDR6 memory
As the gaming industry pushes ahead with higher resolution displays and more complex rendering techniques, memory bandwidth and size have come to play an increasingly important role in graphics performance. Not only must the GPU have sufficient memory bandwidth, it also needs a generous pool of memory to draw from to sustain high frame rates.
The TU102 GPU uses GDDR6 memory, which lays claim to faster speeds, better power efficiency, and improved noise reduction. For example, extensive clock gating is used to minimize power consumption during periods of lower utilization.
According to NVIDIA, Turing's GDDR6 memory subsystem is capable of delivering up to 14Gbps per pin and 20 per cent better power efficiency than Pascal's GDDR5X memory.
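Those figures pin down the peak bandwidth of a fully enabled TU102:

```python
pin_rate_gbps = 14   # GDDR6 data rate per pin
bus_width     = 384  # full TU102 memory interface, in bits

bandwidth_gb_s = pin_rate_gbps * bus_width / 8  # 672 GB/s peak
```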
The combination of raw bandwidth increases and memory compression techniques reportedly gives Turing a 50 per cent increase in effective bandwidth over Pascal. (Image Source: NVIDIA)
In addition, Turing utilizes improved memory compression techniques to further increase effective bandwidth (on top of the raw bandwidth increases offered by GDDR6).
Real-time ray tracing acceleration
The new RT cores in each SM are at the heart of Turing's ray tracing acceleration capabilities. I already talked briefly about what ray tracing is at the beginning of the article, and now it's time to look at how Turing enables it all.
However, while Turing GPUs do enable real-time ray tracing, the number of primary or secondary rays cast per pixel or surface location varies based on many factors, such as scene complexity, resolution, and how powerful the GPU is. This means you shouldn't expect hundreds of rays to be cast per pixel in real-time.
Instead, NVIDIA says far fewer rays are actually needed per pixel when taking advantage of the RT cores' real-time ray tracing acceleration and advanced de-noising filtering techniques.
The crux of the matter is the traversal of a Bounding Volume Hierarchy, or BVH. This is basically a method for optimizing intersection calculations, where objects are enclosed within larger, simpler bounding volumes.
A graphical illustration of how BVH works. (Image Source: NVIDIA)
GPUs without dedicated ray tracing hardware would need to perform the process of BVH traversal using shader operations, requiring thousands of instruction slots per ray cast to check against successively smaller bounding boxes in a BVH structure until possibly hitting a polygon. The color at the point of intersection would then contribute to the final pixel color.
Many thousands of instruction slots would be required per ray, an extremely computationally intensive task. (Image Source: NVIDIA)
In short, it's extremely computationally intensive and impossible to do on GPUs in real-time without hardware-based ray tracing acceleration.
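To see why, consider the loop a shader would otherwise have to run in software: repeatedly test the ray against nested bounding boxes, descend only into boxes that are hit, and test triangles only in the leaves the ray actually reaches. The node layout below is invented for illustration, and the ray direction is assumed to have no zero components.

```python
def ray_hits_box(origin, inv_dir, box_min, box_max):
    """Slab test: does the ray pass through this axis-aligned box?
    inv_dir holds 1/d per axis, precomputed once per ray."""
    t_near, t_far = 0.0, float("inf")
    for axis in range(3):
        t1 = (box_min[axis] - origin[axis]) * inv_dir[axis]
        t2 = (box_max[axis] - origin[axis]) * inv_dir[axis]
        t_near = max(t_near, min(t1, t2))
        t_far = min(t_far, max(t1, t2))
    return t_near <= t_far

def traverse(node, origin, inv_dir):
    """Return candidate triangles, pruning subtrees whose box is missed."""
    if not ray_hits_box(origin, inv_dir, node["min"], node["max"]):
        return []                 # the whole subtree can be skipped
    if "triangles" in node:       # leaf: these go to ray-triangle testing
        return node["triangles"]
    return (traverse(node["left"], origin, inv_dir) +
            traverse(node["right"], origin, inv_dir))
```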
RT cores
NVIDIA's solution is to have the Turing RT cores handle all the BVH traversal and ray-triangle intersection testing, which saves the SMs from spending thousands of instruction slots per ray.
Each RT core comprises two specialized units. The first carries out the bounding box tests, while the second performs ray-triangle intersection tests and reports hits back to the SM. This frees up the SM to do other graphics or compute work.
The Turing RT cores process all the BVH traversal and ray-triangle intersection testing. (Image Source: NVIDIA)
NVIDIA NGX
Turing's final highlight is NVIDIA NGX, which is a new deep learning technology stack that is a part of NVIDIA's RTX platform. NGX utilizes deep neural networks to perform AI-based functions capable of accelerating and enhancing graphics, among other things.
NGX relies on the Turing Tensor cores for deep learning-based operations and does not work on architectures prior to Turing.
These are just some of the applications of deep learning. (Image Source: NVIDIA)
Tensor cores
Turing Tensor cores are an improved version of those first introduced in the Volta GV100 GPU. (Image Source: NVIDIA)
Turing uses an improved version of the Tensor cores first introduced in the Volta GV100 GPU. For instance, INT8 and INT4 precision modes have been added for inferencing workloads that can tolerate quantization, while FP16 remains fully supported for workloads that require higher precision.
According to NVIDIA, the Turing Tensor cores significantly speed up matrix operations and are used for both deep learning training and inference operations, in addition to new neural graphics functions.
Of these, the Tensor cores excel in particular at inference computations, where useful information is inferred and delivered by a trained deep neural network based on a specified input. This includes things like identifying images of friends in Facebook photos and real-time translations of human speech, but gamers are probably most interested in the Tensor cores' ability to improve image quality and produce better looking games with a smaller performance hit.
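At their heart, Tensor cores accelerate one operation: a fused matrix multiply-accumulate, D = A × B + C, applied over small matrix tiles. A plain-Python rendition of that single operation:

```python
def matrix_multiply_accumulate(A, B, C):
    """Compute D = A @ B + C for small square matrices (Tensor cores
    perform the equivalent on 4x4 tiles in a single hardware operation)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j]
             for j in range(n)] for i in range(n)]
```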
Deep Learning Super Sampling (DLSS)
NGX encompasses many things, but the one NVIDIA gave the most attention to is something called deep learning super sampling, or DLSS. This can be thought of as a new method of anti-aliasing that helps reduce jagged lines and prevent blocky images. However, the key difference is that it doesn't run on the shader cores, which frees them up to do other work.
In a sense, this is free AA, where you get better looking graphics without the usual performance hit. Turning MSAA on in a game like Deus Ex: Mankind Divided is enough to cripple some of the most powerful systems, and DLSS offers a possible way around that.
In modern games, rendered frames go through post-processing and image enhancements that combine input from multiple rendered frames in order to remove visual artifacts such as aliasing while still preserving detail.
DLSS applies AI to this process and is supposedly capable of a much higher quality than temporal anti-aliasing (TAA), a shader-based algorithm that combines two frames using motion vectors to determine where to sample the previous frame.
NVIDIA says DLSS should look better than TAA and still have less of a performance hit. (Image Source: NVIDIA)
TAA renders at the final target resolution and then combines frames, losing detail in the process. However, NVIDIA says DLSS permits faster rendering at a lower input sample count, and infers a result that should rival TAA but requires approximately half the shading work in the process.
The interesting part is that DLSS can be "trained", where it learns how to produce the desired output based on large numbers of super high quality images. NVIDIA says it collected reference images rendered using 64x super sampling, where each pixel is shaded at 64 different offsets instead of just one. This results in a high level of detail and excellent anti-aliasing results.
DLSS can be trained against images rendered using 64x super sampling. (Image Source: NVIDIA)
The DLSS network is then trained by trying to match the 64x SS output frames with its own, measuring the differences between the two, and making the necessary adjustments.
Eventually, DLSS learns to produce results that come close to that of 64x SS, while avoiding problems that can arise in more challenging scenes, such as blurring, disocclusion (where a previously occluded object becomes visible), and unwanted transparency.
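As a heavily simplified stand-in for that training loop, the toy below fits a single-weight "network" that learns to map a noisy, low-sample render toward a clean reference by gradient descent on the squared error. Every name and function here is a placeholder; NVIDIA's actual pipeline trains a deep neural network on real rendered frames.

```python
import random

def render_low_sample(scene):  # stand-in: fast but noisy, aliased estimate
    return [p + random.uniform(-0.2, 0.2) for p in scene]

def render_64x_ss(scene):      # stand-in: pristine 64x SS ground truth
    return list(scene)

def train(scenes, epochs=50, lr=0.1):
    weight = 0.0  # the entire "network": output = weight * input
    for _ in range(epochs):
        for scene in scenes:
            x, ref = render_low_sample(scene), render_64x_ss(scene)
            out = [weight * xi for xi in x]
            # Gradient of mean squared error between output and reference.
            grad = sum(2 * (o - r) * xi for o, r, xi in zip(out, ref, x)) / len(x)
            weight -= lr * grad  # adjust toward the 64x SS target
    return weight  # converges toward ~1: reproduce the reference

print(train([[0.1, 0.5, 0.9], [0.3, 0.7, 0.2]]))
```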
The biggest bonus is that RTX cards will supposedly run up to twice as fast as previous-generation GPUs using conventional anti-aliasing, assuming the game supports DLSS.
NVIDIA is making some big claims about DLSS. (Image Source: NVIDIA)
And that's really the biggest caveat with NVIDIA's RTX cards. The new features all sound great on paper, but that's only if there's robust developer support in the long run. Still, things are looking up, with nine new games announcing support for DLSS today.
Developers will need to support DLSS for it to really take off. (Image Source: NVIDIA)