
What makes the GeForce RTX 40 series go so fast?

By Vijay Anand - 7 Oct 2022


Note: This article was first published on 21 September 2022.

The Ada architecture is a quantum leap for gamers and paves the way for creators of fully simulated worlds, like Omniverse. (Image source: NVIDIA)

The makings of a next-gen GPU champ

NVIDIA just launched their jaw-dropping GeForce RTX 4090, which is powered by their new Ada Lovelace GPU architecture. Equipped with third-gen RT cores and fourth-gen Tensor Cores for massive AI uplift that also drives its next-gen DLSS 3 capability, the Ada Lovelace GPU has a lot to offer gamers with unrivalled levels of fidelity, realism and immersion. Think of 4K gaming at the highest settings and ray-tracing on all the time.

Estimated ray-traced performance of GeForce RTX 4090 vs. RTX 3090 Ti.

In broader strokes, the new GeForce RTX 4090 is twice as fast as GeForce RTX 3090 Ti in normally rasterised games (and up to four times faster for ray-traced titles) while using the same amount of power. This is quite a leap in capability, considering the last generation's GeForce RTX 3080 was 'only' twice as fast as the RTX 2080 counterpart, a non-Ti edition. Meanwhile, the GeForce RTX 4080 is billed to be at least twice as fast as RTX 3080 Ti.

The estimated performance of GeForce RTX 4080 vs. RTX 3080 Ti.

Suffice it to say, these are pretty extreme leaps in performance capabilities.

 

What makes the GeForce RTX 40 series go so fast?

(Image source: NVIDIA)

The Ada Lovelace architecture for gamers and creators builds on none other than the radically robust Hopper GPU architecture that first debuted in NVIDIA's data centre-oriented H100 GPU, and that lineage is a crucial reason why Ada Lovelace packs a wallop. With more than twice the transistor count (76 billion on AD102 versus 28 billion on GA102) and manufactured on TSMC's new 4N process node, Ada Lovelace immediately boasts massive brute-force performance through far higher clock speeds (refer to the table comparison below) and many more CUDA, Tensor and RT cores at its disposal. That is in addition to the generational leap brought about by the Hopper architecture, which advances the capabilities of each type of core.

GPUs compared

| GeForce Graphics Card | RTX 4090 | RTX 4080 (16GB) | RTX 4080 (12GB) | RTX 3090 Ti | RTX 3090 | RTX 3080 Ti | RTX 3080 |
|---|---|---|---|---|---|---|---|
| GPU | Ada Lovelace (AD102) | Ada Lovelace (AD103) | Ada Lovelace (AD104) | Ampere (GA102) | Ampere (GA102-300) | Ampere (GA102-225) | Ampere (GA102-200) |
| Process | 4N (TSMC) | 4N (TSMC) | 4N (TSMC) | 8nm (Samsung) | 8nm (Samsung) | 8nm (Samsung) | 8nm (Samsung) |
| Die size (mm²) | TBC | TBC | TBC | 628 | 628 | 628 | 628 |
| Transistors | 76 billion | TBC | TBC | 28 billion | 28 billion | 28 billion | 28 billion |
| Streaming Multiprocessors (SM) | 128 | 76 | 60 | 84 | 82 | 80 | 68 |
| CUDA cores | 16384 | 9728 | 7680 | 10752 | 10496 | 10240 | 8704 |
| Tensor Cores | 512 (Gen 4) | 304 (Gen 4) | 240 (Gen 4) | 336 (Gen 3) | 328 (Gen 3) | 320 (Gen 3) | 272 (Gen 3) |
| RT Cores | 128 (Gen 3) | 76 (Gen 3) | 60 (Gen 3) | 84 (Gen 2) | 82 (Gen 2) | 80 (Gen 2) | 68 (Gen 2) |
| Render Output Units (ROPs) | TBC | TBC | TBC | 112 | 112 | 112 | 96 |
| GPU base / boost clocks (MHz) | 2230 / 2520 | 2210 / 2510 | 2310 / 2610 | 1670 / 1860 | 1395 / 1695 | 1440 / 1710 | 1440 / 1710 |
| Memory | 24GB GDDR6X | 16GB GDDR6X | 12GB GDDR6X | 24GB GDDR6X | 24GB GDDR6X | 12GB GDDR6X | 10GB GDDR6X |
| Memory clock speed | TBC | TBC | TBC | 21GHz (2.625Gbps) | 19.5GHz (2.437Gbps) | 19GHz (2.375Gbps) | 19GHz (2.375Gbps) |
| Memory bus width | 384-bit | 256-bit | 192-bit | 384-bit | 384-bit | 384-bit | 320-bit |
| Memory bandwidth | TBC | TBC | TBC | 1,008GB/s | 936GB/s | 912GB/s | 760GB/s |
| Interface | PCIe 4.0 | PCIe 4.0 | PCIe 4.0 | PCIe 4.0 | PCIe 4.0 | PCIe 4.0 | PCIe 4.0 |
| TDP | 450W | 320W | 285W | 450W | 350W | 350W | 320W |
| Price | US$1,599 | US$1,199 | US$899 | US$1,999 | US$1,499 | US$1,199 | US$699 |
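A couple of the figures in the comparison can be sanity-checked with simple arithmetic. The sketch below is only a back-of-envelope estimate, not NVIDIA's own methodology: it assumes each CUDA core retires one fused multiply-add (two floating-point operations) per clock at the boost frequency, and it treats the table's 21, 19.5 and 19 memory figures as effective per-pin data rates in Gbps. The function names are made up for illustration.

```python
def fp32_tflops(cuda_cores, boost_mhz):
    """Peak FP32 throughput, assuming 2 FLOPs (one FMA) per core per clock."""
    return 2 * cuda_cores * boost_mhz * 1e6 / 1e12


def bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth: bus width in bytes times per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps


# Core counts, boost clocks and bus widths are taken from the table above.
rtx_4090 = fp32_tflops(16384, 2520)
rtx_3090_ti = fp32_tflops(10752, 1860)

print(f"RTX 4090:    {rtx_4090:.1f} TFLOPS (estimated)")
print(f"RTX 3090 Ti: {rtx_3090_ti:.1f} TFLOPS (estimated)")
print(f"Ratio:       {rtx_4090 / rtx_3090_ti:.2f}x")
print(f"RTX 3090 Ti bandwidth: {bandwidth_gb_s(384, 21):.0f} GB/s")
```

Under these assumptions the estimated shader throughput roughly doubles between the two flagships, which lines up with the "twice as fast" rasterisation claim, and the bandwidth formula reproduces the 1,008GB/s listed for the RTX 3090 Ti.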


Of course, the Ada Lovelace architecture wouldn't be any different from Hopper if it didn't introduce some key new features to boost gaming experiences on the GeForce RTX 40 series and keep it well ahead of the GeForce RTX 30 series. Here are those that matter:-

1) Shader Execution Reordering (SER) for efficient ray tracing performance

Much like an out-of-order execution engine, Shader Execution Reordering (SER) reorganises previously inefficient workloads so that those running the same shader program execute together. Real-time ray-traced content and games require processing how each ray interacts with different materials and light sources. As such, different shader programs have to be invoked for each pixel based on how the light traverses, bounces and continues its path. This meant that many consecutive shader workloads were completely different from one another, which required the engine to process each shader program one at a time. Here's a good representation from NVIDIA of what happens with and without SER:-

Now with SER, shaders can run far more efficiently because similar shader operations are regrouped before they execute. This helps boost shader performance by up to two times and improve in-game frame rates by up to 25%, a pretty big boost. This wasn't an issue before ray-traced workloads, as most rasterisation jobs have many similar pixels to process in the same vicinity; thus, the shader pipeline's efficiency was mostly good.

Shader Execution Reordering (SER) performance gains as estimated by NVIDIA.
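To see why regrouping helps, here is a toy model of the idea, not NVIDIA's hardware scheduler. It assumes threads run in lockstep groups of 32 and that a group has to run once for every distinct shader program among its threads. The names `shader_passes` and `reorder_by_shader`, the eight shader IDs and the random hit pattern are all made up for illustration.

```python
import random

WARP_SIZE = 32  # assumed number of threads executing in lockstep


def shader_passes(shader_ids, warp_size=WARP_SIZE):
    """Count serialised passes: one per distinct shader within each group."""
    total = 0
    for start in range(0, len(shader_ids), warp_size):
        group = shader_ids[start:start + warp_size]
        total += len(set(group))
    return total


def reorder_by_shader(shader_ids):
    """Regroup the work so threads needing the same shader sit together."""
    return sorted(shader_ids)


# Hypothetical scene: 1,024 ray hits scattered across 8 material shaders.
rng = random.Random(0)
hits = [rng.randrange(8) for _ in range(1024)]

before = shader_passes(hits)
after = shader_passes(reorder_by_shader(hits))
print(f"passes without reordering: {before}")
print(f"passes with reordering:    {after}")
```

In this model, incoherent hits force almost every group to run several different shaders in turn, while the reordered list lets most groups run just one. The real gain depends on the cost of the reordering itself, which this sketch ignores.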
 

2) Third-gen RT Cores to plough through complex ray-traced games and content

Hopper forgoes Ray Tracing (RT) cores as it's a data centre solution. Ada Lovelace, naturally, had to feature NVIDIA's RT cores to tackle performance-intensive ray tracing workloads, as the amazing worlds of ray-traced games have spoilt us ever since GeForce RTX first arrived with the 20 series of graphics cards.

NVIDIA RTX progressions gen-over-gen (Image source: NVIDIA)

Now in their third iteration, the RT cores benefit from the Ada GPU's far higher clock speeds, a new Opacity Micromap (OMM) Engine (which speeds up the alpha-tested textures often used for foliage, fences and the like) and a new Displaced Micro-Mesh (DMM) Engine (which helps to ray-trace geometrically complex scenes). Together, these deliver more than twice the ray-triangle intersection throughput (read more about the RTX acceleration basics here) and thus increase RT-TFLOP performance by over two times. On the GeForce RTX 4090, gamers and creators have access to 191 RT-TFLOPs of power as opposed to 'just' 78 RT-TFLOPs on the GeForce RTX 3090 Ti.

A sample of how the Opacity Micromap Engine helps. (Image source: NVIDIA)

Now, what do you do with so much extra ray tracing throughput? Bring on even more immersive ray-traced environments, of course. Dubbed Ray Tracing: Overdrive Mode, this takes advantage of SER, OMM and NVIDIA Real-Time Denoisers (NRD) to greatly accelerate and improve the overall quality of advanced ray-tracing workloads. NRD ensures the ray-traced output is noise-free without the performance trade-offs of previous-gen denoisers.

Even so, traditional ray-traced content only deals with a limited number of light sources. NVIDIA says that in a typical AAA RTX title, this varies between 2 and 16 'important' rays, and it goes up to 100 in old titles that have been reimagined, like Quake 2 RTX and Minecraft RTX. RTX Direct Illumination (RTXDI) now allows game environments to have millions of dynamic lights and manage their resultant shadows, all in real time, a feat that hasn't been possible until now. Here's NVIDIA's tech demo that showcases what's possible:-

Lights are made of 'true geometry' such that any object in a game can emit light and cast dynamic shadows, enabling an entirely new class of content. RTXDI is the only shadowing algorithm required and will replace all other shadow and ambient occlusion techniques. One of the earliest to feature RTXDI enhancements is Cyberpunk 2077:-

This level of immersion can be experienced by invoking the new Overdrive mode to utilise 16 rays for more gorgeous ray tracing effects than the existing Ultra and Psycho modes that utilise 8 and 10 rays, respectively. Cyberpunk 2077 also benefits from NVIDIA DLSS 3 enhancements, which we'll touch on next.
 

3) Fourth-gen Tensor Cores for big AI uplift

This uplift is mostly thanks to the groundwork laid by the Hopper GPU architecture, which debuted several pipelining advances to improve latency and data exchange between streaming multiprocessor blocks, along with new data format and instruction support and more. All of this means the Tensor Cores are able to push out up to five times more throughput, and with the Transformer Engine, the Ada GPU architecture can process up to 1.4 Tensor-petaFLOPS to accelerate AI tasks like NVIDIA's DLSS.
 

4) NVIDIA DLSS 3.0 for breakthrough performance in AI-powered graphics

Here's what DLSS 3.0 does. (Image source: NVIDIA)

DLSS 3.0 brings on even more performance uplift through entire frames being generated via a DLSS Frame Generation AI network!

To recap, NVIDIA DLSS (which first debuted with the RTX 20 series) is a big visual quality and performance booster. It works with lower-resolution images and uses a deep learning neural network, run in real time on the Tensor Cores, to crank out higher-quality visuals.

Here's where DLSS 2.0 stops, so you can see frame generation is a big theme for DLSS 3.0. (Image source: NVIDIA)

DLSS 2.0, which arrived in 2020 and carried over to the RTX 30 series, improved image quality without driving up the performance overhead, thanks to better hardware and a more generic, non-game-specific neural network that can be deployed. This meant many more games could be optimised for DLSS without the game-specific training that would otherwise have required more time and tuning by developers. The net result has been higher performance and higher quality when enabling DLSS, which is a boon for gamers seeking ever higher performance to get to the next tier of visual fidelity. Plus, over 216 games have now been optimised for DLSS 2.0, with built-in support in the Unity and Unreal engines.

The latest DLSS 3.0 relies on a new, speedy Optical Flow Accelerator. It takes in motion vectors from the game engine along with prior frames to track how a pixel in the current frame will appear in the following frame, and the DLSS neural network uses this to generate the new frames. More importantly, shadows are now reconstructed accurately thanks to an optical flow field generated by the Optical Flow Accelerator. Relying only on motion vectors (as DLSS 2.0 did) to forecast the outcome results in inaccurate shadows and lighting, because motion vectors don't factor in the camera angle or game flow that differs from the motion of the object or person of interest. For more details, jump here.

(Image source: NVIDIA)
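The basic notion of following a pixel along its motion to synthesise an in-between frame can be shown with a deliberately naive sketch. This is not how DLSS 3.0 works internally, since that relies on a neural network fed by hardware optical flow. The function name `midpoint_frame`, the one-dimensional "frame" and the per-pixel `flow` list are all hypothetical simplifications.

```python
def midpoint_frame(prev_frame, flow, background=0):
    """Forward-warp each non-background pixel halfway along its motion."""
    out = [background] * len(prev_frame)
    for x, (value, dx) in enumerate(zip(prev_frame, flow)):
        if value == background:
            continue  # nothing to move for empty pixels
        target = x + dx // 2  # position halfway between the two frames
        if 0 <= target < len(out):
            out[target] = value
    return out


# A single bright pixel at x=1 that will be at x=5 in the next frame.
prev_frame = [0, 9, 0, 0, 0, 0]
flow = [0, 4, 0, 0, 0, 0]  # per-pixel displacement to the next frame

print(midpoint_frame(prev_frame, flow))  # [0, 0, 0, 9, 0, 0]
```

A real scene has the problems this toy skips over, such as occlusions, holes left behind moving objects, and shadows that move differently from the geometry casting them.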

It's also worth noting that an optical flow unit existed on the previous-generation Ampere GPUs, but the version on Ada is of much higher quality and more than twice as fast, which helps it deliver the level of quality expected of DLSS 3.0. By combining DLSS Super Resolution (where AI constructs up to three-quarters of the frame) with DLSS Frame Generation (which churns out an entire additional frame), NVIDIA claims that DLSS 3.0 reconstructs seven-eighths of the total displayed pixels, thus massively increasing performance.
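The seven-eighths figure follows from simple arithmetic. The sketch below assumes Super Resolution renders a quarter of each frame's pixels and Frame Generation adds one fully generated frame for every rendered one. The function and parameter names are made up for illustration.

```python
from fractions import Fraction


def reconstructed_fraction(render_scale=Fraction(1, 4), generated_per_rendered=1):
    """Share of displayed pixels produced by AI rather than rendered."""
    displayed_frames = 1 + generated_per_rendered
    # Only `render_scale` of one frame in every group is rendered traditionally.
    rendered_share = render_scale / displayed_frames
    return 1 - rendered_share


print(reconstructed_fraction())  # 7/8
```

Put another way, out of every two displayed frames only a quarter of one is rendered the traditional way, which is one-eighth of the pixels shown.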

(Image source: NVIDIA)

Exactly how much faster can we expect games to perform when they have been fully reworked to take advantage of the full ray tracing pipeline and to be enriched to work with DLSS 3.0? Here's a snapshot of expectations as illustrated by NVIDIA:-

(Image source: NVIDIA)

In short, DLSS 3.0 has to be exceptionally fast and accurate, as it has to generate entire frames that stay in sync with the frames rendered traditionally through raw processing. As such, NVIDIA Reflex is now an essential part of the DLSS 3.0 experience, as the render queue in a traditional system latency pipeline introduces a lot of latency.

Even in CPU-bound titles that deal with large worlds or complex physics environments, like Microsoft Flight Simulator, DLSS Frame Generation runs entirely on the GPU via its neural network, so the GeForce RTX 40 series of graphics cards can boost the FPS by up to two times right off the bat. Here's what NVIDIA thinks you can expect from their tests:-

NVIDIA DLSS 3.0 is fully compatible with DLSS 2.0 (which made big strides with DLSS Super Resolution) and builds upon it with DLSS Frame Generation, while banking on NVIDIA Reflex. At launch, DLSS 3.0 is already supported by more than 35 games and applications. Check out more demos and examples should you need convincing, but we sure are convinced.

(Image source: NVIDIA)
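Why Frame Generation helps even when the CPU is the bottleneck can be seen in a simplified throughput model. The model and all its numbers are hypothetical: it assumes the rendered frame rate is capped by the slower of the CPU and GPU, and that Frame Generation inserts one generated frame per rendered frame at no CPU cost.

```python
def displayed_fps(cpu_fps, gpu_fps, frame_generation=False):
    """Displayed frame rate under a simple 'slowest stage wins' model."""
    rendered = min(cpu_fps, gpu_fps)
    return rendered * 2 if frame_generation else rendered


# Hypothetical CPU-bound title: the CPU can only prepare 60 frames per second.
print(displayed_fps(cpu_fps=60, gpu_fps=140))                          # 60
print(displayed_fps(cpu_fps=60, gpu_fps=280))                          # 60
print(displayed_fps(cpu_fps=60, gpu_fps=140, frame_generation=True))   # 120
```

In this model, doubling raw GPU speed changes nothing while the CPU is the limit, whereas generated frames bypass the CPU and lift the displayed rate by up to two times.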
 

5) Dual AV1 Encoders for content creators

Both the GeForce RTX 4090 and 4080 models feature NVIDIA's latest encoders that now have support for AV1 encoding, and there are two such encoders. This enables live streamers to have richer-looking streams as AV1 improves encoding efficiency by up to 40% (so you can choose to notch up your resolution while still running at the same bitrate). At the same time, video editors get to save plenty of time as their jobs can now encode up to two times faster.  

Of course, supporting software is required, so NVIDIA has collaborated with OBS Studio (the leading open-source software for video recording and live streaming), DaVinci Resolve (a visual effects and post-production video editing application) and Voukoder (a popular plug-in for Adobe Premiere Pro) to enable AV1 encoding capability in their respective software updates in October 2022.

To find out more about the benefits that creators and streamers stand to gain in 3D rendering, AI and video exports with the latest GeForce RTX 40 series, NVIDIA has more details penned here that you would want to check out. If you're an avid game modder, you'll also love the new RTX Remix tool, which lets you give old games a brand new look and feel with ray-traced goodness and turn out an RTX Remix mod. The first of these to hit the modding scene is Portal with RTX, and you can bet there will be many more rejuvenated classics.
 
