Note: This article was first published on 7 July 2019.
AMD's Navi architecture has seemed as distant as its namesake for the longest time, but the wait is over at last. The company's Radeon RX 5700 XT and 5700 turn a fresh page for AMD on the GPU front, ditching the longstanding Graphics Core Next (GCN) architecture in favor of a brand new design called RDNA.
According to AMD, RDNA (or Radeon DNA) is the start of a serious effort to compete at the highest echelons of the graphics card market. However, when AMD first announced its new Radeon cards ahead of E3 this year, it was obvious that the company was framing the Radeon RX 5700 XT and 5700 as the start of its return to the high-end. In other words, AMD wants to get there, but it still has some way to go, which should give you an idea of where the new Navi cards stand.
They aren't anywhere close to challenging the NVIDIA GeForce RTX 2080 Ti or even the 2080, but they are quite competitive in the mainstream segment currently occupied by the GeForce RTX 2060 Super and 2060. In fact, this positioning was a late pivot on AMD's part, with the company announcing a US$50 price drop just a day ahead of launch. AMD originally thought the Radeon RX 5700 XT would be squaring up against the GeForce RTX 2070, but with the release of the GeForce RTX 2070 Super, the company clearly decided to position it against the GeForce RTX 2060 Super instead, which is also priced at US$399.
No matter though. The mainstream segment is the largest part of the gaming GPU market and where the volume really is, so it's really good to see even more competition in that area. But before I dive into the results, here's a look at what's new with AMD's RDNA architecture and the new Radeon RX 5700 series.
It's been seven long years, and AMD is finally bidding goodbye to GCN. RDNA is supposed to be more efficient, do more work per clock cycle, and be better suited overall for gaming workloads.
At its core, it enables new instructions better suited for visual effects like volumetric lighting and features a redesigned Compute Unit (CU), improved multi-level cache hierarchy for lower latencies, and a more streamlined graphics pipeline. More specifically, AMD has rejigged the SIMD units (short for Single Instruction Multiple Data) as well, where the redesign prioritizes single-threaded performance and improves the effective IPC.
However, the biggest differences between GCN and RDNA can probably be boiled down to the changes AMD made to the CU. While AMD did introduce the Next-Generation Compute Unit (NCU) on Vega, the RDNA CU has a more gaming-oriented design.
It doubles the instruction rate of GCN and features twice the number of scalar units and schedulers, a change made specifically to create a more efficient CU for the standard graphics workloads encountered in games.
GCN also used SIMD16 units, which operate on 16 data elements at a time. The four SIMD16 units in a single GCN execution unit were good for complex compute workloads but not ideal for gaming, because they operated on a 4-cycle issue, meaning a 64-thread wavefront instruction took four clock cycles to issue rather than one. The new RDNA execution unit instead uses two SIMD32 units with single-cycle issue, which allows for higher throughput and more efficient utilization of the GPU. With RDNA, there's no need to wait for an instruction to work through all four cycles.
This reduces latency while adding greater parallelism: 64 threads can be treated as two Wave32 wavefronts and executed in a single clock cycle by the two SIMD32 units. RDNA adds support for Wave32 execution to facilitate higher performance in games, but Wave64 is still supported.
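The cycle-count difference described above can be sketched with some back-of-the-envelope arithmetic. This is purely an illustrative model built from the figures in the text, not a simulation of the actual GPU scheduler:

```python
# Illustrative model of wavefront issue latency on GCN vs RDNA.
# Figures come from the article; this is not a GPU simulator.

def cycles_to_issue(wavefront_size: int, simd_width: int) -> int:
    """Clock cycles needed to issue one wavefront instruction on a SIMD unit."""
    return wavefront_size // simd_width

# GCN: a 64-thread wavefront on a SIMD16 unit takes 4 cycles per instruction.
gcn_cycles = cycles_to_issue(64, 16)   # -> 4

# RDNA: the same 64 threads split into two Wave32 wavefronts, each issued
# in a single cycle on one of the execution unit's two SIMD32 units.
rdna_cycles = cycles_to_issue(32, 32)  # -> 1

print(f"GCN:  {gcn_cycles} cycles per wavefront instruction")
print(f"RDNA: {rdna_cycles} cycle per wavefront instruction")
```

The total number of lanes per execution unit is the same (4 × 16 vs 2 × 32); what changes is that RDNA retires a wavefront instruction every cycle instead of every four.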
Resource pooling is another important concept: it lets two adjacent CUs work together as a single Work Group Processor, which improves parallelism and allows access to up to four times the cache bandwidth.
RDNA also includes a new L1 cache design, with an extra 128KB of dedicated L1 cache for each of the four compute engines on the die. The load bandwidth from the L0 cache to the ALUs has also been doubled, reducing cache latency at each level and improving effective bandwidth. In addition, the extra L1 cache reduces demand on the L2 cache, which further helps increase bandwidth.
On top of that, the Delta Color Compression (DCC) algorithm has been improved and extended to a larger part of the cache subsystem. Shaders can now read and write compressed color data, and the display engine can read compressed data in the frame buffer without having to decompress it first. Overall, this boosts effective bandwidth throughout the GPU.