Preview: AMD Vega next-generation GPU architecture
Vega 10 preview
Much has been speculated about AMD’s next-generation graphics chip ever since GDC 2016 when the Vega codename was officially unveiled. The company is set to publicly reveal more information at CES 2017 happening in Las Vegas right now, but if you’ve been following the Vega trail, you’re bound to have stumbled across the following word cloud people have cleverly pulled from AMD’s ve.ga teaser website.
Now, we actually got a sneak preview of the architecture back in December 2016 (which we couldn’t talk about then because of, you know, non-disclosure agreements and all). That word cloud basically summarizes everything AMD has shared with us so far, but now we can put into context what those terms mean.
Most of the features refer to the HBM2 memory that Vega 10 will be using, which we already know about from the GDC reveal last year. So 2X Bandwidth Per Pin and 8X Capacity Per Stack describe improvements of HBM2 over HBM (which you can read more about here). The 4X Power Efficiency claim is likely a comparison of HBM2 vs GDDR5 instead. HBM2 will likely be more power efficient than HBM relative to bandwidth, but we don’t think you’ll see a 4X improvement there. We’ll probably know more in the coming days.
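To put the 2X and 8X figures in context, here is a quick back-of-the-envelope calculation. It assumes the commonly quoted per-stack numbers from the public HBM and HBM2 specifications (a 1024-bit interface, 1 Gbps vs 2 Gbps per pin, 1GB vs up to 8GB per stack); none of these values come from AMD's Vega material, and actual Vega cards may ship with slower or smaller stacks.

```python
# Per-stack figures assumed from the public HBM/HBM2 specifications,
# not from AMD's Vega preview. Shipping products may use lower values.
BUS_WIDTH_BITS = 1024  # interface width of a single HBM or HBM2 stack


def stack_bandwidth_gb_s(gbit_per_pin: float, bus_width_bits: int = BUS_WIDTH_BITS) -> float:
    """Peak bandwidth of one stack in GB/s: per-pin rate times pin count, divided by 8 bits."""
    return gbit_per_pin * bus_width_bits / 8


hbm1 = {"gbit_per_pin": 1.0, "capacity_gb": 1}  # first-generation HBM
hbm2 = {"gbit_per_pin": 2.0, "capacity_gb": 8}  # HBM2 at its specified maximum

bw1 = stack_bandwidth_gb_s(hbm1["gbit_per_pin"])
bw2 = stack_bandwidth_gb_s(hbm2["gbit_per_pin"])
bandwidth_ratio = bw2 / bw1
capacity_ratio = hbm2["capacity_gb"] / hbm1["capacity_gb"]

print(f"HBM : {bw1:.0f} GB/s per stack, {hbm1['capacity_gb']} GB per stack")
print(f"HBM2: {bw2:.0f} GB/s per stack, {hbm2['capacity_gb']} GB per stack")
print(f"Bandwidth per pin: {bandwidth_ratio:.0f}X, capacity per stack: {capacity_ratio:.0f}X")
```

Under those assumptions, the word cloud's two HBM2 claims fall straight out of the spec sheet rather than anything Vega-specific.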
Here’s where it gets interesting. High Bandwidth Cache (HBC) and High Bandwidth Cache Controller (HBCC) are just AMD’s terminology for the HBM2 memory and the memory controller on the Vega chip. However, if you look at the architecture diagram below, the HBCC connects the HBC and L2 cache directly to all types of system memory, theoretically allowing the Vega chip to scale up to (you guessed it) 512TB of accessible virtual address space. Not only that, in line with AMD’s new approach to machine intelligence and efficiency, the HBCC is supposedly quite smart at adapting to an application’s frame buffer needs, reducing memory allocation waste and freeing up unused memory. All of this means that a Vega-based card will not only be able to handle much larger data sets, but do so more efficiently as well.
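For a sense of scale, the 512TB figure can be turned into address bits with a few lines of arithmetic. This is only a sketch: it assumes "TB" is meant in the binary sense (2^40 bytes), and the 8GB of local HBM2 used for comparison is a hypothetical card configuration, not something AMD has announced.

```python
# Assumption: "512TB" is meant in binary units (1 TB = 2**40 bytes here).
TIB = 2 ** 40
GIB = 2 ** 30

address_space_bytes = 512 * TIB

# The size is an exact power of two, so bit_length() - 1 gives the number
# of address bits needed to reach every byte.
address_bits = address_space_bytes.bit_length() - 1

# Hypothetical card with 8 GB of local HBM2 (not an announced configuration).
hypothetical_local_memory = 8 * GIB
ratio = address_space_bytes // hypothetical_local_memory

print(f"512TB = 2**{address_bits} bytes, i.e. a {address_bits}-bit virtual address")
print(f"That is {ratio:,} times larger than a hypothetical 8GB of on-package memory")
```

In other words, the on-package HBM2 would act as a small, fast window onto a far larger virtual address space, which is presumably why AMD prefers to call it a cache.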
Next, let’s move on to the updated geometry engine in Vega. A geometry engine in a graphics rendering pipeline normally consists of a Vertex Shader (which processes individual vertices or calculates vertices as input for the Geometry Shader) and a Geometry Shader (which processes whole primitives or creates new ones using the output of the Vertex Shader as input). Vega’s new geometry shader pipeline basically identifies all primitives and processes them in one go. With this change, AMD claims improved load balancing and more than 2X Peak Throughput per Clock.
Then you’ve got the Next Generation Pixel Engine. As far as features go, what’s new is the Draw Stream Binning Rasterizer. In a nutshell, it’s an intelligent rasterizer that is designed to improve performance while saving power and bandwidth. It does this by using an on-chip bin cache to fetch primitive batches, identifying pixels that will not be shown on the screen and culling them before they are sent to the shaders. This should heavily cut down on shading work, especially in modern games with complex scenes and tons of overdraw. If you’re the real technical type, you might be interested in this patent application AMD filed in August 2016, which likely describes in depth how the Draw Stream Binning Rasterizer works.
Excerpt from U.S. Patent Application 20160371873 "Hybrid Render with Preferred Primitive Batch Binning and Sorting":
Pixels are produced by rendering graphical objects in order to determine color values for respective pixels. Example graphical objects include points, lines, polygons, and three-dimensional (3D) higher order surfaces. Points, lines, and polygons represent rendering primitives which are the basis for most 3D rendering instructions. More complex structures, such as 3D objects, are formed from a combination or a mesh of such primitives. To display a particular scene, the primitives with potential contributing pixels associated with the scene are rendered individually by determining pixels that fall within the edges of the primitives, and obtaining the attributes of the primitives that correspond to each of those pixels.
Because there are often thousands, millions, or even hundred millions of primitives in a 3D scene, the complete rasterization of each primitive individually can result in less than optimal system performance while rendering complex 3D images on a display screen. Such conventional graphics systems suffer from repeated color and depth value reads and writes from memory as the rasterization process moves from one primitive to the next. Immediate shading of rasterized pixels can result in unnecessary processing overhead and overall inefficient use of system memory bandwidth.
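To make the overdraw problem concrete, here is a toy sketch of the general idea: shade each primitive as it arrives, versus binning primitives into screen tiles, resolving visibility per tile, and only then shading. This is not AMD's implementation or the method in the patent. The `Rect` primitive, the tile size and the three-object scene are all made up for illustration, and the primitives are axis-aligned rectangles rather than triangles to keep the code short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle standing in for a primitive (toy simplification)."""
    x0: int
    y0: int
    x1: int  # covers x0 <= x < x1
    y1: int  # covers y0 <= y < y1
    depth: float  # smaller means closer to the camera
    color: str


def covers(p: Rect, x: int, y: int) -> bool:
    return p.x0 <= x < p.x1 and p.y0 <= y < p.y1


def immediate_mode(prims, width, height):
    """Shade primitives in submission order, depth-testing only against earlier ones."""
    depth, image, shaded = {}, {}, 0
    for p in prims:
        for y in range(max(p.y0, 0), min(p.y1, height)):
            for x in range(max(p.x0, 0), min(p.x1, width)):
                if p.depth < depth.get((x, y), float("inf")):
                    depth[(x, y)] = p.depth
                    image[(x, y)] = p.color
                    shaded += 1  # work that may be overwritten later (overdraw)
    return image, shaded


def binned(prims, width, height, tile=4):
    """Bin primitives per tile, find the nearest one per pixel, then shade once."""
    image, shaded = {}, 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            # Binning step: keep only primitives whose bounds overlap this tile.
            batch = [p for p in prims
                     if p.x0 < tx + tile and p.x1 > tx and p.y0 < ty + tile and p.y1 > ty]
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    hits = [p for p in batch if covers(p, x, y)]
                    if hits:
                        nearest = min(hits, key=lambda p: p.depth)
                        image[(x, y)] = nearest.color
                        shaded += 1  # exactly one shade per visible pixel
    return image, shaded


# Hypothetical 8x8 scene submitted back to front, the worst case for overdraw.
scene = [
    Rect(0, 0, 8, 8, 0.9, "sky"),
    Rect(0, 4, 8, 8, 0.5, "ground"),
    Rect(2, 2, 6, 6, 0.1, "crate"),
]

image_immediate, shaded_immediate = immediate_mode(scene, 8, 8)
image_binned, shaded_binned = binned(scene, 8, 8)

print("same image:", image_immediate == image_binned)
print("pixels shaded, immediate:", shaded_immediate)
print("pixels shaded, binned   :", shaded_binned)
```

Both paths produce the same picture, but the binned path never shades a pixel that ends up hidden. In real hardware the saving depends on how much of a scene fits in the on-chip bin cache, which this sketch does not model.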
Another point to note is that the pixel engine now has direct access to the L2 cache, which means the entire rendering pipeline has coherent memory access. According to AMD, this will help improve deferred shading operations. The diagram below shows older architectures compared to Vega.
Lastly, we look at the heart of the new chip itself: the Next Generation Compute Engine, Vega NCU and Rapid Packed Math. Now, Next Generation Compute Engine in the word cloud refers to the Vega NCU (which itself is short for Next-generation Compute Unit), and Rapid Packed Math refers to the Vega NCU’s native support for dual 16-bit half-precision floating-point operations. This means the Vega NCU can perform two 16-bit ops at the same time in place of one 32-bit op. While Vega is the first consumer PC GPU to support this, it isn’t actually the first AMD graphics hardware to do so. The Sony PlayStation 4 Pro has this capability. Wait, isn’t the PS4 Pro GPU based on Polaris? Well yes, but it has been modified. A hands-on deep dive by Eurogamer back in October 2016 on the first native 4K PS4 Pro title, Mantis Burn Racing, confirms this.
Excerpt from Eurogamer Digital Foundry:
Of course, we already knew that the Pro graphics core implements a range of new instructions - it was part of the initial leak - but we didn't really know exactly what they could actually do. As we understand it, with the new enhancements, it's possible to complete two 16-bit floating point operations in the time taken to complete one on the base PS4 hardware. The end result from the new Radeon technology is the additional throughput required to making Mantis Burn Racing hit its 4K performance target, though significant shader optimisation was required on the part of the developer.
In short, there's more to PS4 Pro's enhancements than teraflop comparisons suggest - and we understand that there are more 'secret sauce' features still to be revealed. At the PlayStation Meeting, Sony staff told me that the enhancements made to the core hardware go beyond the checkerboard upscaling technology, and the new instructions certainly support Mark Cerny's assertion that the PS4 Pro possesses graphics features not found in AMD's current Polaris line of GPUs. Interesting stuff, and we look forward to learning more.
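The "two 16-bit ops in place of one 32-bit op" idea is easier to see in code. The sketch below only models the data layout, with two half-precision values sharing one 32-bit register and a single operation updating both lanes. The function names are made up for illustration, the lane ordering is an assumption, and this is ordinary Python emulation rather than anything resembling how Vega or the PS4 Pro actually executes packed instructions.

```python
import struct


def pack_half2(lo: float, hi: float) -> int:
    """Pack two FP16 values into one 32-bit word (lo in bits 0-15, hi in bits 16-31)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack_half2(word: int) -> tuple:
    """Split a 32-bit word back into its two FP16 lanes as Python floats."""
    return struct.unpack("<2e", struct.pack("<I", word))


def packed_add(a: int, b: int) -> int:
    """Model one packed instruction: add both FP16 lanes of two 32-bit registers."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    # Re-packing rounds each sum back to half precision.
    return pack_half2(a_lo + b_lo, a_hi + b_hi)


reg_a = pack_half2(1.5, -2.0)
reg_b = pack_half2(0.25, 4.0)
result = unpack_half2(packed_add(reg_a, reg_b))

print(f"reg_a = 0x{reg_a:08X}, reg_b = 0x{reg_b:08X}")
print("one packed add, two results:", result)
```

The catch, as the Eurogamer excerpt hints, is that the code has to tolerate half precision in the first place, which is why shader optimisation work is needed before a game sees any benefit.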
AMD has also revealed that the Vega NCU is not only designed to run at higher clock speeds (finally, a boon for overclockers), but that the instructions per clock have been doubled as well.
While no actual details were revealed back in December on card NCU configurations or clock speeds, the Vega NCU is supposed to be able to perform 128 FP32 ops per clock (or 256 FP16 ops per clock). However, if you check out our Radeon Instinct preview (I’ve extracted the slide here for convenience), you’ll notice the passively cooled, Vega-based Radeon MI25 card will have a theoretical 25 teraflops of FP16 performance, going by its name (12.5 teraflops FP32). Now, those are already some monster numbers. We can’t wait for gaming card details to start popping up.
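Those two sets of numbers can be tied together with some simple arithmetic. The sketch below takes the per-NCU rates and the MI25 teraflops figures quoted above and works out what clock speed they would imply. The NCU count of 64 is purely our assumption for illustration, since AMD has not disclosed any configuration, so treat the resulting clock as a what-if rather than a leak.

```python
# From AMD's preview: per-NCU throughput per clock.
FP32_OPS_PER_CLOCK_PER_NCU = 128
FP16_OPS_PER_CLOCK_PER_NCU = 256

# From the Radeon Instinct MI25 slide.
MI25_FP16_TFLOPS = 25.0
MI25_FP32_TFLOPS = 12.5

# ASSUMPTION: AMD has not revealed an NCU count; 64 is a guess for illustration.
HYPOTHETICAL_NCU_COUNT = 64


def peak_tflops(ncus: int, clock_ghz: float, ops_per_clock: int) -> float:
    """Theoretical peak: NCUs x ops per clock x clock (GHz), converted from GFLOPS."""
    return ncus * clock_ghz * ops_per_clock / 1000.0


def implied_clock_ghz(tflops: float, ncus: int, ops_per_clock: int) -> float:
    """Clock speed needed to hit a given peak with a given NCU count."""
    return tflops * 1000.0 / (ncus * ops_per_clock)


clock = implied_clock_ghz(MI25_FP32_TFLOPS, HYPOTHETICAL_NCU_COUNT, FP32_OPS_PER_CLOCK_PER_NCU)
fp16_check = peak_tflops(HYPOTHETICAL_NCU_COUNT, clock, FP16_OPS_PER_CLOCK_PER_NCU)

print(f"Implied clock with {HYPOTHETICAL_NCU_COUNT} NCUs: {clock:.2f} GHz")
print(f"FP16 peak at that clock: {fp16_check:.1f} TFLOPS")
```

Whatever the real NCU count turns out to be, the product of NCU count and clock speed has to stay the same to reach 12.5 teraflops FP32, so fewer NCUs would need an even higher clock.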