NVIDIA's new Ampere A100, 54-billion transistor GPU will revolutionize data center design

NVIDIA's A100 delivers up to 20x higher AI performance and has Multi-Instance GPU capability that lets it act as up to seven independent GPUs!

By Vijay Anand - 14 May 2020

Far from your computer screens and deep inside the ‘cloud’ is where the new computational war is being fought – right in the heart of the data centers. NVIDIA knows this and has been investing heavily in this domain to become a data center scale company with its own solutions to help overcome industry-scale problems. Following in the footsteps of the amazing Tesla P100 (Pascal) in 2016 and the Tesla V100 (Volta) in 2017, NVIDIA CEO Jensen Huang today unveiled at GTC 2020 the company's most ambitious GPU yet to re-architect the data center.

Enter the NVIDIA A100 Tensor Core GPU, the company’s first Ampere GPU architecture based product. It’s the first of its kind to pack so much elasticity and capability to solve many of the data center woes where there’s immense application diversity and it’s difficult to utilize the hardware efficiently. Simplistically, there are various kinds of servers in a data center such as a cluster of storage servers, another for general-purpose computing, and so on for inferencing, training, analytics, HPC, cloud gaming and more.

NVIDIA’s A100 is engineered to solve this by offering up to 20x higher AI training and inferencing performance, coupled with Multi-Instance GPU capability.

Here are the five breakthroughs that made the A100 Tensor Core GPU possible:-

1) The NVIDIA Ampere architecture

Built on TSMC's 7nm lithography process, this 3D chip is the largest ever produced, edging out the Tesla V100 ever so slightly (826mm² vs. 815mm²) while packing far more firepower than its predecessors with 54 billion transistors.

On it, there are 108 streaming multiprocessors (SMs) which is 35% more SMs than the Tesla V100 GPU (that had 80 SMs). This resulted in a direct scale-up in the number of CUDA cores to 6,912 vs. 5,120 on the previous generation. There is however a drop in the number of Tensor cores from 640 to 432 on the A100 GPU, but that’s easily taken care of by the other following breakthroughs listed below.
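The SM, CUDA core and Tensor core totals quoted above can be cross-checked with a little arithmetic. This is only a sketch: the per-SM figures (64 FP32 CUDA cores per SM on both chips, 8 Tensor Cores per SM on V100 and 4 per SM on A100) are assumptions taken from NVIDIA's published specifications, not from this article.

```python
# Per-SM figures are assumptions from NVIDIA's published specs, not the article.
GPUS = {
    "V100": {"sms": 80, "cuda_per_sm": 64, "tensor_per_sm": 8},
    "A100": {"sms": 108, "cuda_per_sm": 64, "tensor_per_sm": 4},
}


def totals(name):
    """Return (total CUDA cores, total Tensor Cores) for the named GPU."""
    g = GPUS[name]
    return g["sms"] * g["cuda_per_sm"], g["sms"] * g["tensor_per_sm"]


v100_cuda, v100_tensor = totals("V100")
a100_cuda, a100_tensor = totals("A100")

# Relative increase in SM count from V100 to A100, in percent.
sm_increase_pct = (GPUS["A100"]["sms"] / GPUS["V100"]["sms"] - 1) * 100

print("V100:", v100_cuda, "CUDA cores,", v100_tensor, "Tensor Cores")
print("A100:", a100_cuda, "CUDA cores,", a100_tensor, "Tensor Cores")
print(f"{sm_increase_pct:.0f}% more SMs")
```

Under these assumptions, the lower Tensor core count comes from having fewer (but more capable) Tensor Cores per SM, not from fewer SMs.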

Last but not least, memory capacity has been boosted significantly to 40GB HBM2 memory for each A100 GPU and peak memory bandwidth is now a staggering 1.6TB/s! Supplied by Samsung, the HBM2 memory is now 70% faster than the previous version used on the V100 GPU. Considering the V100’s original 16GB config (and later 32GB) with 900GB/s throughput, the memory subsystem has seen a massive upgrade and helps make Multi-Instance GPU capability a reality (more on that later).

Check out the core specs in this rundown below and compare it with the previous generations.

Full specs of the NVIDIA A100 data center GPU.

2) Third-gen Tensor Cores with TF32

TensorFloat-32 (TF32) is a new operational mode and format that is a hybrid of two existing ones: it keeps FP16’s 10 bits of mantissa precision while using the range of FP32, whose exponent is 8 bits wide. This is an ideal alternative to using FP32 for processing single-precision math that’s prevalent in AI training, deep learning and HPC applications that use matrix math to tackle tensor operations. A more compact processing format reduces memory footprint bloat and speeds up processing.
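To make the format concrete, here is a minimal sketch that emulates TF32 precision in software by keeping an FP32 value's sign bit, 8 exponent bits and top 10 mantissa bits. The helper names are hypothetical, and simple truncation is an assumption made for clarity; the hardware's actual rounding behaviour may differ.

```python
import struct


def fp32_bits(x):
    """Return the 32-bit pattern of x stored as an IEEE 754 single."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_fp32(b):
    """Interpret a 32-bit pattern as an IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def to_tf32(x):
    """Emulate TF32 precision (hypothetical helper, truncation assumed).

    FP32 has 23 mantissa bits and TF32 keeps 10, so the low 13 bits are
    cleared. The sign and the 8-bit exponent are left untouched, which is
    why the representable range matches FP32.
    """
    return bits_fp32(fp32_bits(x) & 0xFFFFE000)


# 2**-10 is the smallest step above 1.0 that survives; 2**-11 is dropped.
print(to_tf32(1.0 + 2**-10) == 1.0 + 2**-10)
print(to_tf32(1.0 + 2**-11) == 1.0)

# A value far beyond FP16's maximum (65504) is still representable.
print(to_tf32(1e38))
```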

Best of all, there’s no need for code re-writes, though new cuDNN libraries and TensorFlow frameworks are needed to support the TF32 format. With these in place, the A100 can take in existing FP32 inputs, process them using TF32 (as this is enabled by default within the GPU core) to tackle tensor operations and output the results back in FP32 format.

This has been tested to provide up to 10x speedups in single-precision workloads over the FP32 formats with more details covered in this blog page. AI performance speedups can go up to 20x when coupled with the next breakthrough, structured sparsity to improve efficiency.

Similarly, FP16 and FP64 throughput have gone up to 2.5x to bolster its math capabilities for HPC needs.

3) Structural Sparsity

This is a new efficiency technique to harness the inherent sparse nature of AI math in matrix operations to double the performance. By using a fine-grained pruning algorithm to compress the matrices (essentially removing small and zero-value entries), the GPU saves computing resources, power, memory and bandwidth. This effectively means you’ve more headroom to process denser math with more packed matrices. More detailed reading and observations over here.

While this might sound like data is being thrown away and it may lead to inaccurate outcomes, NVIDIA’s tests have documented that the sparsity approach helps maintain the accuracy on a wide variety of AI-related tasks revolving around inferencing and training. This helps boost execution speed by up to two times for image classification, object detection, language translation and training convolutional and recurrent neural networks.
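The article does not spell out the pruning pattern. As an illustration only, the sketch below assumes the 2:4 scheme NVIDIA has described for Ampere, in which two values out of every group of four are kept. The function names and the plain-list storage layout are hypothetical and are not NVIDIA's actual data format.

```python
def prune_2_of_4(row):
    """Keep the two largest-magnitude values in each group of four.

    Returns (values, positions): the kept values and their index (0-3)
    inside their group. Storage for the values is halved.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, positions = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Rank the four slots by magnitude, keep the top two, restore order.
        ranked = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = sorted(ranked[:2])
        values.extend(group[i] for i in keep)
        positions.extend(keep)
    return values, positions


def expand(values, positions, length):
    """Rebuild the pruned dense row from the compressed form."""
    dense = [0.0] * length
    for n, (v, p) in enumerate(zip(values, positions)):
        dense[(n // 2) * 4 + p] = v  # two kept values per group of four
    return dense


row = [0.9, -0.05, 0.0, 0.4, 0.01, 0.02, -0.7, 0.3]
values, positions = prune_2_of_4(row)
print(values)
print(positions)
print(expand(values, positions, len(row)))
```

Because only half the values in each group are stored and multiplied, this is where the "up to two times" figure would come from under the 2:4 assumption.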

4) Multi-Instance GPU (MIG)

Recall the problem of application and server diversity in catering to different workloads? That’s where Multi-Instance GPU (MIG) functionality in the Ampere architecture comes into play. It helps partition the A100 GPU into up to seven independent GPU instances, the equivalent of seven virtual GPUs, each with its own resources (memory, cache, streaming multiprocessors) to tackle various workloads. Here's a glimpse of the performance scaling that can be obtained:-

https://www.youtube.com/embed/KnUBrjJfccI

Scale up the GPU deployment and it will notch up data center efficiency tremendously by drastically reducing the footprint traditionally required. By right-sizing the GPU for the target workloads, MIG is also responsible for finally allowing unified AI training and inferencing acceleration on the same GPU and machine. You may refer to this blog for more reading and examples where MIG works well.

One last note about MIG is that it does not support DirectX or OpenGL running modes and will only be available in GPGPU mode. In other words, don't expect the A100 to suddenly excel in games miraculously, but MIG will work perfectly fine in most other scientific work needs where the GPU is called upon in the first place.

5) Third-gen NVLink

The Ampere GPU architecture also brings with it yet another leap in NVIDIA’s chip-to-chip communication link, NVLink. The first-gen NVLink on the Tesla P100 managed 160GB/s between GPUs, while the second-gen NVLink nearly doubled it to 300GB/s on the Tesla V100. No surprises that the third iteration of NVLink now boosts that speed to 600GB/s!

NVIDIA A100 with NVLink GPU-to-GPU connections. Note that 12 NVLink ports are now available for each A100's inter-chip communication. (Image Source: NVIDIA)

Raw throughput aside, the A100 GPU has now doubled the number of NVLink ports to 12 to better keep up with each GPU’s massive processing throughput. With so many more links and at that throughput rate, signal integrity becomes an issue. As such, for the third-gen NVLink, NVIDIA has halved the width of each link from 8 bits (on the previous generation) to 4 bits, but in return, they were able to scale the signaling speed higher. The net throughput per link is about the same, but signal integrity is improved when running more NVLink ports.
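The trade-off between link width and link speed can be checked with a rough calculation. The link counts, lane counts per direction and per-lane signaling rates below are assumptions drawn from NVIDIA's published NVLink figures, not from this article, so treat the sketch as a sanity check, not a specification.

```python
# Assumed figures: links per GPU, lanes per direction per link, and
# signaling rate per lane in Gbit/s.
NVLINK = {
    "P100 (NVLink 1)": {"links": 4, "lanes": 8, "gbit_per_lane": 20},
    "V100 (NVLink 2)": {"links": 6, "lanes": 8, "gbit_per_lane": 25},
    "A100 (NVLink 3)": {"links": 12, "lanes": 4, "gbit_per_lane": 50},
}


def per_link_gb_s(gen):
    """Bidirectional bandwidth of one link in GB/s."""
    one_way = gen["lanes"] * gen["gbit_per_lane"] / 8  # Gbit/s to GB/s
    return one_way * 2


def total_gb_s(gen):
    """Aggregate bidirectional bandwidth across all links in GB/s."""
    return per_link_gb_s(gen) * gen["links"]


for name, gen in NVLINK.items():
    print(name, per_link_gb_s(gen), "GB/s per link,", total_gb_s(gen), "GB/s total")
```

With these numbers, halving the lanes while doubling the per-lane rate leaves per-link bandwidth unchanged between the second and third generations, so the doubling to 600GB/s comes entirely from having twice as many links.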

This monster of a GPU, NVIDIA A100, is now immediately available through NVIDIA’s new DGX A100 supercomputer system that packs 8 of the A100 GPUs interconnected with NVIDIA NVLink and NVSwitches.
