NVIDIA DGX A100 supercomputer is half the cost and size with double the performance! How does it do it?

By Vijay Anand - on 15 May 2020, 4:54pm

NVIDIA's DGX A100 supercomputer is the ultimate instrument to advance AI and fight Covid-19

Note: This article was first published on 15 May 2020.

(Image source: NVIDIA)

If the new Ampere architecture-based A100 Tensor Core data center GPU is the component responsible for re-architecting the data center, NVIDIA’s new DGX A100 AI supercomputer is the ideal enabler to revitalize data centers. With 5 petaflops of AI performance, it packs the power and capabilities of an entire data center into a single machine. And given the advancements of the new Ampere A100 GPU, this is not just a marketing statement.

NVIDIA DGX A100 is the ultimate instrument for advancing AI. NVIDIA DGX is the first AI system built for the end-to-end machine learning workflow — from data analytics to training to inference. And with the giant performance leap of the new DGX, machine learning engineers can stay ahead of the exponentially growing size of AI models and data. - Jensen Huang, founder and CEO of NVIDIA

What’s under the hood of the DGX A100

The DGX A100 is NVIDIA’s third iteration of its supercomputing unit, and it’s packed with eight of NVIDIA’s latest Ampere-based A100 GPUs. They communicate via six upgraded NVSwitches (which, mind you, were already impressive in their last iteration) that match up with the speedier third-gen NVLinks on the new GPUs. The second-generation NVSwitch interconnect fabric now boasts an inter-GPU bandwidth of 600GB/s per GPU (thanks to the speedier NVLinks on the A100) and brings the total inter-GPU communication bandwidth to 4.8TB/s across all the GPUs in the DGX A100 supercomputer.

While that’s the same total as on the DGX-2, the DGX A100 achieves it with far fewer components. The NVSwitch interconnect fabric does, however, theoretically allow scaling further to support 16 GPUs and 16 NVSwitches, which would then bring the total inter-GPU communication bandwidth to 9.6TB/s.
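The bandwidth totals above are simple multiples of the per-GPU NVLink figure. Here is a minimal Python sketch of that arithmetic, using only the numbers quoted in this article; the function and constant names are illustrative and nothing here queries real hardware.

```python
# Aggregate inter-GPU bandwidth from the figures quoted in the article.
# Names are illustrative assumptions, not part of any NVIDIA tool.

PER_GPU_NVLINK_GB_S = 600  # GB/s per A100 over third-gen NVLink


def total_bandwidth_tb_s(gpu_count, per_gpu_gb_s=PER_GPU_NVLINK_GB_S):
    """Sum per-GPU bandwidth and convert GB/s to TB/s (decimal: 1 TB = 1000 GB)."""
    return gpu_count * per_gpu_gb_s / 1000


dgx_a100_tb_s = total_bandwidth_tb_s(8)      # the shipping 8-GPU DGX A100
theoretical_tb_s = total_bandwidth_tb_s(16)  # the theoretical 16-GPU fabric

print(f"8 GPUs:  {dgx_a100_tb_s} TB/s")
print(f"16 GPUs: {theoretical_tb_s} TB/s")
```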

The NVSwitch interconnect fabric can theoretically go up to 16 switches to service 16 GPUs. However, the DGX A100 is so powerful that it requires only half of the components of the DGX-2 and yet leapfrogs it in performance and capability.

The eight GPUs combined bring 320GB of total GPU memory to the system, using higher-speed HBM2 memory from Samsung. Although total GPU memory is down from 512GB on the DGX-2, the faster memory on the new DGX A100 helps close the gap, giving it a peak memory throughput of 12.4TB/s (just a little less than the 14.4TB/s of the DGX-2).
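To see why halving the GPU count costs so little aggregate throughput, it helps to divide the system totals by the number of GPUs. The sketch below does that with the figures quoted in this article; the dictionary layout and names are illustrative assumptions.

```python
# Per-GPU memory figures derived from the system totals quoted in the article.
# The data layout and names are illustrative assumptions.

systems = {
    "DGX-2":    {"gpus": 16, "total_mem_gb": 512, "total_bw_tb_s": 14.4},
    "DGX A100": {"gpus": 8,  "total_mem_gb": 320, "total_bw_tb_s": 12.4},
}


def per_gpu(spec):
    """Return (memory in GB, memory bandwidth in TB/s) for a single GPU."""
    return spec["total_mem_gb"] / spec["gpus"], spec["total_bw_tb_s"] / spec["gpus"]


for name, spec in systems.items():
    mem_gb, bw_tb_s = per_gpu(spec)
    print(f"{name}: {mem_gb:.0f} GB and {bw_tb_s:.2f} TB/s per GPU")
```

On these numbers each A100 carries 40GB and about 1.55TB/s of memory bandwidth, against 32GB and 0.9TB/s per GPU in the DGX-2.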

Out-of-the-box connectivity for scaling up data center capabilities with more DGX supercomputers comes courtesy of NVIDIA’s recent Mellanox acquisition, which allows it to use high-speed Mellanox HDR 200Gbps interconnects. That is twice the throughput of the 100Gbps InfiniBand/100GbE links offered on the DGX-2. The DGX A100 offers 8x single-port Mellanox ConnectX-6 HDR InfiniBand/200GbE adapters and, with clustering, supports a total peak interconnect performance of 200GB/s! It also has a single dual-port ConnectX-6 for data and storage networking needs.
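Note that the per-port rate is quoted in gigabits per second while the total is in gigabytes per second. A small sketch of the conversion, assuming the total simply sums the eight single-port adapters (names are illustrative):

```python
# Convert the per-port link rate (gigabits/s) into aggregate gigabytes/s.
# Figures come from the article; names are illustrative assumptions.

BITS_PER_BYTE = 8


def aggregate_gb_per_s(ports, gbit_per_s_per_port):
    """Total throughput in gigabytes per second across all ports."""
    return ports * gbit_per_s_per_port / BITS_PER_BYTE


cluster_gb_s = aggregate_gb_per_s(ports=8, gbit_per_s_per_port=200)
print(f"8 x 200Gbps = {cluster_gb_s:.0f} GB/s")
```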

In a surprising move, NVIDIA’s latest supercomputer dumps Intel for AMD’s EPYC 7742, a 64-core server processor! This speaks volumes about AMD’s rapid advancement in the server and data center scene and NVIDIA’s confidence in its supply chain. There are two of them on the motherboard within the DGX A100, so that’s a total of 128 CPU cores, paired with 1TB of system memory. Storage is handled by dual 1.92TB M.2 NVMe drives that host the OS, while non-OS storage comes to a total of 15TB using quad 3.84TB U.2 NVMe drives.

 

Here’s what makes the NVIDIA DGX A100 even more impressive

So almost all the quantity, frequency, bandwidth and throughput figures are a fair bit higher. We stress ‘almost’ because in some ways the DGX A100 comes with less, which helps bring its sticker price far lower than you would imagine. The previous AI supercomputer, the DGX-2, costs a whopping US$399,000 and puts out 2 petaflops of AI performance.

The new DGX A100 costs ‘only’ US$199,000 and churns out 5 petaflops of AI performance, the most of any single system. It is also much smaller than the DGX-2, which has a height of 444mm. The DGX A100, with a height of only 264mm, fits within a 6U rack form factor. How can NVIDIA serve up something that’s half the cost of its predecessor and nearly half the size, yet more than doubles the performance?
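The "half the cost, more than double the performance" claim can be checked with a quick ratio calculation. The sketch below uses the list prices and heights quoted here, and takes AI performance in petaflops (5 for the DGX A100, as stated in the opening paragraph, and 2 for the DGX-2). The ratios do not depend on the unit as long as both systems use the same one. Class and field names are illustrative.

```python
# Cost per petaflop and size ratio from the figures quoted in the article.
# Class and field names are illustrative assumptions.

from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    price_usd: int
    ai_petaflops: float
    height_mm: int

    @property
    def usd_per_petaflop(self):
        """List price divided by quoted AI performance."""
        return self.price_usd / self.ai_petaflops


dgx_2 = System("DGX-2", 399_000, 2, 444)
dgx_a100 = System("DGX A100", 199_000, 5, 264)

price_ratio = dgx_a100.price_usd / dgx_2.price_usd        # about half
perf_ratio = dgx_a100.ai_petaflops / dgx_2.ai_petaflops   # 2.5x
height_ratio = dgx_a100.height_mm / dgx_2.height_mm       # about 0.6
value_gain = dgx_2.usd_per_petaflop / dgx_a100.usd_per_petaflop

print(f"price ratio:  {price_ratio:.2f}")
print(f"perf ratio:   {perf_ratio:.2f}")
print(f"height ratio: {height_ratio:.2f}")
print(f"cost per petaflop improves by {value_gain:.1f}x")
```

At list price that is US$199,500 per petaflop for the DGX-2 against US$39,800 for the DGX A100, roughly a fivefold improvement.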

For starters, the DGX A100 uses only 8 GPUs vs. 16 on the DGX-2, which is reason enough for massive cost savings from a silicon consumption and complexity management perspective. Fewer GPUs mean fewer NVSwitches deployed, and that count is also halved in the DGX A100. The reduced number of components means it needs one less plane to accommodate all its GPUs compared to the DGX-2. If you managed to play back the DGX A100 video intro above, you would have caught a glimpse of the back of the supercomputer and its simplified layout. The bottom plane houses all the redundant PSUs, the next layer houses all the external system connectivity options, and the single GPU plane sits at the top. All of this adds up to tremendous cost savings, while the improved GPU architecture and connectivity options ensure you get much more for less.

It doesn’t just stop there. Thanks to the Multi-Instance GPU (MIG) capability of the new A100 GPU, each GPU can partition itself into seven discrete instances, fully isolated from each other and running various workloads while still having their own slice of high bandwidth memory, cache, compute cores and more. With the DGX A100’s eight GPUs, this gives the administrator the ability to carve out up to 56 GPU instances. This key feature, along with massive processing throughput boosts, makes the DGX A100 the most versatile high-density compute system in the market, consolidating training, inferencing and analytics needs into a unified, easy-to-deploy AI system with unprecedented levels of performance.
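The 56-instance figure is just eight GPUs times seven instances each. As a rough illustration, the sketch below enumerates that maximum layout with hypothetical (GPU, instance) labels. It is plain bookkeeping and does not call any NVIDIA API or reflect how MIG profiles are actually configured.

```python
# Enumerate the maximum MIG layout described in the article: 8 GPUs x 7 instances.
# Labels are hypothetical; no NVIDIA API is used here.

GPUS_PER_SYSTEM = 8
MAX_INSTANCES_PER_GPU = 7


def mig_slots(gpus=GPUS_PER_SYSTEM, per_gpu=MAX_INSTANCES_PER_GPU):
    """Return a (gpu_index, instance_index) pair for every possible GPU instance."""
    return [(g, i) for g in range(gpus) for i in range(per_gpu)]


slots = mig_slots()
print(len(slots), "instances; first:", slots[0], "last:", slots[-1])
```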

What a typical AI data center looks like today.

In fact, NVIDIA has painted a picture of a typical AI data center setup today: 50 DGX-1 systems for AI training and 600 CPU systems for AI inferencing, which could occupy up to 25 racks, draw up to 630kW of power and cost about US$11 million in infrastructure alone. All of that can be consolidated into a single rack with just five DGX A100 AI supercomputers that do the job for about a million dollars while consuming 28kW of power. That’s a tremendous saving in infrastructure costs, running costs and carbon footprint.
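To put those savings into ratios, here is a short sketch using NVIDIA's figures as quoted above. It treats "a million dollars" as exactly US$1,000,000, which is an assumption, and the names are illustrative.

```python
# Compare the two data center setups using NVIDIA's figures as quoted in the article.
# "A million dollars" is taken as 1_000_000 (assumption); names are illustrative.

legacy = {"racks": 25, "power_kw": 630, "cost_usd": 11_000_000}
dgx_a100_rack = {"racks": 1, "power_kw": 28, "cost_usd": 1_000_000}


def reduction_factor(before, after):
    """How many times smaller the 'after' value is than the 'before' value."""
    return before / after


factors = {key: reduction_factor(legacy[key], dgx_a100_rack[key]) for key in legacy}
for key, factor in factors.items():
    print(f"{key}: {factor:.1f}x lower")

# Sanity check on the rounded price: five units at the US$199,000 list price.
five_units_usd = 5 * 199_000
print(f"five DGX A100 units at list price: US${five_units_usd:,}")
```

That works out to 25x fewer racks, 22.5x less power and 11x lower cost, and five units at list price come to US$995,000, which is consistent with the rounded million-dollar figure.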

How just 5 units of DGX A100 AI supercomputers could overhaul data centers completely.

Unlike its predecessors, the DGX A100 is no longer just another option in NVIDIA’s supercomputer lineup; it effectively retires both the DGX-1 and DGX-2 by a long shot. The performance-to-cost ratio is simply off the charts, comparatively speaking.

 

Availability and who’s adopting it?

Immediately available, DGX A100 systems have begun shipping worldwide, with the first order going to the U.S. Department of Energy’s (DOE) Argonne National Laboratory, which will use the cluster’s AI and computing power to better understand and fight COVID-19.

We’re using America’s most powerful supercomputers in the fight against COVID-19, running AI models and simulations on the latest technology available, like the NVIDIA DGX A100.  The compute power of the new DGX A100 systems coming to Argonne will help researchers explore treatments and vaccines and study the spread of the virus, enabling scientists to do years’ worth of AI-accelerated work in months or days. - Rick Stevens, associate laboratory director for Computing, Environment and Life Sciences at Argonne. 

Here are the other early adopters:

  • The Center for Biomedical AI - Based at the University Medical Center Hamburg-Eppendorf, Germany, it will leverage DGX A100 to advance clinical decision support and process optimization.

  • Chulalongkorn University - Thailand’s top research-intensive university will use DGX A100 to accelerate its pioneering research in areas such as Thai natural language processing, automatic speech recognition, computer vision and medical imaging.

  • German Research Center for Artificial Intelligence (DFKI) - It will use the DGX A100 systems to further accelerate its research on new deep learning methods and their explainability while significantly reducing space and energy consumption.

  • Harrison.ai - The Sydney-based healthcare AI company will deploy Australia’s first DGX A100 systems to accelerate the development of its AI-as-medical-device.

  • The UAE Artificial Intelligence Office - The first in the Middle East to deploy the new DGX A100, it is building a national infrastructure to accelerate AI research, development and adoption across the public and private sectors.

  • VinAI Research - Vietnam’s leading AI research lab, based in Hanoi and Ho Chi Minh City, will use DGX A100 to conduct high-impact research and accelerate the application of AI.
