
NVIDIA goes big to support Generative AI with cloud services and inference platforms for every workload

By Vijay Anand - 22 Mar 2023

Generative AI is a big deal now and NVIDIA is right at the center of it. (Image Source: NVIDIA)

Why is Generative AI all the fuss now?

Generative AI services such as Bing AI and ChatGPT have recently come into the spotlight for using advanced algorithms to generate new text, visuals and other data that look, feel and read as if they were produced by humans. Generative Adversarial Networks (GANs) are well suited to creating visual content, while Generative Pre-trained Transformer (GPT) language models parse data already available on the internet (or other proprietary data sets supplied to them) to generate output ranging from an answer to a query all the way to entire ‘new’ articles. Using AI for these tasks isn’t new, but breakthroughs in how these models understand queries and produce output that is far more usable, legible and relatable to the average user are what make generative AI such a potent tool.
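
To make the idea concrete, here's a minimal sketch of text generation with an openly available pre-trained language model (GPT-2, via the open-source Hugging Face transformers library). It illustrates the general technique only; it is not ChatGPT or any service mentioned in this article.

    # Minimal sketch: a pre-trained generative language model continuing a prompt.
    # Uses the public GPT-2 checkpoint purely as an illustration.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    prompt = "Generative AI matters because"
    outputs = generator(prompt, max_new_tokens=40, num_return_sequences=1)

    # The model returns the prompt plus its generated continuation.
    print(outputs[0]["generated_text"])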

The concerns are, of course, plentiful and controversial, but these are still early days. The technology is also well suited to creating more data to train and improve the various models that can fast-track menial or mundane tasks requiring some form of inferencing to take the next best step forward. It could even help make fully autonomous cars a reality: a driving simulation model can be continually trained on an endless variety of virtually generated environmental data, building an ever more robust pre-trained model.

For more insights, here's further reading on this subject from global consulting firm McKinsey & Company, including the various industries that stand to gain from generative AI.

(And no, this article wasn't churned with generative AI.)

Enter NVIDIA AI Foundations: Enabling companies to create in-house custom generative AI models

Now that we know why generative AI is so valuable and important, that brings us to NVIDIA’s big push to support enterprises with cloud services for creating their own customized large language models (LLMs, of which ChatGPT is a prime example) and visual generation models for AI applications. More specifically, these custom generative AI models are developed and trained on a company’s own proprietary data for its unique, domain-specific offerings.

(Image Source: NVIDIA)

This is made possible with NVIDIA AI Foundations, which is a set of cloud services to enable businesses to build, refine and operate such LLMs and generative AI models.

  • NVIDIA NeMo cloud service enables developers to make large language models (LLMs) more relevant for businesses by defining areas of focus, adding domain-specific knowledge and teaching functional skills.
     
  • NVIDIA Picasso is a cloud service for building and deploying generative AI-powered image, video and 3D applications with advanced text-to-image, text-to-video and text-to-3D capabilities to supercharge productivity for creativity, design and digital simulation through simple cloud APIs.
     
  • NVIDIA BioNeMo is a new cloud service that debuted today to accelerate life science research, drug discovery, protein engineering and research in the fields of genomics, biology, chemistry and molecular dynamics.

These services run on NVIDIA DGX Cloud, which is accessible via a browser. They are currently in private preview and available to early-access customers. Developers can use the models offered by each service through simple APIs, and when the models are ready for deployment, enterprises can run inference workloads at scale using the NVIDIA AI Foundations cloud services.
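
As a rough illustration of that "simple API" workflow, here's a hypothetical sketch of a client calling a hosted generative AI service over HTTP. The endpoint, payload fields, credential and response format below are invented placeholders, not NVIDIA's actual NeMo or Picasso APIs, which were still in private preview at the time of writing.

    # Hypothetical sketch only: placeholder endpoint and fields, not NVIDIA's API.
    # Pattern: send a prompt to a hosted custom model, get a generated completion back.
    import os
    import requests

    API_URL = "https://api.example.com/v1/generate"          # placeholder endpoint
    API_KEY = os.environ.get("EXAMPLE_API_KEY", "demo-key")   # placeholder credential

    payload = {
        "model": "my-custom-llm",   # a company's domain-tuned model (placeholder name)
        "prompt": "Summarise our Q1 support tickets in three bullet points.",
        "max_tokens": 200,
    }

    resp = requests.post(API_URL, json=payload,
                         headers={"Authorization": f"Bearer {API_KEY}"},
                         timeout=30)
    resp.raise_for_status()
    print(resp.json()["text"])      # response field name is assumed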

Industry Leaders team up with NVIDIA to advance productivity for creative professionals

Adobe, Getty Images, Shutterstock, and Morningstar are among the companies creating AI models, applications, and services with the newly announced NVIDIA AI Foundations.

Adobe announced that it will expand its longstanding research and development partnership with NVIDIA to create the next generation of generative AI models. To accelerate the workflows of creators and marketers, some of these models will be jointly developed and brought to market through Adobe Creative Cloud flagship products like Photoshop, Premiere Pro and After Effects, as well as through NVIDIA Picasso.

NVIDIA and Getty Images are collaborating to train responsible generative text-to-image and text-to-video foundation models. The models will allow the creation of images and video using simple text prompts and will be trained on Getty Images’ fully licensed assets.

NVIDIA and Shutterstock are collaborating to train a generative text-to-3D foundation model using the NVIDIA Picasso service to simplify the creation of detailed 3D models and reduce the time required to build 3D models from days to minutes.

New GPUs power Inference Platforms to tackle various Generative AI workloads

(Image source: NVIDIA)

To augment its push to help create new and emerging custom generative AI models via the NVIDIA AI Foundations cloud services, NVIDIA has also launched a slew of new GPUs and inference platforms, based on the Ada Lovelace, Hopper and Grace Hopper processors, to help developers build and power these new AI applications.

The rise of generative AI is requiring more powerful inference computing platforms. The number of applications for generative AI is infinite, limited only by human imagination. Arming developers with the most powerful and flexible inference computing platform will accelerate the creation of new services that will improve our lives in ways not yet imaginable. – Jensen Huang, Founder and CEO of NVIDIA. 

1) NVIDIA L4 for AI Video

(Image source: NVIDIA)

The new NVIDIA L4 is the direct replacement for the popular T4, a GPU designed expressly for AI inferencing workloads, which use pre-trained models to analyse novel data inputs and predict or estimate a desired outcome.

The T4 was powered by the Turing microarchitecture, the first to support and accelerate ray-traced workloads. The new L4 is based on the Ada Lovelace GPU architecture (the same one that powers the GeForce RTX 40 series) and supports AI-powered DLSS 3; it is rated to deliver over a 4x speed-up in real-time rendering performance in Omniverse and up to 3x higher ray-traced performance.

With this enhanced throughput, the L4 GPU is positioned for AI video workloads, tackling real-time video decoding, transcoding, video content moderation, language translation, and video call enhancement features such as background replacement, relighting, eye contact, augmented reality and more. The new GPU’s dual AV1 encoders are another reason the L4 is ideal for these AI video tasks. In fact, a single 8-GPU L4 server can replace over a hundred traditional dual-socket CPU servers for processing AI video, a massive saving in total cost of ownership over older infrastructure.

Better yet, the L4 retains the same low-profile form factor and a similar 72W power envelope, which makes upgrading existing T4-powered servers to the L4 a breeze while improving AI inferencing prowess by a good margin.
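
For a small taste of the kind of video pipeline those AV1 encoders are built for, the sketch below drives ffmpeg from Python to perform a GPU-accelerated AV1 transcode. It assumes an ffmpeg build with NVIDIA's av1_nvenc encoder enabled and an AV1-capable GPU such as the L4; the file names and bitrate are placeholders.

    # Sketch: hardware-accelerated AV1 transcode via ffmpeg's NVENC encoder.
    # Assumes ffmpeg was built with av1_nvenc and an AV1-capable NVIDIA GPU is present.
    import subprocess

    cmd = [
        "ffmpeg", "-y",
        "-hwaccel", "cuda",      # decode on the GPU where possible
        "-i", "input.mp4",       # placeholder source clip
        "-c:v", "av1_nvenc",     # hardware AV1 encode on the GPU
        "-b:v", "4M",            # target bitrate for the transcode
        "-c:a", "copy",          # pass the audio track through untouched
        "output.mkv",            # placeholder output file
    ]
    subprocess.run(cmd, check=True)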

NVIDIA GPUs compared
Graphics Card                 L4                       T4
GPU                           Ada Lovelace             Turing (TU104)
Process                       4nm (TSMC)               12nm FinFET (TSMC)
CUDA cores                    TBD                      2560
Tensor Cores                  Yes (4th Gen)            320 (2nd Gen)
Tensor Performance 1 (FP16)   242 TFLOPS               65 TFLOPS
RT Cores                      Yes (Gen 3)              40 (Gen 1)
RT Performance                2x of T4                 TBD
GPU base / boost clock speeds 795MHz / 2040MHz         585MHz / 1590MHz
Memory                        24GB GDDR6 with ECC      16GB GDDR6 with ECC
Memory clock speed            6,251MHz                 5,000MHz
Memory bus width              192-bit                  256-bit
Memory bandwidth              300GB/s                  320GB/s
Interface                     PCIe 4.0 x16             PCIe 3.0 x16
Form Factor                   1-slot, Low Profile      1-slot, Low Profile
TDP                           72W                      70W

1. Effective Tensor performance with and without using the Sparsity feature.

In fact, one of the first deployments of the NVIDIA L4 is in Google Cloud, which offers it as part of the new G2 VM family on Compute Engine. The G2 instances deliver significant performance improvements for HPC, graphics and video transcoding workloads, in addition to better performance per dollar for AI inferencing in the cloud to tackle the explosive field of generative AI.

2) L40 for Image Generation

(Image source: NVIDIA)

The L40 was actually announced in 2022, but it wasn’t until recently that it saw some action. Based on the Ada Lovelace RTX GPU with over 18,000 CUDA processing cores and 142 RT Cores, the L40 packs quite a punch, as these specs place it ahead of even the RTX 4090. But unlike the RTX 4090, which is optimized for high clock speeds, rasterization and ray-traced performance with active cooling and a higher power budget, the L40 is a passively cooled design with a 300W TDP, meant to take advantage of the airflow paths designed into rack servers.

NVIDIA GPUs compared
Graphics Card                   L40                     RTX 6000 Ada Generation  RTX 4090                A40
Class                           Data Centre             Professional             Consumer                Data Centre
GPU                             Ada Lovelace (AD102)    Ada Lovelace (AD102)     Ada Lovelace (AD102)    Ampere (GA102)
Process                         4nm (TSMC)              4nm (TSMC)               4nm (TSMC)              8nm (Samsung)
Transistors                     76 billion              76 billion               76 billion              28 billion
Streaming Multiprocessors (SM)  142                     142                      128                     84
CUDA cores                      18176                   18176                    16384                   10752
Tensor Cores                    568 (Gen 4)             568 (Gen 4)              512 (Gen 4)             336 (Gen 3)
Tensor Performance 1 (FP16)     362.1 TFLOPS            TBD                      TBD                     299.4 TFLOPS
RT Cores                        142 (Gen 3)             142 (Gen 3)              128 (Gen 3)             84 (Gen 2)
RT Performance                  209 TFLOPS              210 TFLOPS               TBD                     58 - 75.62 TFLOPS
GPU base / boost clocks (MHz)   735 / 2490              TBD                      2230 / 2520             1305 / 1740
Memory                          48GB GDDR6 with ECC     48GB GDDR6 with ECC      24GB GDDR6X             48GB GDDR6 with ECC
Memory bus width                384-bit                 384-bit                  384-bit                 384-bit
Memory bandwidth                864GB/s                 960GB/s                  1,018GB/s               696GB/s
Interface                       PCIe 4.0 x16            PCIe 4.0 x16             PCIe 4.0 x16            PCIe 4.0 x16
NVLink                          No                      No                       No                      Yes
TDP                             300W                    300W                     450W                    300W
Price (at launch)               --                      US$6,800                 US$1,599                --

1. Effective Tensor performance with and without using the Sparsity feature.
2. Peak rates based on GPU Boost Clock.


(Image source: NVIDIA)

The L40 also packs 48GB of GDDR6 memory with ECC, making it perfect for Omniverse Enterprise, rendering, 3D graphics, NVIDIA RTX Virtual Workstation, AI training and data science. In fact, it’s the backbone of the NVIDIA OVX server that’s meant for building large-scale Omniverse digital twins.
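
For a sense of the image-generation workloads the L40 is aimed at, here's a minimal sketch using the open-source Hugging Face diffusers library with a publicly available Stable Diffusion checkpoint. It's an illustration only; it is not NVIDIA Picasso or any model mentioned in this article.

    # Sketch: a text-to-image generation workload running on a CUDA GPU.
    # Uses a public Stable Diffusion checkpoint purely as an example.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",   # public checkpoint used as an example
        torch_dtype=torch.float16,          # FP16 to make use of the Tensor Cores
    )
    pipe = pipe.to("cuda")                  # run on the data-centre GPU

    image = pipe("an isometric render of a futuristic data centre").images[0]
    image.save("render.png")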

3) H100 NVL for large language model (LLM) deployment

Note the dual-card NVLink'ed H100 NVL pair; four such pairs are shown in this server for illustration. (Image source: NVIDIA)

The Hopper-based H100 is an awesome product focused on data center AI acceleration: it forgoes RT Cores and packs a far speedier memory interface connected to HBM memory. As fast as the H100 is, NVIDIA is already aware that it needs to do more to power generative AI services like ChatGPT at scale. At GTC 2023, NVIDIA announced the H100 NVL, a pair of PCIe cards NVLink’ed to each other. To make it more capable than two existing H100 PCIe cards (which pack 80GB of memory each), each H100 NVL card carries 94GB, for a grand total of 188GB of graphics memory and 7.8TB/s of combined memory bandwidth. Additionally, the GPU configuration of the H100 NVL is identical to the H100 SXM SKU, so the H100 NVL is much faster than the H100 PCIe, even if the latter were NVLink’ed.
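
As a rough sketch of what LLM deployment across a pair (or more) of GPUs in one server looks like in practice, the snippet below uses the open-source transformers and accelerate libraries with device_map="auto" to shard a model's layers across all visible GPUs. The checkpoint name is a placeholder for whatever model is actually being served.

    # Sketch: multi-GPU LLM inference, sharding the model across available GPUs.
    # Requires the `accelerate` package for device_map="auto"; model name is a placeholder.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "my-org/my-large-llm"      # placeholder checkpoint

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,          # half precision to fit more of the model
        device_map="auto",                  # spread layers across the visible GPUs
    )

    inputs = tokenizer("Explain NVLink in one sentence.", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=60)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))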

NVIDIA H100 variants
Graphics Card               H100 NVL                                     H100 SXM5                             H100 PCIe
GPU                         Hopper (GH100) x 2                           Hopper (GH100)                        Hopper (GH100)
Process                     4N (TSMC)                                    4N (TSMC)                             4N (TSMC)
FP32 performance            134 TFLOPS                                   67 TFLOPS                             51 TFLOPS
FP16 Tensor Performance 1   3,958 TFLOPS                                 1,979 TFLOPS                          1,513 TFLOPS
GPU boost clock speeds      TBD                                          TBD                                   TBD
GPU Memory                  188GB HBM3 (94GB x 2)                        80GB HBM3                             80GB HBM2e
Memory clock speed          TBD                                          TBD                                   TBD
Memory bus width            TBD                                          5120-bit                              5120-bit
Memory bandwidth            7.8TB/s                                      3.35TB/s                              2TB/s
Interconnect                3rd-gen NVLink Bridge (600GB/s) + PCIe 5.0   4th-gen NVLink (900GB/s) + PCIe 5.0   3rd-gen NVLink (600GB/s) + PCIe 5.0
GPU board form factor       Dual PCIe 5.0, air-cooled                    SXM5                                  PCIe 5.0, air-cooled
TDP                         2x 350-400W (configurable)                   700W                                  350W
Price                       --                                           --                                    --

1. Effective Tensor performance with and without using the Sparsity feature.
 

According to NVIDIA, a server equipped with four H100 NVL pairs is over 10x faster than an HGX A100 server (with eight A100 SXM GPUs) at processing GPT-3. That’s a phenomenal increase in language model processing.

4) NVIDIA Grace Hopper for Recommendation Models

(Image source: NVIDIA YouTube)

Lastly, NVIDIA also has the Grace Hopper Superchip to process giant data sets for AI databases and graph recommendation models. The module’s super-fast, low-latency chip-to-chip NVLink-C2C interconnect provides 900GB/s of bandwidth between the ARM-based Grace CPU and the Hopper GPU. This allows a giant query to be processed on the CPU and then handed straight to the Hopper GPU for inference over a link that’s roughly seven times faster than PCI Express 5.0 (about 900GB/s versus the ~128GB/s of a PCIe 5.0 x16 link).
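
The general pattern, independent of Grace Hopper itself, looks something like the PyTorch sketch below: assemble a large batch of features in CPU memory, then hand it to the GPU for inference. On a conventional server that hand-off crosses PCIe; on Grace Hopper it crosses the much wider NVLink-C2C link. The model and data here are placeholders.

    # Sketch: CPU-side query preparation followed by GPU-side inference.
    # The model and tensor shapes are placeholders for a real recommendation workload.
    import torch

    # Placeholder recommendation-style model living on the GPU.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 1),
    ).to("cuda").eval()

    # "Giant query" assembled in CPU memory (random placeholder features, pinned for fast transfer).
    batch_cpu = torch.randn(65536, 1024, pin_memory=True)

    with torch.no_grad():
        batch_gpu = batch_cpu.to("cuda", non_blocking=True)   # CPU -> GPU hand-off
        scores = model(batch_gpu)                             # inference on the GPU

    print(scores.shape)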
