NVIDIA goes big to support Generative AI with cloud services and inference platforms for every workload

NVIDIA AI Foundations helps businesses create custom 'ChatGPT'-like models using their own proprietary data, while the L4, L40, H100 NVL and Grace Hopper Superchip will help build and power these services at every scale.

Generative AI is a big deal now and NVIDIA is right at the center of it. (Image Source: NVIDIA)

Why is Generative AI all the fuss now?

Generative AI services like Bing AI, ChatGPT and more have recently come into the spotlight for using advanced algorithms to generate new data, visuals and more that look, feel and read as if they were produced by humans. Generative Adversarial Networks (GANs) are ideal for creating visual content, while Generative Pre-trained Transformer (GPT) language models parse data already available on the internet (or other proprietary data sets supplied) to generate an output, from an answer to a query all the way to entire 'new' articles. Using AI for these tasks isn't new, but breakthroughs in how these models understand queries and produce output that is far more usable, legible and relatable to the average user are what make generative AI such a potent tool.
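To picture what a GPT-style model does in practice, here's a minimal sketch of text generation using the open-source Hugging Face transformers library and the small public GPT-2 checkpoint. Neither is part of NVIDIA's announcement; they simply stand in to show how a pre-trained language model continues a prompt with newly generated text.

```python
# Minimal illustration of GPT-style text generation.
# Assumes the open-source Hugging Face "transformers" package and the public
# GPT-2 checkpoint; both are illustrative stand-ins, not NVIDIA services.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model continues the prompt with newly generated text.
result = generator(
    "Generative AI matters to enterprises because",
    max_new_tokens=40,
    num_return_sequences=1,
)
print(result[0]["generated_text"])
```

Swap in a larger or domain-tuned checkpoint and the same few lines produce far more usable, legible output, which is essentially the leap that has put generative AI in the spotlight.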

The concerns are, of course, plenty and controversial, but these are still early days. The technology is also ideal for generating more data to train and improve the very models that can fast-track menial or mundane tasks requiring some form of inferencing to take the next best step forward. It could even help make fully autonomous cars a reality: a driving simulation model can be continually trained on an endless variety of virtually generated environmental data, building an ever more robust pre-trained model.

For more insights on this subject, and on the various industries that stand to gain from generative AI, here's further reading from global consulting firm McKinsey & Company.

(And no, this article wasn't churned out by generative AI.)

Enter NVIDIA AI Foundations: Enabling companies to create in-house custom generative AI models

Now that we know why generative AI is so valuable and important, that brings us to NVIDIA's big push to support enterprises with cloud services to create their own customized large language models (LLMs, of which ChatGPT is a prime example) and visual generation models for AI applications. More specifically, these custom generative AI models are developed and trained on a company's own proprietary data for its unique, domain-specific offerings.

(Image Source: NVIDIA)

This is made possible with NVIDIA AI Foundations, which is a set of cloud services to enable businesses to build, refine and operate such LLMs and generative AI models.

  • NVIDIA NeMo cloud service enables developers to make large language models (LLMs) more relevant for businesses by defining areas of focus, adding domain-specific knowledge and teaching functional skills.

  • NVIDIA Picasso is a cloud service for building and deploying generative AI-powered image, video and 3D applications, with advanced text-to-image, text-to-video and text-to-3D capabilities to supercharge productivity for creativity, design and digital simulation through simple cloud APIs.

  • NVIDIA BioNeMo is a new cloud service that debuted today to accelerate life science research, drug discovery, protein engineering and research in the fields of genomics, biology, chemistry and molecular dynamics.

These services run on NVIDIA DGX Cloud, which is accessible via a browser. They are currently in private preview and available to early-access customers. Developers can use the models offered on each service through simple APIs, and when the models are ready for deployment, enterprises can run inference workloads at scale using the NVIDIA AI Foundations cloud services.
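NVIDIA hasn't detailed the exact API surface in this announcement, so the snippet below is only a rough sketch of what calling a hosted generative AI model through a 'simple API' typically looks like; the endpoint URL, key and JSON fields are hypothetical placeholders, not the actual NeMo or Picasso interface.

```python
# Hypothetical sketch of querying a hosted generative AI endpoint over REST.
# The URL, credential and JSON schema are placeholders, not NVIDIA's real API.
import requests

API_URL = "https://example-ai-cloud.invalid/v1/completions"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                                      # placeholder credential

payload = {
    "model": "my-custom-domain-llm",  # a hypothetical customized model name
    "prompt": "Summarize our Q3 support tickets in three bullet points.",
    "max_tokens": 128,
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())
```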

Industry Leaders team up with NVIDIA to advance productivity for creative professionals

Adobe, Getty Images, Shutterstock, and Morningstar are among the companies creating AI models, applications, and services with the newly announced NVIDIA AI Foundations.

Adobe and NVIDIA today announced they will expand their longstanding research and development partnership to create the next generation of generative AI models. To accelerate the workflows of creators and marketers, some of these models will be jointly developed and brought to market through Adobe Creative Cloud flagship products like Photoshop, Premiere Pro and After Effects, as well as through NVIDIA Picasso.

NVIDIA and Getty Images are collaborating to train responsible generative text-to-image and text-to-video foundation models. The models will allow the creation of images and video using simple text prompts and will be trained on Getty Images’ fully licensed assets.

NVIDIA and Shutterstock are collaborating to train a generative text-to-3D foundation model using the NVIDIA Picasso service to simplify the creation of detailed 3D models and reduce the time required to build 3D models from days to minutes.

New GPUs power Inference Platforms to tackle various Generative AI workloads

(Image source: NVIDIA)

To augment NVIDIA's push to help create new and emerging custom generative AI models via the NVIDIA AI Foundations cloud services, the company has also launched a slew of new GPUs and inference platforms, based on the NVIDIA Ada Lovelace, Hopper and Grace Hopper processors, to help developers build and power these new AI applications.

"The rise of generative AI is requiring more powerful inference computing platforms. The number of applications for generative AI is infinite, limited only by human imagination. Arming developers with the most powerful and flexible inference computing platform will accelerate the creation of new services that will improve our lives in ways not yet imaginable." – Jensen Huang, founder and CEO of NVIDIA

1) NVIDIA L4 for AI Video

(Image source: NVIDIA)

The new NVIDIA L4 is the direct replacement for the popular T4 GPU, which brought Tensor Cores to a card designed expressly for AI inferencing workloads, where novel data inputs are analyzed to predict and estimate a desired outcome based on pre-trained models.

The T4 was powered by the Turing microarchitecture, which was the first to support and accelerate ray-traced workloads. The new L4 is based on the Ada Lovelace GPU architecture (the same one that powers the GeForce RTX 40 series) and supports AI-powered DLSS 3; it is rated to deliver over a 4x speed-up in real-time rendering performance in Omniverse and can dish out 3x higher ray-traced performance.

With this enhanced throughput, the L4 GPU is positioned for AI video workloads, tackling real-time video decoding, transcoding, video content moderation, language translation, and video call enhancement features such as background replacement, relighting, eye contact, augmented reality and more. The new GPU's dual AV1 encoders are another reason the L4 is ideal for these AI video tasks. In fact, a single 8-GPU L4 server can replace over a hundred traditional dual-socket CPU servers for processing AI video, which translates to massive savings in total cost of ownership over older infrastructure.
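For a flavour of the transcoding side of those AI video workloads, here's a small sketch that drives FFmpeg's hardware AV1 encoder (av1_nvenc, available on Ada Lovelace GPUs such as the L4) from Python. It assumes an FFmpeg build compiled with NVIDIA hardware acceleration; the file names are placeholders, and this is a generic GPU transcode example rather than an NVIDIA-published pipeline.

```python
# Illustrative GPU-accelerated transcode to AV1 using FFmpeg's NVENC encoder.
# Assumes an FFmpeg build with NVIDIA hardware acceleration (av1_nvenc);
# input/output file names are placeholders.
import subprocess

cmd = [
    "ffmpeg",
    "-hwaccel", "cuda",        # decode on the GPU
    "-i", "input.mp4",         # placeholder input file
    "-c:v", "av1_nvenc",       # encode with the hardware AV1 encoder
    "-b:v", "4M",              # target video bitrate
    "-c:a", "copy",            # pass audio through untouched
    "output_av1.mp4",
]
subprocess.run(cmd, check=True)
```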

Better yet, the L4 retains the same low-profile form factor and a similar 72W power envelope, which makes upgrading existing T4-powered servers to the L4 a breeze while improving AI inferencing prowess by a good margin.

| Graphics Card | L4 | T4 |
| --- | --- | --- |
| GPU | Ada Lovelace | Turing (TU104) |
| Process | 4nm (TSMC) | 12nm FinFET (TSMC) |
| CUDA cores | TBD | 2,560 |
| Tensor Cores | Yes (4th Gen) | 320 (2nd Gen) |
| Tensor Performance¹ (FP16) | 242 TFLOPS | 65 TFLOPS |
| RT Cores | Yes (Gen 3) | 40 (Gen 1) |
| RT Performance | 2x of T4 | TBD |
| GPU base / boost clock speeds | 795MHz / 2,040MHz | 585MHz / 1,590MHz |
| Memory | 24GB GDDR6 with ECC | 16GB GDDR6 with ECC |
| Memory clock speed | 6,251MHz | 5,000MHz |
| Memory bus width | 192-bit | 256-bit |
| Memory bandwidth | 300GB/s | 320GB/s |
| Interface | PCIe 4.0 x16 | PCIe 3.0 x16 |
| Form factor | 1-slot, low profile | 1-slot, low profile |
| TDP | 72W | 70W |

1. Effective Tensor performance with and without using the Sparsity feature.

In fact, one of the first deployments of the NVIDIA L4 is in Google Cloud, which offers it through its new G2 Compute Engine VM family, bringing significant performance improvements to HPC, graphics and video transcoding, as well as better performance per dollar for handling AI inferencing in the cloud to tackle the explosive field of generative AI.

2) L40 for Image Generation

(Image source: NVIDIA)

The L40 was actually announced in 2022, but it wasn't until recently that it saw some action. Based on the Ada Lovelace RTX GPU with over 18,000 CUDA processing cores and 142 RT Cores, the L40 packs quite a punch, as these specs place it well ahead of what the RTX 4090 packs. But unlike the RTX 4090, which is optimized for high clock speeds, rasterization and ray-traced performance with active cooling and a higher power budget, the L40 is a passively cooled design with a 300W TDP that is meant to take advantage of the airflow paths designed within rack servers.

| Graphics Card | L40 | RTX 6000 Ada Generation | RTX 4090 | A40 |
| --- | --- | --- | --- | --- |
| Class | Data Centre | Professional | Consumer | Data Centre |
| GPU | Ada Lovelace (AD102) | Ada Lovelace (AD102) | Ada Lovelace (AD102) | Ampere (GA102) |
| Process | 4nm (TSMC) | 4nm (TSMC) | 4nm (TSMC) | 8nm (Samsung) |
| Transistors | 76 billion | 76 billion | 76 billion | 28 billion |
| Streaming Multiprocessors (SM) | 142 | 142 | 128 | 84 |
| CUDA cores | 18,176 | 18,176 | 16,384 | 10,752 |
| Tensor Cores | 568 (Gen 4) | 568 (Gen 4) | 512 (Gen 4) | 336 (Gen 3) |
| Tensor Performance¹ (FP16) | 362.1 TFLOPS | TBD | TBD | 299.4 TFLOPS |
| RT Cores | 142 (Gen 3) | 142 (Gen 3) | 128 (Gen 3) | 84 (Gen 2) |
| RT Performance | 209 TFLOPS | 210 TFLOPS | TBD | 58 - 75.6² TFLOPS |
| GPU base / boost clocks (MHz) | 735 / 2,490 | TBD | 2,230 / 2,520 | 1,305 / 1,740 |
| Memory | 48GB GDDR6 with ECC | 48GB GDDR6 with ECC | 24GB GDDR6X | 48GB GDDR6 with ECC |
| Memory bus width | 384-bit | 384-bit | 384-bit | 384-bit |
| Memory bandwidth | 864GB/s | 960GB/s | 1,018GB/s | 696GB/s |
| Interface | PCIe 4.0 x16 | PCIe 4.0 x16 | PCIe 4.0 x16 | PCIe 4.0 x16 |
| NVLink | No | No | No | Yes |
| TDP | 300W | 300W | 450W | 300W |
| Price (at launch) | -- | US$6,800 | US$1,599 | -- |
(Image source: NVIDIA)

1. Effective Tensor performance with and without using the Sparsity feature.

2. Peak rates based on GPU Boost Clock.

The L40 also packs 48GB of GDDR6 memory with ECC, perfect for Omniverse Enterprise, rendering, 3D graphics, NVIDIA RTX virtual workstation, AI training and data science. In fact, it’s the backbone of the NVIDIA OVX Server that’s meant for building large-scale Omniverse digital twins.
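As a rough illustration of the text-to-image workloads the L40 is aimed at, the sketch below runs an open-source Stable Diffusion checkpoint through Hugging Face's diffusers library. It is a stand-in example rather than NVIDIA Picasso, and the checkpoint name, prompt and output path are arbitrary.

```python
# Minimal text-to-image sketch with the open-source diffusers library.
# Stable Diffusion stands in for the kind of image generation workload the
# L40 targets; this is not NVIDIA's Picasso service.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # public checkpoint, used only as an example
    torch_dtype=torch.float16,         # half precision to reduce VRAM usage
).to("cuda")

image = pipe("a photorealistic render of a data centre GPU server").images[0]
image.save("sample.png")
```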

3) H100 NVL for large language model (LLM) deployment

Note the dual-card NVLink'ed H100 NVL pair, and there are four of them in this server for illustration. (Image source: NVIDIA)

The H100, based on the Hopper GPU architecture, is an awesome product focused on data center AI acceleration: it forgoes RT Cores and packs a far speedier memory interface to connect with HBM memory. As fast as the H100 is, NVIDIA is aware that it needs to do even more to be the driver powering generative AI services like ChatGPT at scale. At GTC 2023, NVIDIA announced the H100 NVL, a dual PCIe card solution in which the two cards are NVLink'ed to each other. To make it more capable than two existing H100 PCIe cards (which pack 80GB of memory each), the new H100 NVL packs 94GB per card, for a grand total of 188GB of graphics memory and 7.8TB/s of memory bandwidth. Additionally, the GPU configuration of the H100 NVL is identical to the H100 SXM SKU, so the H100 NVL is much faster than the H100 PCIe, even if the latter were NVLink'ed.
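NVIDIA hasn't shared deployment code for the H100 NVL, but the general pattern of serving a model too large for a single GPU looks something like the sketch below, which uses the open-source transformers and accelerate libraries to shard a checkpoint across two GPUs; the model name is a placeholder.

```python
# Sketch: sharding a large language model across two GPUs (e.g. an NVLink'ed pair).
# Uses the open-source transformers + accelerate stack; the checkpoint name is a
# placeholder and this is not an NVIDIA-published deployment recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/your-large-llm"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # let accelerate split the layers across both GPUs
)

inputs = tokenizer("Explain NVLink in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```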

| Graphics Card | H100 NVL | H100 SXM5 | H100 PCIe |
| --- | --- | --- | --- |
| GPU | Hopper (GH100) x 2 | Hopper (GH100) | Hopper (GH100) |
| Process | 4N (TSMC) | 4N (TSMC) | 4N (TSMC) |
| FP32 performance | 134 TFLOPS | 67 TFLOPS | 51 TFLOPS |
| FP16 Tensor Performance¹ | 3,958 TFLOPS | 1,979 TFLOPS | 1,513 TFLOPS |
| GPU boost clock speeds | TBD | TBD | TBD |
| GPU memory | 188GB HBM3 (94GB x 2) | 80GB HBM3 | 80GB HBM2e |
| Memory clock speed | TBD | TBD | TBD |
| Memory bus width | TBD | 5,120-bit | 5,120-bit |
| Memory bandwidth | 7.8TB/s | 3.35TB/s | 2TB/s |
| Interconnect | 3rd-gen NVLink Bridge (600GB/s) + PCIe 5.0 | 4th-gen NVLink (900GB/s) + PCIe 5.0 | 3rd-gen NVLink (600GB/s) + PCIe 5.0 |
| GPU board form factor | Dual PCIe 5.0, air-cooled | SXM5 | PCIe 5.0, air-cooled |
| TDP | 2x 350-400W (configurable) | 700W | 350W |
| Price | -- | -- | -- |

1. Effective Tensor performance with and without using the Sparsity feature.

 

According to NVIDIA, an H100 NVL-equipped server (with four H100 NVL pairs) is over 10x faster at processing GPT-3 than an HGX A100 server (with eight A100 SXM GPUs). That's a phenomenal increase in language model processing.

4) NVIDIA Grace Hopper for Recommendation Models

(Image source: NVIDIA YouTube)

Lastly, NVIDIA also has the Grace Hopper Superchip to process giant data sets for AI databases and graph recommendation models. The module's super-fast, low-latency chip-to-chip NVLink-C2C link delivers 900GB/s of interconnect bandwidth between the Arm-based Grace CPU and the Hopper GPU, more than seven times faster than PCI Express 5.0. This allows a giant query to be processed on the CPU and then immediately transferred over to the Hopper GPU for inference processing.
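The CPU-prepares, GPU-infers handoff described above can be pictured with a generic PyTorch sketch like the one below. It simply stages a batch in pinned host memory and copies it asynchronously to the GPU; it is not specific to Grace Hopper or NVLink-C2C, and the model is a trivial stand-in.

```python
# Generic CPU-prepare / GPU-infer handoff in PyTorch; not Grace Hopper specific.
import torch

# Pretend this is a large query or feature batch assembled on the CPU.
batch = torch.randn(4096, 1024).pin_memory()     # pinned host memory speeds up copies

device = torch.device("cuda")
gpu_batch = batch.to(device, non_blocking=True)  # asynchronous host-to-device transfer

# A trivial stand-in model runs inference on the GPU once the data arrives.
model = torch.nn.Linear(1024, 256).to(device)
with torch.inference_mode():
    out = model(gpu_batch)
print(out.shape)
```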
