NVIDIA goes big to support Generative AI with cloud services and inference platforms for every workload

NVIDIA AI Foundations helps businesses create custom ChatGPT-style models from their proprietary data, while the L4, L40, H100 NVL and Grace Hopper superchip help build and power these services at every scale.

By Vijay Anand - 22 Mar 2023

Generative AI is a big deal now and NVIDIA is right at the center of it. (Image Source: NVIDIA)

Why is Generative AI all the fuss now?

Generative AI services such as Bing AI and ChatGPT have recently come into the spotlight for using advanced algorithms to generate new data, visuals and more that look, feel and read as if they were produced by humans. Generative Adversarial Networks (GANs) are well suited to creating visual content, while Generative Pre-trained Transformer (GPT) language models draw on data already available on the internet (or on proprietary data sets supplied to them) to generate an output, from an answer to a query all the way to entire ‘new’ articles. The use of AI for these tasks isn’t new, but breakthroughs in how it understands queries and produces output that is far more usable, legible and relatable to the average user are what make generative AI a potent tool.

The concerns are plentiful and, of course, controversial, but these are still early days. Even so, the technology is well suited to creating more data to train and improve the various models that can fast-track menial or mundane tasks requiring some form of inferencing to decide the next best step. It could even help make fully autonomous cars a reality: an autonomous driving simulation model can be trained continually on endless varieties of virtually generated environmental data, building an ever more solid pre-trained model.

For more insights, and a look at the various industries that stand to gain from generative AI, here's more reading on the subject from global consulting firm McKinsey & Company.

(And no, this article wasn't churned with generative AI.)

Enter NVIDIA AI Foundations: Enabling companies to create in-house custom generative AI models

So now that we know why generative AI is so valuable and important, that brings us to NVIDIA’s big push to support enterprises with cloud services for creating their own customized large language models (LLMs, of which ChatGPT is a prime example) and visual generation models for AI applications. More specifically, these custom generative AI models are developed and trained with a company’s own proprietary data for its unique domain-specific offerings.

This is made possible with NVIDIA AI Foundations, which is a set of cloud services to enable businesses to build, refine and operate such LLMs and generative AI models.

NVIDIA NeMo cloud service enables developers to make large language models (LLMs) more relevant for businesses by defining areas of focus, adding domain-specific knowledge and teaching functional skills.
NVIDIA Picasso is a cloud service for building and deploying generative AI-powered image, video and 3D applications with advanced text-to-image, text-to-video and text-to-3D capabilities to supercharge productivity for creativity, design and digital simulation through simple cloud APIs.
NVIDIA BioNeMo is a new cloud service that debuted today to accelerate life science research, drug discovery, protein engineering and research in the fields of genomics, biology, chemistry and molecular dynamics.

These services run on NVIDIA DGX Cloud, which is accessible via a browser. They are currently available to early-access customers and are in the private preview stage. Developers can use these models offered on each service through simple APIs and when the models are ready for deployment, enterprises can run inference workloads at scale using the NVIDIA AI Foundations cloud services.

Industry Leaders team up with NVIDIA to advance productivity for creative professionals

Adobe, Getty Images, Shutterstock, and Morningstar are among the companies creating AI models, applications, and services with the newly announced NVIDIA AI Foundations.

Adobe today announced that it will expand its longstanding research and development partnership with NVIDIA to create next-generation generative AI models. To accelerate the workflows of creators and marketers, some of these models will be jointly developed and brought to market through Adobe Creative Cloud flagship products like Photoshop, Premiere Pro and After Effects, as well as through NVIDIA Picasso.

NVIDIA and Getty Images are collaborating to train responsible generative text-to-image and text-to-video foundation models. The models will allow the creation of images and video using simple text prompts and will be trained on Getty Images’ fully licensed assets.

NVIDIA and Shutterstock are collaborating to train a generative text-to-3D foundation model using the NVIDIA Picasso service to simplify the creation of detailed 3D models and reduce the time required to build 3D models from days to minutes.

New GPUs power Inference Platforms to tackle various Generative AI workloads

To augment its push to help create new and emerging custom generative AI models via the NVIDIA AI Foundations cloud services, NVIDIA has also launched a slew of new GPUs and platforms, based on the NVIDIA Ada Lovelace, Hopper and Grace Hopper processors, to help developers build and power these new AI applications.

The rise of generative AI is requiring more powerful inference computing platforms. The number of applications for generative AI is infinite, limited only by human imagination. Arming developers with the most powerful and flexible inference computing platform will accelerate the creation of new services that will improve our lives in ways not yet imaginable. – Jensen Huang, Founder and CEO of NVIDIA.

1) NVIDIA L4 for AI Video

The new NVIDIA L4 is the direct replacement for the popular T4 GPU, which was equipped with Tensor Cores and designed expressly for AI inferencing workloads, that is, analyzing novel data inputs to predict and estimate a desired outcome based on pre-trained models.

The T4 was powered by the Turing microarchitecture, which was the first to support and accelerate ray-traced workloads. The new L4, based on the Ada Lovelace GPU architecture (which also powers the GeForce RTX 40 series) and supporting AI-powered DLSS 3, is rated to deliver over 4x the real-time rendering performance in Omniverse, and is able to dish out 3x higher ray-traced performance.

With this enhanced throughput, the L4 GPU is positioned for AI video workloads, tackling real-time video decoding, transcoding, video content moderation, language translation, and video call enhancement features such as background replacement, relighting, eye contact, augmented reality and more. The new GPU’s dual AV1 encoders are another reason why the L4 is well suited to these AI video tasks. In fact, a single 8-GPU L4 server can replace over a hundred traditional dual-socket CPU servers in processing AI video, a massive saving in total cost of ownership over older infrastructure.

Better yet, the L4 uses the same low-profile form factor and a similar 72W power envelope, which makes upgrading existing T4-powered servers to the L4 a breeze while improving AI inferencing prowess by a good margin.

Graphics Card	L4	T4
GPU	Ada Lovelace	Turing (TU104)
Process	4nm (TSMC)	12nm FinFET (TSMC)
CUDA cores	TBD	2560
Tensor Cores	Yes (4th Gen)	320 (2nd Gen)
Tensor Performance¹ (FP16)	242 TFLOPS	65 TFLOPS
RT Cores	Yes (Gen 3)	40 (Gen 1)
RT Performance	2x of T4	TBD
GPU base / boost clock speeds	795MHz / 2040MHz	585MHz / 1590MHz
Memory	24GB GDDR6 with ECC	16GB GDDR6 with ECC
Memory clock speed	6,251MHz	5,000MHz
Memory bus width	192-bit	256-bit
Memory bandwidth	300GB/s	320GB/s
Interface	PCIe 4.0 x16	PCIe 3.0 x16
Form Factor	1-slot, Low Profile	1-slot, Low Profile
TDP	72W	70W

1. Effective Tensor performance with and without using the Sparsity feature.
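As a rough sanity check on the table, the memory bandwidth figures can be reproduced from the bus width and memory clock rows. This is only a sketch: it assumes the listed memory clock corresponds to two data transfers per clock per pin, which is an assumption chosen because it matches the table's own numbers, not something NVIDIA states here. The function name is made up for illustration.

```python
def peak_bandwidth_gbs(bus_width_bits: int, memory_clock_mhz: float) -> float:
    """Peak memory bandwidth in GB/s.

    Assumption: two data transfers per listed memory clock per pin.
    """
    bytes_per_transfer = bus_width_bits / 8          # bus width in bytes
    transfers_per_second = memory_clock_mhz * 1e6 * 2
    return bytes_per_transfer * transfers_per_second / 1e9


# Values copied from the table above.
l4_bandwidth = peak_bandwidth_gbs(192, 6251)
t4_bandwidth = peak_bandwidth_gbs(256, 5000)

# Generation-over-generation ratios, taking the table values at face value.
tensor_speedup = 242 / 65                       # FP16 Tensor TFLOPS, L4 vs T4
tflops_per_watt_gain = (242 / 72) / (65 / 70)   # same, normalised by TDP

print(f"L4 bandwidth: {l4_bandwidth:.0f} GB/s, T4 bandwidth: {t4_bandwidth:.0f} GB/s")
print(f"Tensor speedup: {tensor_speedup:.1f}x, per-watt gain: {tflops_per_watt_gain:.1f}x")
```

Under that assumption the results land on the 300GB/s and 320GB/s listed above, and the L4's FP16 Tensor throughput works out to roughly 3.7x that of the T4 at nearly the same TDP.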

In fact, one of the first deployments of the NVIDIA L4 is in Google Cloud, which offers it in its G2 family of Compute Engine VMs. These promise significant performance improvements in HPC, graphics and video transcoding, as well as better performance per dollar for AI inferencing in the cloud, to tackle the explosive field of generative AI.

2) L40 for Image Generation

The L40 was actually announced in 2022, but it wasn’t until recently that it saw some action. Based on the Ada Lovelace GPU with over 18,000 CUDA cores and 142 RT Cores, the L40 packs quite a punch, with specs that place it well ahead of the RTX 4090. But unlike the RTX 4090, which is optimized for high clock speeds, rasterization and ray-traced performance with active cooling and a higher power budget, the L40 is a passively cooled design with a 300W TDP that is meant to take advantage of the airflow paths designed into rack servers.

Graphics Card	L40	RTX 6000 Ada Generation	RTX 4090	A40
Class	Data Centre	Professional	Consumer	Data Centre
GPU	Ada Lovelace (AD102)	Ada Lovelace (AD102)	Ada Lovelace (AD102)	Ampere (GA102)
Process	4nm (TSMC)	4nm (TSMC)	4nm (TSMC)	8nm (Samsung)
Transistors	76 billion	76 billion	76 billion	28 billion
Streaming Multi-processors (SM)	142	142	128	84
CUDA cores	18176	18176	16384	10752
Tensor Cores	568 (Gen 4)	568 (Gen 4)	512 (Gen 4)	336 (Gen 3)
Tensor Performance¹ (FP16)	362.1 TFLOPS	TBD	TBD	299.4 TFLOPS
RT Cores	142 (Gen 3)	142 (Gen 3)	128 (Gen 3)	84 (Gen 2)
RT Performance	209 TFLOPS	210 TFLOPS	TBD	58 - 75.62 TFLOPS
GPU base / boost clocks (MHz)	735 / 2490	TBD	2230 / 2520	1305 / 1740
Memory	48GB GDDR6 with ECC	48GB GDDR6 with ECC	24GB GDDR6X	48GB GDDR6 with ECC
Memory bus width	384-bit	384-bit	384-bit	384-bit
Memory bandwidth	864GB/s	960GB/s	1,018GB/s	696GB/s
Interface	PCIe 4.0 x16	PCIe 4.0 x16	PCIe 4.0 x16	PCIe 4.0 x16
NVLink	No	No	No	Yes
TDP	300W	300W	450W	300W
Price (at launch)	--	US$6,800	US$1,599	--

1. Effective Tensor performance with and without using the Sparsity feature.

2. Peak rates based on GPU Boost Clock.
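The SM, CUDA core, Tensor Core and RT Core rows in the table are not independent: each core count is a fixed multiple of the SM count. The short sketch below rebuilds those rows from the SM counts alone. The per-SM ratios (128 CUDA cores, four Tensor Cores and one RT Core) are an assumption inferred from the table's own figures, and the names used are purely illustrative.

```python
# Per-SM ratios inferred from the table (assumption, see lead-in).
CUDA_CORES_PER_SM = 128
TENSOR_CORES_PER_SM = 4
RT_CORES_PER_SM = 1

# Streaming Multiprocessor counts copied from the table above.
sm_counts = {"L40": 142, "RTX 6000 Ada": 142, "RTX 4090": 128, "A40": 84}


def derive_core_counts(sm_count: int) -> dict:
    """Derive the core counts for a GPU from its SM count."""
    return {
        "cuda": sm_count * CUDA_CORES_PER_SM,
        "tensor": sm_count * TENSOR_CORES_PER_SM,
        "rt": sm_count * RT_CORES_PER_SM,
    }


derived = {gpu: derive_core_counts(sms) for gpu, sms in sm_counts.items()}

for gpu, cores in derived.items():
    print(f"{gpu}: {cores['cuda']} CUDA, {cores['tensor']} Tensor, {cores['rt']} RT")
```

Seen this way, the L40's lead over the RTX 4090 on paper comes down to 14 extra enabled SMs on the same AD102 die.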

The L40 also packs 48GB of GDDR6 memory with ECC, perfect for Omniverse Enterprise, rendering, 3D graphics, NVIDIA RTX virtual workstation, AI training and data science. In fact, it’s the backbone of the NVIDIA OVX Server that’s meant for building large-scale Omniverse digital twins.

3) H100 NVL for large language model (LLM) deployment

Note the dual-card, NVLink'ed H100 NVL pair; four such pairs are shown in this server for illustration. (Image source: NVIDIA)

The Hopper-based H100 is an awesome product focused on data center AI acceleration: it forgoes RT Cores and packs a far speedier memory interface to connect with HBM memory. As fast as the H100 is, NVIDIA is aware that it needs to do more to power generative AI services like ChatGPT at scale. At GTC 2023, NVIDIA announced the H100 NVL, a pair of PCIe cards that are NVLink’ed to each other. To set it apart from a pair of existing H100 PCIe cards (which pack 80GB of memory each), the new H100 NVL packs 94GB per card for a grand total of 188GB of graphics memory and 7.8TB/s of graphics memory bandwidth. Additionally, the GPU configuration of the H100 NVL is identical to that of the H100 SXM SKU, so the H100 NVL is much faster than the H100 PCIe, even if the latter were NVLink’ed.

Graphics Card	H100 NVL	H100 SXM5	H100 PCIe
GPU	Hopper (GH100) x 2	Hopper (GH100)	Hopper (GH100)
Process	4N (TSMC)	4N (TSMC)	4N (TSMC)
FP32 performance	134 TFLOPS	67 TFLOPS	51 TFLOPS
FP16 Tensor Performance¹	3,958 TFLOPS	1,979 TFLOPS	1,513 TFLOPS
GPU boost clock speeds	TBD	TBD	TBD
GPU Memory	188GB HBM3 (94GB x 2)	80GB HBM3	80GB HBM2e
Memory clock speed	TBD	TBD	TBD
Memory bus width	TBD	5120-bit	5120-bit
Memory bandwidth	7.8TB/s	3.35TB/s	2TB/s
Interconnect	3rd-gen NVLink Bridge (600GB/s) + PCIe 5.0	4th-gen NVLink (900GB/s) + PCIe 5.0	3rd-gen NVLink (600GB/s) + PCIe 5.0
GPU board form factor	Dual PCIe 5.0, air-cooled	SXM5	PCIe 5.0, air-cooled
TDP	2x 350-400W (configurable)	700W	350W
Price	--	--	--

1. Effective Tensor performance with and without using the Sparsity feature.
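Because the H100 NVL column lists totals for the two-card pair, it helps to halve those figures before comparing them with the single-GPU columns. The snippet below is a small illustrative calculation using only numbers from the table; the dictionary keys are made up for this example.

```python
# Figures copied from the table above. The H100 NVL values are two-card totals.
h100_nvl_pair = {
    "fp32_tflops": 134,
    "fp16_tensor_tflops": 3958,
    "memory_gb": 188,
    "bandwidth_tbs": 7.8,
}
h100_sxm5 = {
    "fp32_tflops": 67,
    "fp16_tensor_tflops": 1979,
    "memory_gb": 80,
    "bandwidth_tbs": 3.35,
}

# Halve the pair totals to get per-card figures.
per_card = {name: value / 2 for name, value in h100_nvl_pair.items()}

for name, value in per_card.items():
    print(f"{name}: {value:g} per H100 NVL card vs {h100_sxm5[name]:g} on H100 SXM5")
```

Per card, the compute figures line up exactly with the H100 SXM5 column, while memory capacity (94GB vs 80GB) and bandwidth (3.9TB/s vs 3.35TB/s) come out higher.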

According to NVIDIA, a server equipped with four H100 NVL pairs is over 10x faster than an HGX A100 server (eight A100 SXM GPUs) at processing GPT-3. That’s a phenomenal increase in language model processing.

4) NVIDIA Grace Hopper for Recommendation Models

Lastly, NVIDIA also has the Grace Hopper superchip to process giant data sets in AI databases and graph recommendation models. The module’s super-fast, low-latency chip-to-chip NVLink-C2C interconnect provides over 900GB/s of bandwidth between the Arm-based Grace CPU and the Hopper GPU. This allows a giant query to be processed on the CPU and then be transferred immediately to the Hopper GPU for inference over a link that’s more than seven times faster than PCI Express 5.0.
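The "seven times faster than PCI Express 5.0" comparison can be reproduced with simple arithmetic. This sketch assumes the baseline is a PCIe 5.0 x16 link at 32GT/s per lane, counted in both directions and ignoring encoding overhead; the article does not spell out the baseline, so treat that as an assumption.

```python
PCIE5_GT_PER_LANE = 32   # PCIe 5.0 raw signalling rate per lane, per direction
LANES = 16               # assumed x16 link
NVLINK_C2C_GBS = 900     # NVLink-C2C bandwidth quoted in the article


def pcie_bidirectional_gbs(gt_per_lane: float, lanes: int) -> float:
    """Nominal bidirectional bandwidth in GB/s, ignoring encoding overhead."""
    per_direction = gt_per_lane * lanes / 8   # one bit per transfer per lane
    return per_direction * 2


pcie5_x16 = pcie_bidirectional_gbs(PCIE5_GT_PER_LANE, LANES)
speedup = NVLINK_C2C_GBS / pcie5_x16

print(f"PCIe 5.0 x16: {pcie5_x16:.0f} GB/s, NVLink-C2C advantage: {speedup:.2f}x")
```

With those assumptions, a PCIe 5.0 x16 link tops out at a nominal 128GB/s, which puts the 900GB/s NVLink-C2C figure at just over seven times that.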
