NVIDIA goes big to support Generative AI with cloud services and inference platforms for every workload
Why is Generative AI all the fuss now?
Generative AI tools like Bing AI and ChatGPT have recently come into the spotlight for using advanced algorithms to generate new data, visuals and more that look, feel and read as if they were produced by humans. Generative Adversarial Networks (GANs) are well suited to creating visual content, while Generative Pre-trained Transformer (GPT) language models draw on data already available on the internet (or other proprietary data sets supplied) to generate an output, from an answer to a query all the way to entire ‘new’ articles. The use of AI for these tasks isn’t new, but breakthroughs in how it understands queries and produces output that is far more usable, legible and relatable to the average user are what make generative AI a potent tool.
The concerns are plenty and the technology is controversial, but these are still early days. Even so, the technology is ideal for creating more data to train and improve the various models that can fast-track menial or mundane tasks requiring some form of inferencing to take the next best step forward. It could even help make fully autonomous cars a reality: an autonomous driving simulation model can be continually trained on endless varieties of new environmental data generated virtually, building an ever more solid pre-trained model.
Here's more reading on this subject from global consulting firm McKinsey & Company, including insights on the various industries that stand to gain from generative AI.
(And no, this article wasn't churned out with generative AI.)
Enter NVIDIA AI Foundations: Enabling companies to create in-house custom generative AI models
Now that we know why generative AI is so valuable and important, that brings us to NVIDIA’s big push to support enterprises with cloud services for creating their own customized large language models (LLMs, of which ChatGPT is a prime example) and visual generation models for AI applications. More specifically, these custom generative AI models are developed and trained with the company’s own proprietary data for their unique domain-specific offerings.
This is made possible with NVIDIA AI Foundations, which is a set of cloud services to enable businesses to build, refine and operate such LLMs and generative AI models.
- NVIDIA NeMo cloud service enables developers to make large language models (LLMs) more relevant for businesses by defining areas of focus, adding domain-specific knowledge and teaching functional skills.
- NVIDIA Picasso is a cloud service for building and deploying generative AI-powered image, video and 3D applications with advanced text-to-image, text-to-video and text-to-3D capabilities to supercharge productivity for creativity, design and digital simulation through simple cloud APIs.
- NVIDIA BioNeMo is a new cloud service that debuted today to accelerate life science research, drug discovery, protein engineering and research in the fields of genomics, biology, chemistry and molecular dynamics.
These services run on NVIDIA DGX Cloud, which is accessible via a browser. They are currently available to early-access customers and are in the private preview stage. Developers can use the models offered on each service through simple APIs, and when the models are ready for deployment, enterprises can run inference workloads at scale using the NVIDIA AI Foundations cloud services.
Industry Leaders team up with NVIDIA to advance productivity for creative professionals
Adobe today announced that it will expand its longstanding research and development partnership with NVIDIA to create the next generation of generative AI models. To accelerate the workflows of creators and marketers, some of these models will be jointly developed and brought to market through Adobe Creative Cloud flagship products like Photoshop, Premiere Pro and After Effects, as well as through NVIDIA Picasso.
NVIDIA and Getty Images are collaborating to train responsible generative text-to-image and text-to-video foundation models. The models will allow the creation of images and video using simple text prompts and will be trained on Getty Images’ fully licensed assets.
NVIDIA and Shutterstock are collaborating to train a generative text-to-3D foundation model using the NVIDIA Picasso service to simplify the creation of detailed 3D models and reduce the time required to build 3D models from days to minutes.
New GPUs power Inference Platforms to tackle various Generative AI workloads
To augment its push to help create new and emerging custom generative AI models via the NVIDIA AI Foundations cloud services, NVIDIA has also launched a slew of new GPUs and platforms based on the NVIDIA Ada Lovelace, Hopper and Grace Hopper architectures to help developers build and power these new AI applications.
The rise of generative AI is requiring more powerful inference computing platforms. The number of applications for generative AI is infinite, limited only by human imagination. Arming developers with the most powerful and flexible inference computing platform will accelerate the creation of new services that will improve our lives in ways not yet imaginable. – Jensen Huang, Founder and CEO of NVIDIA.
1) NVIDIA L4 for AI Video
The new NVIDIA L4 is the direct replacement for the popular T4 GPU, which uses Tensor Cores and was designed expressly for AI inferencing workloads, where novel data inputs are analyzed to predict and estimate a desired outcome based on pre-trained models.
The T4 was powered by the Turing microarchitecture, which was the first to support and accelerate ray-traced workloads. The new L4 is based on the Ada Lovelace GPU architecture (this is what powers the GeForce RTX 40 series) and supports AI-powered DLSS 3. It is rated to deliver over 4x higher real-time rendering performance in Omniverse, and is able to dish out 3x higher ray-traced performance.
With this enhanced throughput, the L4 GPU is positioned for AI video workloads, tackling real-time video decoding, transcoding, video content moderation, language translation, video call enhancement features such as background replacement, relighting and eye contact, augmented reality and more. The new GPU’s dual AV1 encoders are another reason why the L4 is ideal for these AI video tasks. In fact, a single 8-GPU L4 server can replace over a hundred traditional dual-socket CPU servers in processing AI video. This is a massive saving in total cost of ownership over older infrastructures.
Better yet, the L4 comes in the same low-profile form factor as the T4 and has a similar power envelope (72W versus 70W), which makes upgrading existing T4-powered servers with an L4 a breeze while improving AI inferencing prowess by a good margin.
Graphics Card | L4 | T4 |
---|---|---|
GPU | Ada Lovelace | Turing (TU104) |
Process | 4nm (TSMC) | 12nm FinFET (TSMC) |
CUDA cores | TBD | 2560 |
Tensor Cores | Yes (4th Gen) | 320 (2nd Gen) |
Tensor Performance¹ (FP16) | 242 TFLOPS | 65 TFLOPS |
RT Cores | Yes (Gen 3) | 40 (Gen 1) |
RT Performance | 2x of T4 | TBD |
GPU base / boost clock speeds | 795MHz / 2040MHz | 585MHz / 1590MHz |
Memory | 24GB GDDR6 with ECC | 16GB GDDR6 with ECC |
Memory clock speed | 6,251MHz | 5,000MHz |
Memory bus width | 192-bit | 256-bit |
Memory bandwidth | 300GB/s | 320GB/s |
Interface | PCIe 4.0 x16 | PCIe 3.0 x16 |
Form Factor | 1-slot, Low Profile | 1-slot, Low Profile |
TDP | 72W | 70W |
1. Effective Tensor performance with and without using the Sparsity feature.
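The memory rows in the table above can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming the listed memory clock is half the effective per-pin data rate (the usual convention for how GDDR6 clocks are quoted, but an assumption here, not something the table states). The helper names are made up for this illustration. It also divides the FP16 Tensor figure by TDP as a rough efficiency indicator.

```python
# Back-of-envelope check of the L4 vs T4 table rows.
# Assumption: effective per-pin data rate (Mbps) = listed memory clock (MHz) x 2.

def memory_bandwidth_gbs(memory_clock_mhz: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s from memory clock and bus width."""
    data_rate_mbps = memory_clock_mhz * 2           # per-pin data rate
    return data_rate_mbps * bus_width_bits / 8 / 1000  # bits -> bytes, MB -> GB

def tflops_per_watt(tflops: float, tdp_watts: float) -> float:
    """Rated Tensor throughput per watt of TDP (a crude efficiency figure)."""
    return tflops / tdp_watts

# Values copied from the table above.
CARDS = {
    "L4": {"clock_mhz": 6251, "bus_bits": 192, "fp16_tflops": 242, "tdp_w": 72},
    "T4": {"clock_mhz": 5000, "bus_bits": 256, "fp16_tflops": 65, "tdp_w": 70},
}

for name, c in CARDS.items():
    bw = memory_bandwidth_gbs(c["clock_mhz"], c["bus_bits"])
    eff = tflops_per_watt(c["fp16_tflops"], c["tdp_w"])
    print(f"{name}: {bw:.0f} GB/s, {eff:.2f} TFLOPS/W")
```

Under that assumption the computed bandwidths land on the table's 300GB/s and 320GB/s figures, which shows that the L4's narrower 192-bit bus is mostly offset by its faster memory.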
In fact, one of the first deployments of the NVIDIA L4 is in Google Cloud, which offers it as part of its G2 family of Compute Engine VMs. Google is promising significant performance improvements in HPC, graphics and video transcoding, as well as better performance per dollar for AI inferencing in the cloud to tackle the explosive field of generative AI.
2) L40 for Image Generation
The L40 was actually announced in 2022, but it wasn’t until recently that it saw some action. Based on the Ada Lovelace AD102 GPU with over 18,000 CUDA cores and 142 RT Cores, the L40 packs quite a punch, as these specs place it ahead of what the RTX 4090 packs. But unlike the RTX 4090, which is optimized for high clock speeds, rasterization and ray-traced performance with active cooling and a higher power budget, the L40 is a passively cooled design with a 300W TDP that is meant to take advantage of the airflow paths designed within rack servers.
Graphics Card | L40 | RTX 6000 Ada Generation | RTX 4090 | A40 |
---|---|---|---|---|
Class | Data Centre | Professional | Consumer | Data Centre |
GPU | Ada Lovelace (AD102) | Ada Lovelace (AD102) | Ada Lovelace (AD102) | Ampere (GA102) |
Process | 4nm (TSMC) | 4nm (TSMC) | 4nm (TSMC) | 8nm (Samsung) |
Transistors | 76 billion | 76 billion | 76 billion | 28 billion |
Streaming Multi-processors (SM) | 142 | 142 | 128 | 84 |
CUDA cores | 18176 | 18176 | 16384 | 10752 |
Tensor Cores | 568 (Gen 4) | 568 (Gen 4) | 512 (Gen 4) | 336 (Gen 3) |
Tensor Performance¹ (FP16) | 362.1 TFLOPS | TBD | TBD | 299.4 TFLOPS |
RT Cores | 142 (Gen 3) | 142 (Gen 3) | 128 (Gen 3) | 84 (Gen 2) |
RT Performance | 209 TFLOPS | 210 TFLOPS | TBD | 58 - 75.62 TFLOPS |
GPU base / boost clocks (MHz) | 735 / 2490 | TBD | 2230 / 2520 | 1305 / 1740 |
Memory | 48GB GDDR6 with ECC | 48GB GDDR6 with ECC | 24GB GDDR6X | 48GB GDDR6 with ECC |
Memory bus width | 384-bit | 384-bit | 384-bit | 384-bit |
Memory bandwidth | 864GB/s | 960GB/s | 1,018GB/s | 696GB/s |
Interface | PCIe 4.0 x16 | PCIe 4.0 x16 | PCIe 4.0 x16 | PCIe 4.0 x16 |
NVLink | No | No | No | Yes |
TDP | 300W | 300W | 450W | 300W |
Price (at launch) | -- | US$6,800 | US$1,599 | -- |
1. Effective Tensor performance with and without using the Sparsity feature.
2. Peak rates based on GPU Boost Clock.
The L40 also packs 48GB of GDDR6 memory with ECC, perfect for Omniverse Enterprise, rendering, 3D graphics, NVIDIA RTX virtual workstation, AI training and data science. In fact, it’s the backbone of the NVIDIA OVX Server that’s meant for building large-scale Omniverse digital twins.
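The core counts in the L40 comparison table all follow from the number of Streaming Multiprocessors (SMs). Here is a small sketch that checks this, assuming the per-SM ratios implied by the table itself (128 CUDA cores, 4 Tensor Cores and 1 RT Core per SM). The function and constant names are made up for this illustration.

```python
# Per-SM ratios inferred from the table rows (an assumption of this sketch).
CUDA_CORES_PER_SM = 128
TENSOR_CORES_PER_SM = 4
RT_CORES_PER_SM = 1

def derived_counts(sm_count: int) -> dict:
    """Core counts implied by an SM count under the ratios above."""
    return {
        "cuda_cores": sm_count * CUDA_CORES_PER_SM,
        "tensor_cores": sm_count * TENSOR_CORES_PER_SM,
        "rt_cores": sm_count * RT_CORES_PER_SM,
    }

# (SMs, CUDA cores, Tensor Cores, RT Cores) as listed in the table.
TABLE = {
    "L40": (142, 18176, 568, 142),
    "RTX 6000 Ada Generation": (142, 18176, 568, 142),
    "RTX 4090": (128, 16384, 512, 128),
    "A40": (84, 10752, 336, 84),
}

for name, (sms, cuda, tensor, rt) in TABLE.items():
    d = derived_counts(sms)
    matches = (d["cuda_cores"], d["tensor_cores"], d["rt_cores"]) == (cuda, tensor, rt)
    print(f"{name}: {sms} SMs -> {d['cuda_cores']} CUDA cores (matches table: {matches})")
```

Seen this way, the L40's lead over the RTX 4090 in raw core counts comes down to 14 extra enabled SMs on the same AD102 silicon.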
3) H100 NVL for large language model (LLM) deployment
The Hopper-based H100 is an awesome product that’s focused on data center AI acceleration; it forgoes RT Cores and packs a far speedier memory interface to connect with HBM memory. As fast as the H100 is, NVIDIA is aware that it needs to do more to be the driver powering generative AI services like ChatGPT at scale. At GTC 2023, NVIDIA announced the H100 NVL, a pair of PCIe cards that are NVLink’ed to each other. To go one better than pairing two existing H100 PCIe cards (which pack 80GB of memory each), each card in the new H100 NVL packs 94GB, for a grand total of 188GB of graphics memory and 7.8TB/s of memory bandwidth. Additionally, the GPU configuration of the H100 NVL is identical to that of the H100 SXM SKU, so the H100 NVL is much faster than the H100 PCIe, even if the latter were to be NVLink’ed.
Graphics Card | H100 NVL | H100 SXM5 | H100 PCIe |
---|---|---|---|
GPU | Hopper (GH100) x 2 | Hopper (GH100) | Hopper (GH100) |
Process | 4N (TSMC) | 4N (TSMC) | 4N (TSMC) |
FP32 performance | 134 TFLOPS | 67 TFLOPS | 51 TFLOPS |
FP16 Tensor Performance¹ | 3,958 TFLOPS | 1,979 TFLOPS | 1,513 TFLOPS |
GPU boost clock speeds | TBD | TBD | TBD |
GPU Memory | 188GB HBM3 (94GB x 2) | 80GB HBM3 | 80GB HBM2e |
Memory clock speed | TBD | TBD | TBD |
Memory bus width | TBD | 5120-bit | 5120-bit |
Memory bandwidth | 7.8TB/s | 3.35TB/s | 2TB/s |
Interconnect | 3rd-gen NVLink Bridge (600GB/s) + PCIe 5.0 | 4th-gen NVLink (900GB/s) + PCIe 5.0 | 3rd-gen NVLink (600GB/s) + PCIe 5.0 |
GPU board form factor | Dual PCIe 5.0, air-cooled | SXM5 | PCIe 5.0, air-cooled |
TDP | 2x 350-400W (configurable) | 700W | 350W |
Price | -- | -- | -- |
1. Effective Tensor performance with and without using the Sparsity feature.
According to NVIDIA, a server equipped with four H100 NVL is over 10x faster than an HGX A100 server (eight A100 SXM) at processing GPT-3. That’s a phenomenal increase in language model processing.
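To get a feel for why the extra memory matters for LLM deployment, here is a rough sketch of how much memory a model's weights alone occupy. It assumes the commonly cited figure of 175 billion parameters for GPT-3 (not stated in this article) and ignores activations, caches and framework overhead, so real deployments need more headroom than this suggests. The helper name is made up for this illustration.

```python
# Rough weights-only memory estimate; real serving needs extra headroom.
GPT3_PARAMS = 175e9          # assumption: commonly cited GPT-3 parameter count
H100_NVL_PAIR_GB = 94 * 2    # from the article: 94GB per card, two cards
H100_PCIE_PAIR_GB = 80 * 2   # from the article: 80GB per H100 PCIe card

def weights_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory taken by model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

for precision, nbytes in {"16-bit": 2, "8-bit": 1}.items():
    need = weights_gb(GPT3_PARAMS, nbytes)
    print(
        f"{precision}: {need:.0f} GB of weights | "
        f"fits in H100 NVL pair ({H100_NVL_PAIR_GB} GB): {need <= H100_NVL_PAIR_GB} | "
        f"fits in two H100 PCIe ({H100_PCIE_PAIR_GB} GB): {need <= H100_PCIE_PAIR_GB}"
    )
```

Under these assumptions, 8-bit weights for a model of that size would squeeze into a single H100 NVL pair but not into two 80GB cards, while 16-bit weights would need to be spread across more GPUs either way.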
4) NVIDIA Grace Hopper for Recommendation Models
Lastly, NVIDIA also has the Grace Hopper Superchip to process giant data sets in AI databases and graph recommendation models. The module’s super-fast, low-latency NVLink-C2C chip-to-chip link enables over 900GB/s of interconnect bandwidth between the Arm-based Grace CPU and the Hopper GPU, which is over seven times faster than PCI Express 5.0. This allows a giant query to be processed on the CPU and then be immediately transferred over to the Hopper GPU for inference processing.
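The "seven times faster" comparison can be reproduced with some quick arithmetic. This is a sketch under stated assumptions: the PCIe 5.0 link is a full x16 slot, the 900GB/s NVLink-C2C figure is a total across both directions, and the 100GB payload is a purely hypothetical example size. The function name is made up for this illustration.

```python
# Compare NVLink-C2C (figure from the article) with a PCIe 5.0 x16 link.
NVLINK_C2C_GBS = 900  # assumption: total bandwidth across both directions

def pcie_bandwidth_gbs(gt_per_s: float = 32.0, lanes: int = 16,
                       encoding: float = 128 / 130,
                       bidirectional: bool = True) -> float:
    """Usable PCIe bandwidth in GB/s.

    PCIe 5.0 signals at 32 GT/s per lane with 128b/130b line encoding.
    """
    per_direction = gt_per_s * lanes * encoding / 8  # bits -> bytes
    return per_direction * (2 if bidirectional else 1)

pcie = pcie_bandwidth_gbs()
ratio = NVLINK_C2C_GBS / pcie
print(f"PCIe 5.0 x16: {pcie:.0f} GB/s, NVLink-C2C: {NVLINK_C2C_GBS} GB/s, ratio {ratio:.1f}x")

# Hypothetical example: time to move a 100GB working set at peak rate.
payload_gb = 100
print(f"{payload_gb} GB over PCIe: {payload_gb / pcie:.2f} s, "
      f"over NVLink-C2C: {payload_gb / NVLINK_C2C_GBS:.2f} s")
```

These are peak link rates, so real transfers will be slower on both sides, but the ratio is what matters when large queries are handed from the CPU to the GPU.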