NVIDIA unveils Tesla V100 to target AI acceleration with crazy 5,120 CUDA-core based Volta GPU
NVIDIA Tesla V100 Volta-based GPU first to target AI acceleration with 12x increase in deep learning performance
If the Tesla P100 that debuted at GTC 2016 wasn’t already a monster GPU targeted at the datacenter, this year’s debut, the Tesla V100, with its 21 billion transistors and 5,120 CUDA cores built on NVIDIA’s new Volta GPU architecture, will surely send your jaw to the floor! The Tesla V100 is the culmination of NVIDIA's US$3 billion investment and its commitment to the burgeoning interest in AI and deep learning.
Why do we need such a crazy GPU?
We are now at the juncture that NVIDIA calls the Big Bang of Modern AI. The systems (AI or deep learning algorithms) in place today can identify and classify objects in photos from raw data on their own, perceive their surroundings through raw sensor data, recognize speech (to a certain degree), caption videos automatically, let robots learn through computer vision, and handle some level of natural language translation, and the list goes on.
Having said that, AI-enabled systems aren’t pervasive enough, and we’ve yet to touch on the Smart Cities and AI Cities that tech giants like NVIDIA want to realize. To move ahead, we need to be able to tackle far larger and more complex neural network models, and to do so in a timely manner. Only then can we unlock the next class of AI services. The insurance and financial markets are a great example where AI could revolutionize several processes, but that will definitely require very complex models.
This is why the Tesla V100 and its Volta GPU architecture were made: to address the rapid growth of and interest in AI, process new and larger deep learning models, support the explosive growth of GPU cloud computing, and help make AI Cities a reality.
Highlights of the Tesla V100 and Volta GPU architecture
The Tesla V100 is NVIDIA’s first GPU product to feature the Volta architecture and is the first to be fabricated on TSMC’s 12nm process technology that’s customized for NVIDIA. Many would have expected NVIDIA to make the jump to a 10nm process, but when asked, the company said it had to choose the best possible process available given the GPU’s complex design, manufacturability, yield and more. Indeed, those are important factors because the GV100 GPU (the codename for the part used on the Tesla V100) packs a staggering 21.1 billion transistors into a massive 815mm² die. On the upside, the Tesla V100 packs a whole lot more compute power and new features than its predecessor, the Tesla P100, while maintaining the same 300W TDP. As such, it delivers exceptional performance per watt.
Some of the key features are as follows:-
- New Streaming Multiprocessor (SM) architecture optimized for deep learning: The new Volta SM is about 50% more energy efficient than the previous Pascal SM to enable Volta to deliver a far higher performance throughput at the same power draw. Besides the new manufacturing process, several design advancements contributed to this such as:-
  - Independent thread scheduling that provides more flexible programming and execution capabilities
  - New combined L1 data cache and shared memory subsystem for higher performance and lower latency
  - Streamlined instruction set for simpler decoding and reduced instruction latencies
  - Separate FP32 and INT32 cores that allow the Volta SM to execute both types of instructions simultaneously (previously not possible)
  - New Tensor Cores targeted at deep learning workloads (more on this later)
  - Higher core clocks and higher power efficiency
The Tesla V100 packs a grand total of 80 SMs, 40 TPCs and 5,120 FP32 cores. This compares very favorably with the already impressive 56 SMs, 28 TPCs and 3,584 FP32 cores on the Tesla P100.
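To put those totals in perspective, here is a quick back-of-the-envelope sketch in Python that derives the per-SM and per-TPC ratios from the figures quoted above. The dictionary layout and the helper names `per_unit_ratios` and `relative_increase` are our own, made up purely for illustration.

```python
# Figures quoted in the article for each GPU.
GPUS = {
    "Tesla V100": {"sms": 80, "tpcs": 40, "fp32_cores": 5120},
    "Tesla P100": {"sms": 56, "tpcs": 28, "fp32_cores": 3584},
}


def per_unit_ratios(spec):
    """Return (FP32 cores per SM, SMs per TPC) for one GPU spec."""
    return spec["fp32_cores"] // spec["sms"], spec["sms"] // spec["tpcs"]


def relative_increase(new, old):
    """Fractional increase of new over old, e.g. 0.25 means 25% more."""
    return (new - old) / old


for name, spec in GPUS.items():
    cores_per_sm, sms_per_tpc = per_unit_ratios(spec)
    print(f"{name}: {cores_per_sm} FP32 cores per SM, {sms_per_tpc} SMs per TPC")

gain = relative_increase(
    GPUS["Tesla V100"]["fp32_cores"], GPUS["Tesla P100"]["fp32_cores"]
)
print(f"V100 has {gain:.0%} more FP32 cores than P100")
```

Both chips work out to the same 64 FP32 cores per SM and 2 SMs per TPC, so going by these numbers, the jump in core count comes from fitting more SMs on the die rather than from making each SM wider.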
New mixed-precision FP16/FP32 Tensor Cores designed specifically for deep learning deliver up to 12x higher peak TFLOPS for training. The Tesla V100 GPU contains 640 Tensor Cores, 8 per SM. In total, the Tensor Cores deliver up to 120 Tensor TFLOPS for training and inference applications. This is a whole lot more than the 10.6 peak FP32 TFLOPS possible on the Tesla P100.
Tensor Cores excel at very fast matrix multiplication, which benefits AI and deep learning applications tremendously.
Fortunately, the Tensor Cores aren’t reserved for these applications alone; the updated CUDA framework can use the Tensor Cores for other purposes as it sees fit, so software doesn’t have to be explicitly programmed to target them.
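For readers wondering what "mixed-precision FP16/FP32" matrix multiplication means in practice, the following is a rough software sketch of the idea, not NVIDIA's hardware implementation. It assumes the commonly described pattern of a fused multiply-accumulate D = A x B + C, where A and B are stored in FP16 and the running sum is kept in FP32. The function names are ours, and the exact rounding behavior inside a real Tensor Core is not something the article covers.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mixed_precision_mma(a, b, c):
    """Compute D = A x B + C with FP16 inputs (A, B) and FP32 accumulation.

    a, b and c are square matrices (lists of lists) of the same size.
    This is an illustrative emulation only; accumulation order and
    rounding in real hardware may differ.
    """
    n = len(a)
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])  # accumulator starts from C, held in FP32
            for k in range(n):
                acc = to_fp32(acc + a16[i][k] * b16[k][j])
            d[i][j] = acc
    return d


# Usage: a 4x4 example where FP32 accumulation matters.
ones = [[1.0] * 4 for _ in range(4)]
big = [[2048.0] * 4 for _ in range(4)]
d = mixed_precision_mma(ones, ones, big)
print(d[0][0])
```

The usage example hints at why the accumulator is kept in FP32: half precision cannot represent 2049, so a pure FP16 running sum starting at 2048 would round each added 1 away, whereas the FP32 accumulator keeps every contribution.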
Second-Generation NVLink: Delivers twice the throughput of first-generation NVLink, rated for up to 300GB/s, for greater scaling and higher application performance.
Faster, higher efficiency HBM2 memory: Volta’s highly tuned 16GB HBM2 memory subsystem delivers 900 GB/sec peak memory bandwidth and, according to NVIDIA, about 1.5x the delivered memory bandwidth of Pascal GP100.
Volta Optimized Software: New versions of popular deep learning frameworks and GPU accelerated libraries leverage the new features of the Volta GV100 architecture to deliver higher performance for both deep learning and High Performance Computing (HPC) applications.
When is it available?
Volta-based Tesla products from OEMs are slated to come to market in Q4 2017. NVIDIA’s very own supercomputer, the DGX-1, will be the first to sport the Tesla V100, from Q3 this year.
Meanwhile, NVIDIA CEO Jen-Hsun Huang did mention that customers who purchase a DGX-1 now with the Pascal-based Tesla P100 will receive a free upgrade to the Tesla V100 GPUs when they become available. After all, that’s the least the company could offer after debuting a successor and raising the price of the DGX-1 to an eye-watering US$149,000.