Subject: Processors | October 1, 2016 - 06:11 PM | Tim Verry
Tagged: xavier, Volta, tegra, SoC, nvidia, machine learning, gpu, drive px 2, deep neural network, deep learning
Earlier this week at its first GTC Europe event in Amsterdam, NVIDIA CEO Jen-Hsun Huang teased a new SoC code-named Xavier that will be used in self-driving cars and feature the company's newest custom ARM CPU cores and Volta GPU. The new chip will begin sampling at the end of 2017 with product releases using the future Tegra (if they keep that name) processor as soon as 2018.
NVIDIA's Xavier is promised to be the successor to the company's Drive PX 2, a system which uses two Tegra X2 SoCs and two discrete Pascal MXM GPUs on a single water-cooled platform. These claims are even more impressive when considering that NVIDIA is promising not only to replace those four processors with a single chip, but to reportedly do it at 20W – less than a tenth of the Drive PX 2's TDP!
The company has not revealed all the nitty-gritty details, but it did tease out a few bits of information. The new processor will feature 7 billion transistors, will be built on a refined 16nm FinFET process, and will consume a mere 20W. It can process two 8K HDR video streams and can hit 20 TOPS (NVIDIA's own rating for INT8 deep learning operations).
Specifically, NVIDIA claims that the Xavier SoC will use eight custom ARMv8 (64-bit) CPU cores (it is unclear whether these cores will be a refined Denver architecture or something else) and a GPU based on its upcoming Volta architecture with 512 CUDA cores. Also, in an interesting twist, NVIDIA is including a "Computer Vision Accelerator" on the SoC as well, though the company did not go into many details. This bit of silicon may explain how the ~300mm2 die with 7 billion transistors is able to match the 7.2-billion-transistor Pascal-based Tesla P4 (2560 CUDA cores) graphics card at deep learning (tera-operations per second) tasks, on top of the incremental improvements from moving to Volta and a new ARMv8 CPU architecture on a refined 16nm FF+ process.
| | Drive PX | Drive PX 2 | NVIDIA Xavier | Tesla P4 |
|---|---|---|---|---|
| CPU | 2 x Tegra X1 (8 x A57 total) | 2 x Tegra X2 (8 x A57 + 4 x Denver total) | 1 x Xavier SoC (8 x Custom ARM + 1 x CVA) | N/A |
| GPU | 2 x Tegra X1 (Maxwell) (512 CUDA cores total) | 2 x Tegra X2 GPUs + 2 x Pascal GPUs | 1 x Xavier SoC GPU (Volta) (512 CUDA cores) | 2560 CUDA cores (Pascal) |
| TFLOPS | 2.3 TFLOPS | 8 TFLOPS | ? | 5.5 TFLOPS |
| DL TOPS | ? | 24 TOPS | 20 TOPS | 22 TOPS |
| TDP | ~30W (2 x 15W) | 250W | 20W | up to 75W |
| Process Tech | 20nm | 16nm FinFET | 16nm FinFET+ | 16nm FinFET |
| Transistors | ? | ? | 7 billion | 7.2 billion |
For comparison, the currently available Tesla P4, based on the Pascal architecture, has a TDP of up to 75W and is rated at 22 TOPS. This would suggest that Volta is a much more efficient architecture (at least for deep learning and lower-precision work)! I am not sure how NVIDIA is able to match its GP104 with only 512 Volta CUDA cores, though its definition of a "core" could have changed and/or the CVA processor may be responsible for closing that gap. Unfortunately, NVIDIA did not disclose a TFLOPS rating for Xavier, so it is difficult to compare, and it may not match GP104 at higher-precision workloads. It could be wholly optimized for INT8 operations rather than floating point performance. Beyond that, I will let Scott dive into those particulars once we have more information!
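The efficiency claim is easy to sanity-check from the rated numbers alone. A quick sketch, using only the TOPS and TDP figures NVIDIA has published (claims, not measurements):

```python
# Back-of-the-envelope efficiency comparison from NVIDIA's rated figures.
chips = {
    "Drive PX 2": {"tops": 24, "tdp_w": 250},
    "Xavier":     {"tops": 20, "tdp_w": 20},
    "Tesla P4":   {"tops": 22, "tdp_w": 75},
}

for name, spec in chips.items():
    efficiency = spec["tops"] / spec["tdp_w"]  # TOPS per watt
    print(f"{name}: {efficiency:.2f} TOPS/W")
```

If the claims hold, Xavier would deliver roughly 1.0 TOPS/W versus about 0.29 TOPS/W for the Tesla P4 – better than a 3x jump in deep learning perf-per-watt.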
Xavier is more of a teaser than anything, and the chip could very well change dramatically and/or miss the claimed performance targets. Still, it sounds promising, and it is always nice to speculate over road maps. It is an intriguing chip and I am ready for more details, especially on the Volta GPU and just what exactly that Computer Vision Accelerator is (and whether it will be easy to program for). I am a big fan of the self-driving car and I hope that it succeeds. The push certainly looks set to continue as Tesla, VW, BMW, and other automakers keep testing the limits of what is possible, planning future cars with smart driving assists and even cars that can drive themselves. The more local computing power we can throw at automobiles the better; while massive datacenters can be used to train the neural networks, local hardware to run them and make decisions is necessary (you don't want internet latency contributing to the decision of whether to brake or not!).
I hope that NVIDIA's self-proclaimed "AI Supercomputer" turns out to be at least close to the performance they claim! Stay tuned for more information as it gets closer to launch (hopefully more details will emerge at GTC 2017 in the US).
What are your thoughts on Xavier and the whole self-driving car future?
- NVIDIA Teases Xavier, a High-Performance ARM SoC for Drive PX & AI @ AnandTech
- Tegra Related News @ PC Perspective
- Tesla P4 Specifications @ NVIDIA
- CES 2016: NVIDIA Launches DRIVE PX 2 With Dual Pascal GPUs Driving A Deep Neural Network @ PC Perspective
Is Enterprise Ascending Outside of Consumer Viability?
So a couple of weeks have gone by since the Quadro P6000 (update: was announced) and the new Titan X launched. With them, we received a new chip: GP102. Since Fermi, NVIDIA has labeled their GPU designs with a G, followed by a single letter for the architecture (F, K, M, or P for Fermi, Kepler, Maxwell, and Pascal, respectively), which is then followed by a three-digit number. The last digit is the most relevant one, however, as it separates designs by their intended size.
Typically, 0 corresponds to a ~550-600mm2 design, which is about as large a design as fabs can create without error-prone techniques, like multiple exposures (update for clarity: trying to precisely overlap multiple patterns to form a larger integrated circuit). 4 corresponds to ~300mm2, although GM204 was pretty large at 398mm2, likely to increase the core count while remaining on the 28nm process. Higher numbers, like 6 or 7, fill out the lower-end SKUs until NVIDIA essentially stops caring for that generation. So when we moved to Pascal, jumping two whole process nodes, NVIDIA looked at its wristwatch and said "about time to make another 300mm2 part, I guess?"
The GTX 1080 and the GTX 1070 (GP104, 314mm2) were born.
NVIDIA already announced a 600mm2 part, though. The GP100 had 3840 CUDA cores, HBM2 memory, and an ideal ratio of 1:2:4 between FP64:FP32:FP16 performance. (A 64-bit chunk of memory can store one 64-bit value, two 32-bit values, or four 16-bit values, unless the register is attached to logic circuits that, while smaller, don't know how to operate on the data.) This increased ratio, even over Kepler's 1:6 FP64:FP32, is great for GPU compute, but wasted die area for today's (and tomorrow's) games. I'm predicting that it takes the wind out of Intel's sales, as Xeon Phi's 1:2 FP64:FP32 performance ratio is one of its major selling points, leading to its inclusion in many supercomputers.
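The storage half of that ratio argument is easy to demonstrate: the same 64 bits really do hold one FP64 value, two FP32 values, or four FP16 values. A minimal sketch with NumPy, reinterpreting one 64-bit word at different precisions:

```python
import numpy as np

# One 64-bit chunk of memory...
word = np.zeros(1, dtype=np.float64)
print(word.nbytes)  # 8 bytes total

# ...viewed as FP32 holds two values, viewed as FP16 holds four.
print(word.view(np.float32).size)  # 2
print(word.view(np.float16).size)  # 4
```

The 1:2:4 throughput ratio on GP100 is the compute analogue: the registers and data paths are wide enough that the FP32 units can be fed two FP16 values at once, provided the attached logic knows how to operate on the packed data.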
Despite the HBM2 memory controller supposedly being smaller than a GDDR5(X) controller, NVIDIA could still save die space while providing 3840 CUDA cores (despite disabling a few on Titan X). The trade-off is that FP64 and FP16 performance had to decrease dramatically, from 1:2 and 2:1 relative to FP32 all the way down to 1:32 and 1:64. This new design comes in at 471mm2, although it launched $200 more expensive than the 600mm2 products, GK110 and GM200, did. Smaller dies provide more products per wafer and, better, since defects land at a roughly constant rate per wafer, a smaller fraction of those dies is ruined.
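That yield intuition can be sketched with a textbook Poisson defect model. The die areas below are the ones discussed in this article; the defect density is an illustrative guess, not a real TSMC figure:

```python
import math

# Poisson yield model: P(die has zero defects) = exp(-defect_density * area).
# The defect density is a hypothetical assumption for illustration only.
DEFECTS_PER_MM2 = 0.001  # i.e. 0.1 defects per square centimeter

for name, area_mm2 in [("GP104", 314), ("GP102", 471), ("GP100", 610)]:
    yield_fraction = math.exp(-DEFECTS_PER_MM2 * area_mm2)
    print(f"{name} ({area_mm2} mm^2): {yield_fraction:.1%} good dies")
```

Under any defect density, the exponential means the 600mm2-class part loses a visibly larger share of its wafer than the ~300mm2 part, on top of simply fitting fewer candidates per wafer.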
Anyway, that aside, it puts NVIDIA in an interesting position. Splitting the xx0-class chip into xx0 and xx2 designs allows NVIDIA to lower the cost of their high-end gaming parts, although it cuts out hobbyists who buy a Titan for double-precision compute. More interestingly, it leaves around 150mm2 for AMD to sneak in a design that's FP32-centric, leaving them a potential performance crown.
Image Credit: ExtremeTech
On the other hand, as fabrication node changes are becoming less frequent, it's possible that NVIDIA could be leaving itself room for Volta, too. Last month, it was rumored that NVIDIA would release two architectures at 16nm, in the same way that Maxwell shared 28nm with Kepler. In this case, Volta, on top of whatever other architectural advancements NVIDIA rolls into that design, can also grow a little in size. At that time, TSMC would have better yields, making a 600mm2 design less costly in terms of waste and recovery.
If this is the case, we could see the GPGPU folks receiving a new architecture once every second gaming (and professional graphics) architecture. That is, unless you are a hobbyist. If you are? I would need to be wrong, or NVIDIA would need to somehow bring their enterprise SKU into an affordable price point. The xx0 class seems to have been pushed up and out of viability for consumers.
Or, again, I could just be wrong.
Subject: General Tech | July 21, 2016 - 12:21 PM | Ryan Shrout
Tagged: Wraith, Volta, video, time spy, softbank, riotoro, retroarch, podcast, nvidia, new, kaby lake, Intel, gtx 1060, geforce, asynchronous compute, async compute, arm, apollo lake, amd, 3dmark, 10nm, 1070m, 1060m
PC Perspective Podcast #409 - 07/21/2016
Join us this week as we discuss the GTX 1060 review, controversy surrounding the async compute of 3DMark Time Spy and more!!
The URL for the podcast is: http://pcper.com/podcast - Share with your friends!
- iTunes - Subscribe to the podcast directly through the Store (audio only)
- Google Play - Subscribe to our audio podcast directly through Google Play!
- RSS - Subscribe through your regular RSS reader (audio only)
- MP3 - Direct download link to the MP3 file
This episode of the PC Perspective Podcast is sponsored by Casper!
Hosts: Ryan Shrout, Allyn Malventano, Jeremy Hellstrom, and Josh Walrath
Subject: Graphics Cards | July 16, 2016 - 06:37 PM | Scott Michaud
Tagged: Volta, pascal, nvidia, maxwell, 16nm
For the past few generations, NVIDIA has roughly followed a pattern of releasing a new architecture on a new process node, then releasing a refresh the following year. This ran into a hitch when Maxwell was delayed a year, apart from the GTX 750 Ti, and then pushed back to the same 28nm process that Kepler utilized. Pascal caught up with 16nm, although we know that some hard, physical limitations are right around the corner. The lattice spacing for silicon at room temperature is around ~0.5nm, so we're talking about features on the order of the low 30s of atoms in width.
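The arithmetic behind that "low 30s of atoms" figure is a one-liner (keeping in mind that "16nm" is a marketing label, so this is an order-of-magnitude estimate, not a measurement of any actual transistor feature):

```python
# Rough arithmetic: how many silicon lattice spacings fit across a 16nm feature?
FEATURE_NM = 16.0
LATTICE_SPACING_NM = 0.5  # silicon lattice spacing at room temperature, as above

cells_across = FEATURE_NM / LATTICE_SPACING_NM
print(f"~{cells_across:.0f} lattice spacings across a 16nm feature")  # ~32
```

There are only so many more halvings available before a "feature" is a countable handful of atoms, which is why the node cadence is slowing.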
This rumor claims that NVIDIA is not trying to go with 10nm for Volta. Instead, it will take place on the same 16nm node that Pascal is currently occupying. This is quite interesting, because GPUs scale quite well with complexity changes, as they are built from many parallel units running at a relatively low clock rate, so the only real ways to increase performance are to make the existing architecture more efficient or to make a larger chip.
That said, GP100 leaves a lot of room on the table for an FP32-optimized, ~600mm2 part to crush its performance at the high end, similar to how GM200 replaced GK110. The rumored GP102, expected in the ~450mm2 range for Titan or GTX 1080 Ti-style parts, has some room to grow. Like GM200, however, it would also be unappealing to GPU compute users who need FP64. If this is what is going on, and we're totally just speculating at the moment, it would signal that enterprise customers should expect a new GPGPU card every second gaming generation.
That is, of course, unless NVIDIA recognized ways to make the Maxwell-based architecture significantly more die-space efficient in Volta. Clocks could get higher, or the circuits themselves could get simpler. You would think that, especially in the latter case, they would have integrated those ideas into Maxwell and Pascal, though; but, like HBM2 memory, there might have been a reason why they couldn't.
We'll need to wait and see. The entire rumor could be crap, who knows?
Subject: General Tech, Systems | November 27, 2014 - 08:53 PM | Scott Michaud
Tagged: nvidia, IBM, power9, Volta
The Oak Ridge National Laboratory has been interested in a successor for their Titan Supercomputer. Sponsored by the US Department of Energy, the new computer will be based on NVIDIA's Volta (GPU) and IBM's POWER9 (CPU) architectures. Its official name will be “Summit”, and it will have a little sibling, “Sierra”. Sierra, also based on Volta and POWER9, will be installed at the Lawrence Livermore National Laboratory.
Image Credit: NVIDIA
The main feature of these supercomputers is expected to be "NVLink", which is said to allow unified memory between CPU and GPU. This means that, if you have a workload that alternates rapidly between serial and parallel tasks, you can save the lag of transferring memory at each switch. One example would be a series of for-each loops over a large data set with a bit of logic, checks, and conditional branches between them. Memory management acts like a lag between each chunk of work, especially across two banks of memory attached by a slow bus.
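A toy cost model makes the argument concrete. All of the numbers below are hypothetical placeholders, and real unified memory is not free, so treat this as a sketch of why per-switch copies dominate, not a benchmark:

```python
# Toy model: a workload that ping-pongs between serial (CPU) and parallel
# (GPU) phases. Every number here is a hypothetical placeholder.
compute_ms_per_phase = 2.0    # time spent actually computing, per phase
transfer_ms_per_switch = 1.5  # cost of copying data over a slow bus, per switch
phases = 100

with_copies = phases * (compute_ms_per_phase + transfer_ms_per_switch)
unified = phases * compute_ms_per_phase  # shared memory: no per-switch copy

print(f"with per-switch copies: {with_copies:.0f} ms")
print(f"unified memory:         {unified:.0f} ms")
```

The more frequently the workload switches, the larger the share of wall-clock time the copies eat, which is exactly the case NVLink's unified memory is pitched at.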
Summit and Sierra are both built by IBM, while Titan, Oak Ridge's previous supercomputer, was developed by Cray. Not much is known about the specifics of Sierra, but Summit will be about 5x-10x faster (peak computational throughput) than its predecessor at less than a fifth of the nodes. Despite the fewer nodes, it will suck down more total power (~10MW, up from Titan's ~9MW).
These two supercomputers are worth $325 million USD (combined). They are expected to go online in 2017. According to Reuters, an additional $100 million USD will go toward research into "extreme" supercomputing.
Subject: General Tech | June 19, 2013 - 09:51 PM | Josh Walrath
Tagged: Volta, nvidia, maxwell, licensing, kepler, Denver, Blogs, arm
Yesterday we all saw the blog piece from NVIDIA that stated that they were going to start licensing their IP to interested third parties. Obviously, there was a lot of discussion about this particular move. Some were in favor, some were opposed, and others yet thought that NVIDIA is now simply roadkill. I believe that it is an interesting move, but we are not yet sure of the exact details or the repercussions of such a decision on NVIDIA’s part.
The biggest bombshell of the entire post was that NVIDIA would be licensing out their latest architecture to interested clients. The Kepler architecture powers the very latest GTX 700 series of cards and at the top end it is considered one of the fastest and most efficient architectures out there. Seemingly, there is a price for this though. Time to dig a little deeper.
Kepler will be the first technology licensed to third party manufacturers. We will not see full GPUs, these will only be integrated into mobile products.
The very latest Tegra parts from NVIDIA do not feature the Kepler architecture for the graphics portion. Instead, the units featured in Tegra can almost be described as GeForce 7000 series parts: the computational units are split between pixel shaders and vertex shaders, and they support at most D3D feature level 9_3 and OpenGL ES 2.0. This is a far cry from a unified shader architecture with support for the latest D3D 11 and OpenGL ES 3.0 specifications. Other mobile products feature the latest Mali and Adreno series of graphics units, which are unified and support DX11 and OpenGL ES 3.0.
So why exactly do the latest Tegras not share the Kepler architecture? Hard to say. It could be a variety of factors, including time to market, available engineering teams, and simulations which could dictate whether power and performance are better served by a less complex unit. Kepler is not simple. A Kepler unit that occupies the same die space could potentially consume more power with any given workload, or conversely it could perform poorly given the same power envelope.
We can look at the desktop side of this argument for some kind of proof. At the top end, Kepler is a champ. The GTX 680/770 has outstanding performance and consumes far less power than the competition from AMD. When we move down a notch to the GTX 660 Ti/HD 7800 series of cards, we see much greater parity in performance and power consumption. Comparing the HD 7790 to the 650 Ti Boost, we see the Boost part deliver slightly better performance while consuming significantly more power. Then we move down to the 650 and 650 Ti, and these parts do not consume any more power than the competing AMD parts, but they also perform much more poorly. I know these are some pretty hefty generalizations, and the engineers at NVIDIA could very effectively port Kepler over to mobile applications without significant performance or power penalties. But so far, we have not seen this work.
Power, performance, and die area aside, there is also another issue to factor in: NVIDIA just announced that they are doing this. We have no idea how long this effort has been going on, but it is very likely that it has only been worked on for the past six months. In that time NVIDIA needs to hammer out how they are going to license the technology, how much manpower they must provide licensees to get those parts up and running, and what kind of fees they are going to charge. There is a lot of work going on there, and this is not a simple undertaking.
So let us assume that some three months ago an interested partner such as Rockchip or Samsung came knocking on NVIDIA's door. They work out the licensing agreements, and this takes several months. Then we start to see the transfer of technology between the companies. Obviously Samsung and Rockchip are not going to apply this graphics architecture to currently shipping products, but will instead bundle it in with a next generation ARM based design. These designs are not spun out overnight. For example, the 64-bit ARMv8 designs have been finalized for around a year, and we do not expect to see initial parts shipping until late 1H 2014. So any partner that decides to utilize NVIDIA's Kepler architecture for such an application will not see that part released until 1H 2015 at the very earliest.
Shield is still based on a GPU possessing separate pixel and vertex shaders. DX11 and OpenGL ES 3.0? Nope!
If someone decides to license this technology from NVIDIA, it will not be of great concern. The next generation of NVIDIA graphics will already be out by that time, and we could very well be approaching the next iteration for the desktop side. NVIDIA plans on releasing a Kepler based mobile unit in 2014 (Logan), which would be a full year in advance of any competing product. In 2015 NVIDIA is planning on releasing an ARM product based on the Denver CPU and Maxwell GPU. So we can easily see that NVIDIA will only be licensing out an older generation product so it will not face direct competition when it comes to GPUs. NVIDIA obviously is hoping that their GPU tech will still be a step ahead of that of ARM (Mali), Qualcomm (Adreno), and Imagination Technologies (PowerVR).
This is an easy and relatively pain-free way to test the waters that ARM, Imagination Technologies, and AMD are already treading. ARM only licenses IP and has shown the world that it can not only succeed at it, but thrive. Imagination Tech used to produce their own chips much like NVIDIA does, but they changed direction and continue to be profitable. AMD recently opened up about their semi-custom design group, which will design specific products for customers and then license those designs out. I do not think this is a desperation move by NVIDIA, but it certainly is one that is probably a little late in coming. The mobile market is exploding, and we are approaching a time where nearly every electricity-based item will have some kind of logic included in it; billions of chips a year will be sold. NVIDIA obviously wants a piece of that market. Even a small piece of "billions" is going to be significant to the bottom line.