GTC 2018: Nvidia and ARM Integrating NVDLA Into Project Trillium For Inferencing at the Edge

Subject: General Tech | March 29, 2018 - 03:10 PM |
Tagged: project trillium, nvidia, machine learning, iot, GTC 2018, GTC, deep learning, arm, ai

During GTC 2018, NVIDIA and ARM announced a partnership that will see ARM integrate NVIDIA's NVDLA deep learning inferencing accelerator into the company's Project Trillium machine learning processors. The NVIDIA Deep Learning Accelerator (NVDLA) is an open source, modular architecture optimized specifically for inferencing operations such as object and voice recognition. Bringing that acceleration to the wider ARM ecosystem through Project Trillium will enable a massive number of smarter phones, tablets, Internet-of-Things, and embedded devices to do inferencing at the edge, which is to say without the complexity and latency of having to rely on cloud processing. This means potentially smarter voice assistants (e.g. Alexa, Google Assistant), doorbell cameras, lighting, and security around the home, and, out and about on your phone, better AR, natural translation, and assistive technologies.

NVIDIAandARM_NVDLA.jpg

Karl Freund, lead analyst for deep learning at Moor Insights & Strategy, was quoted in the press release as stating:

“This is a win/win for IoT, mobile and embedded chip companies looking to design accelerated AI inferencing solutions. NVIDIA is the clear leader in ML training and Arm is the leader in IoT end points, so it makes a lot of sense for them to partner on IP.”

ARM's Project Trillium, announced back in February, is a suite of processor IP optimized for parallel, low latency workloads; it includes a Machine Learning processor, an Object Detection processor, and neural network software libraries. NVDLA, first implemented in the Xavier SoC, is a highly modular and configurable hardware and software platform that can feature a convolution core, a single data processor, a planar data processor, a channel data processor, and data reshape engines. The NVDLA can be configured with all or only some of those elements, and chip designers can independently scale them up or down depending on what processing acceleration their devices need. NVDLA connects to the main system processor over a control interface and through two AXI memory interfaces (one optional) that connect to system memory and, optionally, dedicated high bandwidth memory (not necessarily HBM, but its own SRAM, for example).
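To get a feel for that modularity, here is a rough sketch, in Python and purely for illustration, of the kinds of knobs a scaled-down NVDLA instance exposes. The field names are hypothetical; the real project describes hardware configurations in its own spec files on GitHub.

```python
# Hypothetical description of a scaled-down NVDLA instance; the actual project
# defines hardware configurations in its own spec files on GitHub.
nvdla_small = {
    "convolution_core": {"mac_units": 64, "precision": ["int8"]},
    "single_data_processor": True,    # activation functions
    "planar_data_processor": True,    # pooling
    "channel_data_processor": False,  # normalization, trimmed to save area
    "data_reshape_engine": False,
    "memory_interfaces": {
        "system_axi": True,           # primary path to system memory
        "dedicated_sram_axi": False,  # the optional second interface
    },
}
```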

arm project trillium integrates NVDLA.jpg

NVDLA is presented as a free and open source architecture that promotes a standard way to design deep learning inferencing hardware that can accelerate operations to infer results from trained neural networks (with the training being done on other devices, perhaps by the DGX-2). The project, which hosts the code on GitHub and encourages community contributions, goes beyond the Xavier-based hardware and includes drivers, libraries, upcoming TensorRT support for accelerating Google's TensorFlow, testing suites and SDKs, a deep learning training infrastructure (for the training side of things) that is compatible with the NVDLA software and hardware, and system integration support.

Bringing the "smarts" of smart devices to the local hardware and closer to the users should mean much better performance, and using specialized accelerators will reportedly offer the performance levels needed without blowing through low power budgets. Internet-of-Things (IoT) and mobile devices are not going away any time soon, and the partnership between NVIDIA and ARM should make it easier for developers and chip companies to offer smarter (and, please tell me, more secure!) smart devices.

Source: NVIDIA
Manufacturer: Microsoft

It's all fun and games until something something AI.

Microsoft announced the Windows Machine Learning (WinML) API about two weeks ago, but they did so in a sort-of abstract context. This week, alongside the 2018 Game Developers Conference, they are grounding it in a practical application: video games!

microsoft-2018-winml-graphic.png

Specifically, the API provides the mechanisms for game developers to run inference on the target machine. The trained models that it runs against would be in the Open Neural Network Exchange (ONNX) format backed by Microsoft, Facebook, and Amazon. As the initial announcement suggested, it can be used for any application, not just games, but… you know. If you want to get a technology off the ground, and it requires a high-end GPU, then video game enthusiasts are good lead users. When run in a DirectX application, WinML kernels are queued on the DirectX 12 compute queue.
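For a sense of the workflow, a model trained in a framework such as PyTorch gets exported to ONNX and then handed to the inference runtime on the target machine. The sketch below covers only the export step, using PyTorch's standard ONNX exporter rather than WinML itself (which is a WinRT API), with an assumed model and input size:

```python
# Minimal sketch: export a trained model to ONNX so a runtime such as WinML
# can load it for on-device inference. Model choice and input shape are
# illustrative assumptions, not taken from Microsoft's announcement.
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)  # one 224x224 RGB image

torch.onnx.export(model, dummy_input, "resnet18.onnx",
                  input_names=["image"], output_names=["logits"])
```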

We’ve discussed the concept before. When you’re rendering a video game, simulating an accurate scenario isn’t your goal – the goal is to look like you are. The direct way of looking like you’re doing something is to do it. The problem is that some effects are too slow (or, sometimes, too complicated) to correctly simulate. In these cases, it might be viable to make a deep-learning AI hallucinate a convincing result, even though no actual simulation took place.

Fluid dynamics, global illumination, and up-scaling are three examples.

Previously mentioned SIGGRAPH demo of fluid simulation without fluid simulation...
... just a trained AI hallucinating a scene based on input parameters.

Another place where AI could be useful is… well… AI. One way of making game AI is to give it a set of data from the game environment, often including information that a player in its position would not be able to know, and have it run against a branching logic tree. Deep learning, on the other hand, can train itself on billions of examples of good and bad play, and produce results based on input parameters. While the two methods do not sound that different, the difference between logic being designed by hand and logic being assembled from an abstract good/bad dataset somewhat abstracts away the potential for bad assumptions and programmer error. Of course, it abstracts that potential for error into the training dataset, but that’s a whole other discussion.
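In very rough terms, the contrast looks something like this (a hypothetical sketch; real game AI and trained policies are far more involved):

```python
# Hand-designed branching logic: every assumption is written out explicitly.
def scripted_ai(state):
    if state["enemy_visible"] and state["health"] > 50:
        return "attack"
    elif state["health"] <= 50:
        return "retreat"
    return "patrol"

# Learned policy: the same decision comes out of weights fitted to a dataset
# of good and bad play; the assumptions live in the data, not the code.
# 'policy_model' is a stand-in for any trained classifier (e.g. scikit-learn).
def learned_ai(state, policy_model):
    features = [float(state["enemy_visible"]), state["health"] / 100.0]
    return policy_model.predict([features])[0]
```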

The third area where AI could be useful is in creating the game itself.

There’s a lot of grunt and grind work when developing a video game. Licensing prefab solutions (or commissioning someone to do a one-off asset for you) helps ease this burden, but that gets expensive in terms of both time and money. If some of those assets could be created by giving parameters to a deep-learning AI, then those are assets that you would not need to make, allowing you to focus on other assets and how they all fit together.

These are three of the use cases that Microsoft is aiming WinML at.

nvidia-2018-deeplearningcarupscale.png

Sure, these are smooth curves of large details, but the antialiasing pattern looks almost perfect.

For instance, Microsoft is pointing to an NVIDIA demo where they up-sample a photo of a car, once with bilinear filtering and once with a machine learning algorithm (although not WinML-based). The bilinear algorithm behaves exactly as someone who has used Photoshop would expect. The machine learning algorithm, however, was able to identify the objects that the image intended to represent, and it drew the edges that it thought made sense.
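The bilinear half of that comparison is trivial to reproduce; the learned half needs a trained super-resolution network, which appears below only as a commented-out placeholder (a hedged sketch with a hypothetical model loader, not NVIDIA's demo code):

```python
# Bilinear upscaling simply interpolates between neighboring pixels, which is
# why edges come out soft; a trained super-resolution network instead predicts
# the edges it believes the image was meant to contain.
from PIL import Image

low_res = Image.open("car.png")  # illustrative input file
w, h = low_res.size

# Classic approach: 4x bilinear interpolation.
bilinear = low_res.resize((w * 4, h * 4), resample=Image.BILINEAR)
bilinear.save("car_bilinear.png")

# Learned approach (placeholder): a trained super-resolution model that maps
# the low-res pixels to a 4x larger image. 'load_sr_model' is hypothetical.
# sr_model = load_sr_model("espcn_x4.onnx")
# upscaled = sr_model.upscale(low_res)
```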

microsoft-2018-gdc-PIX.png

As with their DirectX Raytracing (DXR) announcement, Microsoft plans to have PIX support WinML “on Day 1”. As for partners? They are currently working with Unity Technologies to provide WinML support in Unity’s ML-Agents plug-in. That’s all the game industry partners they have announced at the moment, though. It’ll be interesting to see who jumps in and who doesn’t over the next couple of years.

AMD, a little too far ahead of the curve again?

Subject: General Tech | December 27, 2017 - 11:42 AM |
Tagged: nvidia, Intel, HBM2, deep learning

AMD has never been afraid to try new things, from hitting 1GHz first, to creating a true multicore processor, to most recently adopting HBM and HBM2 in its graphics cards.  That move contributed to some of the company's recent difficulties with the current generation of GPUs; HBM is more expensive to produce and more of a challenge to implement.  While they were the first to implement HBM, it is NVIDIA and Intel that are benefiting from AMD's experimental nature.  Their new generation of HPC solutions, the Tesla P100, Quadro GP100, and Lake Crest, all use HBM2 and benefit from the experience Hynix, Samsung, and TSMC gained fabbing the first generation.  Vega products offer slightly less memory bandwidth and lag behind in overall performance, a drawback of being first.

On a positive note, AMD has now had more experience designing chips which make use of HBM, and this could offer a new hope for the next generation of cards, both gaming and HPC flavours.  DigiTimes briefly covers the two processes manufacturers use in the production of HBM here.

_id1460366655_343178_1.jpg

"However, Intel's release of its deep-learning chip, Lake Crest, which came following its acquisition of Nervana, has come with HMB2. This indicates that HBM-based architecture will be the main development direction of memory solutions for HPC solutions by GPU vendors."

Source: DigiTimes

How deep is your learning?

Recently, we've had some hands-on time with NVIDIA's new TITAN V graphics card. Equipped with the GV100 GPU, the TITAN V has shown us some impressive results in both gaming and GPGPU compute workloads.

However, one of the most interesting areas that NVIDIA has been touting for GV100 has been deep learning. With a 1.33x increase in single-precision FP32 compute over the Titan Xp, and the addition of specialized Tensor Cores for deep learning, the TITAN V is well positioned for deep learning workflows.

In mathematics, a tensor is a multi-dimensional array of numerical values with respect to a given basis. While we won't go deep into the math behind it, Tensors are a crucial data structure for deep learning applications.

07.jpg

NVIDIA's Tensor Cores aim to accelerate tensor-based math by performing mixed-precision matrix multiply-accumulate operations, multiplying half-precision FP16 values and accumulating the results in FP32. The GV100 GPU contains 640 of these Tensor Cores to accelerate FP16 neural network training.
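Conceptually, each Tensor Core performs a small matrix multiply-accumulate in which the inputs are FP16 but the accumulation happens at FP32; the snippet below illustrates that mixed-precision scheme in numpy (an illustration of the arithmetic only, not of the hardware's actual operation or throughput):

```python
# Illustration of the mixed-precision idea behind Tensor Cores: inputs are
# stored in FP16, but products are accumulated in FP32 to limit rounding error.
import numpy as np

a = np.random.rand(4, 4).astype(np.float16)
b = np.random.rand(4, 4).astype(np.float16)
c = np.random.rand(4, 4).astype(np.float32)

# D = A * B + C, with the multiply-accumulate carried out in FP32.
d = np.matmul(a.astype(np.float32), b.astype(np.float32)) + c
print(d.dtype)  # float32
```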

It's worth noting that this is not the first hardware dedicated to tensor operations; others, such as Google with its Tensor Processing Unit, have developed hardware for these specific functions.

Test Setup

PC Perspective Deep Learning Testbed
Processor       AMD Ryzen Threadripper 1920X
Motherboard     GIGABYTE X399 AORUS Gaming 7
Memory          64GB Corsair Vengeance RGB DDR4-3000
Storage         Samsung SSD 960 Pro 2TB
Power Supply    Corsair AX1500i 1500 watt
OS              Ubuntu 16.04.3 LTS
Drivers         AMD: AMD GPU Pro 17.50 / NVIDIA: 387.34

For our NVIDIA testing, we used the NVIDIA GPU Cloud 17.12 Docker containers for both TensorFlow and Caffe2 inside of our Ubuntu 16.04.3 host operating system.

AMD testing was done using the hiptensorflow port from the AMD ROCm GitHub repositories.

For all tests, we are using the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) data set.
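For context, throughput in tests like these is typically reported as images per second over a timed run of training steps. The sketch below shows the general shape of such a measurement, using the TensorFlow 1.x API with synthetic data and a trivial stand-in model; it is not one of the actual benchmark scripts we ran:

```python
# Simplified throughput measurement: time a fixed number of training steps on
# synthetic "images" and report images/second. Real tests use full networks
# such as ResNet-50 and real ImageNet data.
import time
import tensorflow as tf  # TensorFlow 1.x style API

BATCH, STEPS = 64, 100
images = tf.random_normal([BATCH, 224, 224, 3])
labels = tf.random_uniform([BATCH], maxval=1000, dtype=tf.int32)

logits = tf.layers.dense(tf.layers.flatten(images), 1000)  # toy model
loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)  # warm-up step
    start = time.time()
    for _ in range(STEPS):
        sess.run(train_op)
    elapsed = time.time() - start
    print("images/sec: %.1f" % (BATCH * STEPS / elapsed))
```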

Continue reading our look at deep learning performance with the NVIDIA Titan V!!

Intel Sheds More Light On Benefits of Nervana Neural Network Processor

Subject: General Tech, Processors | December 12, 2017 - 04:52 PM |
Tagged: training, nnp, nervana, Intel, flexpoint, deep learning, asic, artificial intelligence

Intel recently provided a few insights into its upcoming Nervana Neural Network Processor (NNP) on its blog. Developed in partnership with deep learning startup Nervana Systems, which Intel acquired last year for over $400 million, the AI-focused chip previously codenamed Lake Crest is built on a new architecture designed from the ground up to accelerate neural network training and AI modeling.

new_nervana_chip-fb.jpg

The full details of the Intel NNP are still unknown, but it is a custom ASIC with a tensor-based architecture placed on a multi-chip module (MCM) along with 32GB of HBM2 memory. The Nervana NNP supports optimized and power efficient Flexpoint math, and interconnectivity is a huge part of this scalable platform. Each AI accelerator features 12 processing clusters (with an as-yet-unannounced number of "cores" or processing elements) paired with 12 proprietary inter-chip links that are 20 times faster than PCI-E, four HBM2 memory controllers, a management-controller CPU, as well as standard SPI, I2C, GPIO, PCI-E x16, and DMA I/O. The processor is designed to be highly configurable and to meet both model and data parallelism goals.

The processing elements are all software controlled and can communicate with each other using high speed bi-directional links at up to a terabit per second. Each processing element has more than 2MB of local memory, and the Nervana NNP has 30MB of local memory in total. Memory accesses and data sharing are managed by QoS software, which controls adjustable bandwidth over multiple virtual channels with multiple priorities per channel. Processing elements can send and receive data between each other and the HBM2 stacks locally, as well as off die to processing elements and HBM2 on other NNP chips. The idea is to allow as much internal sharing as possible and to keep data stored and transformed in local memory as much as possible. That saves precious HBM2 bandwidth (1TB/s) for pre-fetching upcoming tensors, reduces the number of hops and the resulting latency by not having to go out to HBM2 memory and back to transfer data between cores and/or processors, and saves power. This setup also helps Intel achieve an extremely parallel and scalable platform where multiple Nervana NNP co-processors on the same and remote boards effectively act as a massive singular compute unit!

Intel Lake Crest Block Diagram.jpg
 

Intel's Flexpoint is also at the heart of the Nervana NNP and allegedly allows Intel to achieve results similar to FP32 with half the memory footprint (effectively twice the memory bandwidth) while being more power efficient than FP16. Flexpoint is used for the scalar math required for deep learning and relies on fixed point 16-bit multiply and addition operations with a shared 5-bit exponent. Unlike FP16, Flexpoint uses all 16 bits for the mantissa and passes the shared exponent in the instruction. The NNP architecture also features zero cycle transpose operations and optimizations for matrix multiplication and convolutions to make efficient use of silicon.
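To make the shared-exponent idea concrete, here is a rough numerical sketch of a Flexpoint-style encoding (a simplification based on the public description, not Intel's actual implementation):

```python
# Rough illustration of a Flexpoint-style format: a whole tensor shares one
# exponent, and each element stores only a 16-bit integer mantissa.
import numpy as np

def encode_shared_exponent(tensor, mantissa_bits=16):
    # One exponent for the whole tensor, chosen so the largest magnitude
    # still fits in the signed integer mantissa range.
    max_int = 2 ** (mantissa_bits - 1) - 1  # 32767 for int16
    exponent = int(np.ceil(np.log2(np.max(np.abs(tensor)) / max_int)))
    mantissas = np.round(tensor / 2.0 ** exponent).astype(np.int16)
    return mantissas, exponent

def decode_shared_exponent(mantissas, exponent):
    return mantissas.astype(np.float32) * 2.0 ** exponent

x = np.random.randn(4).astype(np.float32)
m, e = encode_shared_exponent(x)
print(x)
print(decode_shared_exponent(m, e))  # close to x; error grows if values vary widely
```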

Software control allows users to dial in the performance for their specific workloads, and since many of the math operations and data movement are known or expected in advance, users can keep data as close to the compute units working on that data as possible while minimizing HBM2 memory accesses and data movements across the die to prevent congestion and optimize power usage.

Intel is currently working with Facebook and hopes to have its deep learning products out early next year. The company may have axed Knights Hill, but it is far from giving up on this extremely lucrative market as it continues to push towards exascale computing and AI. Intel is pushing for a 100x increase in neural network performance by 2020, which is a tall order, but Intel throwing its weight around in this ring should give GPU makers pause, as such an achievement could cut heavily into their GPGPU-powered entries into a market that is only just starting to heat up.

You won't be running Crysis or even Minecraft on this thing, but soon enough the augmented reality software on your phone, or the self-driving systems in your car, might be running inference routines on a neural network that was trained on one of these chips! It's specialized and niche, but still very interesting.

Source: Intel

OTOY Discussed AI Denoising at Unite Austin

Subject: General Tech | October 4, 2017 - 08:59 PM |
Tagged: 3D rendering, otoy, Unity, deep learning

When raytracing images, sample count has a massive impact on both quality and rendering performance. It corresponds to the number of rays cast within a pixel which, when averaged over many, many rays, eventually converges on what the pixel should be. Think of it this way: if your first ray bounces directly into a bright light, and the second ray bounces into the vacuum of space, should the color be white? Black? Half-grey? Who knows! However, if you send 1000 rays with some randomized pattern, then the average is probably a lot closer to what it should be (which depends on how big the light is, what it bounces off of, etc.).
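A toy Monte Carlo estimate shows why sample count matters so much (an illustrative sketch, not OTOY's renderer): with only a handful of rays per pixel the estimate is noisy, and the noise shrinks only as roughly the square root of the number of samples.

```python
# Toy Monte Carlo pixel estimate: each "ray" randomly hits a bright light
# (value 1.0) with 25% probability or escapes to darkness (0.0). The true
# pixel value is 0.25; low sample counts give noisy estimates.
import random

def estimate_pixel(samples, hit_probability=0.25):
    total = 0.0
    for _ in range(samples):
        total += 1.0 if random.random() < hit_probability else 0.0
    return total / samples

random.seed(0)
for n in (4, 16, 256, 4096):
    print(n, round(estimate_pixel(n), 3))
```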

At Unite Austin, which started today, OTOY showed off an “AI temporal denoiser” algorithm for raytraced footage. Typically, an artist chooses a sample rate that looks good enough to the end viewer. In this case, the artist only needs to choose enough samples that an AI can create a good-enough video for the end user. While I’m curious how much performance is required in the inferencing stage, I do know how much a drop in sample rate can affect render times, and it’s a lot.

Check out OTOY’s video, embedded above.

Unity Labs Announces Global Research Fellowship

Subject: General Tech | June 28, 2017 - 11:17 PM |
Tagged: Unity, machine learning, deep learning

Unity, who makes the popular 3D game engine of the same name, has announced a research fellowship for integrating machine learning into game development. Two students, who must have been enrolled in a Masters or a PhD program on June 26th, will be selected and provided with $30,000 for a 6-month fellowship. The deadline is midnight (PDT) on September 9th.

unity-logo-rgb.png

We’re beginning to see a lot of machine-learning applications being discussed for gaming. There are some cases, like global illumination and fluid simulation, where a deep-learning algorithm could hallucinate a convincing result faster than a physical solver could produce a correct one. In these cases it makes sense to post-process each frame, so, naturally, game engine developers are paying attention.

If eligible, you can apply on their website.

Source: Unity

Fluid Simulations via Machine Learning Demo for SIGGRAPH

Subject: General Tech | May 29, 2017 - 08:46 PM |
Tagged: machine learning, fluid, deep neural network, deep learning

SIGGRAPH 2017 is still a few months away, but we’re already starting to see demos get published as groups try to get them accepted to various parts of the trade show. In this case, Physics Forests published a two-minute video where they perform fluid simulations without actually simulating fluid dynamics. Instead, they used a deep-learning AI to hallucinate a convincing fluid dynamics result given their inputs.

We’re seeing a lot of research into deep-learning AIs for complex graphics effects lately. The goal of most of these simulations, whether they are for movies or video games, is to create an effect that convinces the viewer that what they see is realistic; the goal is not to create an actually realistic effect. The question then becomes, “Is it easier to actually solve the problem? Or is it easier to have an AI, trained on a pile of data sorted into successes and failures, come up with an answer that looks correct to the viewer?”

In a lot of cases, like global illumination and even possibly anti-aliasing, it might be faster to have an AI trick you. Fluid dynamics is just one example.

Alphabet Inc.'s DeepMind Working on StarCraft II AI

Subject: General Tech | November 4, 2016 - 02:55 PM |
Tagged: blizzard, google, ai, deep learning, Starcraft II

Blizzard and DeepMind, which was acquired by Google in 2014 and is now a subsidiary of Alphabet Inc., have just announced opening up StarCraft II for AI research. DeepMind was the company that made AlphaGo, which beat Lee Sedol, a grandmaster of Go, in a best-of-five showmatch with a score of four to one. They hinted at possibly having a BlizzCon champion, some year, do a showmatch as well, which would be entertaining.

blizzard-2016-blizzcon-deepmindgoogle.jpg

StarCraft II is different from Go in three important ways. First, any given player only knows what they have scouted, a constraint these AIs will apparently be required to honor. Second, there are three possible match-ups for any choice of race, except random, which has nine. Third, it's real-time, which can be good for AI, because they're not constrained by human input limitations, but also difficult from a performance standpoint.

From Blizzard's perspective, better AI can be useful, because humans need to be challenged to learn. Novices won't be embarrassed to lose to a computer over and over, so they can have a human-like opponent to experiment with. Likewise, grandmasters will want to have someone better than them to keep advancing, especially if it allows them to keep new strategies hidden. From DeepMind's perspective, this is another step in AI research, which could be applied to science, medicine, and so forth in the coming years and decades.

Unfortunately, this is an early announcement. We don't know any more details, although they will have a Blizzcon panel on Saturday at 1pm EDT (10am PDT).

Source: Blizzard

NVIDIA Teases Low Power, High Performance Xavier SoC That Will Power Future Autonomous Vehicles

Subject: Processors | October 1, 2016 - 06:11 PM |
Tagged: xavier, Volta, tegra, SoC, nvidia, machine learning, gpu, drive px 2, deep neural network, deep learning

Earlier this week at its first GTC Europe event in Amsterdam, NVIDIA CEO Jen-Hsun Huang teased a new SoC code-named Xavier that will be used in self-driving cars and feature the company's newest custom ARM CPU cores and Volta GPU. The new chip will begin sampling at the end of 2017 with product releases using the future Tegra (if they keep that name) processor as soon as 2018.

NVIDIA_Xavier_SOC.jpg

NVIDIA's Xavier is promised to be the successor to the company's Drive PX 2 system, which uses two Tegra X2 SoCs and two discrete Pascal MXM GPUs on a single water cooled platform. The claims are even more impressive when you consider that NVIDIA is not only promising to replace those four processors with a single chip, but reportedly to do so at 20W, less than a tenth of the Drive PX 2's TDP!

The company has not revealed all the nitty-gritty details, but they did tease out a few bits of information. The new processor will feature 7 billion transistors, will be based on a refined 16nm FinFET process, and will consume a mere 20W. It can process two 8K HDR video streams and can hit 20 TOPS (NVIDIA's own rating for deep learning INT8 operations).

Specifically, NVIDIA claims that the Xavier SoC will use eight custom ARMv8 (64-bit) CPU cores (it is unclear whether these cores will be a refined Denver architecture or something else) and a GPU based on its upcoming Volta architecture with 512 CUDA cores. Also, in an interesting twist, NVIDIA is including a "Computer Vision Accelerator" on the SoC as well, though the company did not go into many details. This bit of silicon may explain how the ~300mm2 die with 7 billion transistors is able to match the 7.2 billion transistor Pascal-based Tesla P4 (2560 CUDA cores) graphics card at deep learning (tera-operations per second) tasks; that is, of course, in addition to the incremental improvements from moving to Volta and new ARMv8 CPU architectures on a refined 16nm FF+ process.

             | Drive PX | Drive PX 2 | NVIDIA Xavier | Tesla P4
CPU          | 2 x Tegra X1 (8 x A57 total) | 2 x Tegra X2 (8 x A57 + 4 x Denver total) | 1 x Xavier SoC (8 x custom ARM + 1 x CVA) | N/A
GPU          | 2 x Tegra X1 GPUs (Maxwell, 512 CUDA cores total) | 2 x Tegra X2 GPUs + 2 x Pascal GPUs | 1 x Xavier SoC GPU (Volta, 512 CUDA cores) | 2560 CUDA cores (Pascal)
TFLOPS       | 2.3 TFLOPS | 8 TFLOPS | ? | 5.5 TFLOPS
DL TOPS      | ? | 24 TOPS | 20 TOPS | 22 TOPS
TDP          | ~30W (2 x 15W) | 250W | 20W | up to 75W
Process Tech | 20nm | 16nm FinFET | 16nm FinFET+ | 16nm FinFET
Transistors  | ? | ? | 7 billion | 7.2 billion

For comparison, the currently available Tesla P4 based on the Pascal architecture has a TDP of up to 75W and is rated at 22 TOPS. This would suggest that Volta is a much more efficient architecture (at least for deep learning and half precision)! I am not sure how NVIDIA is able to match its GP104 with only 512 Volta CUDA cores, though their definition of a "core" could have changed and/or the CVA processor may be responsible for closing that gap. Unfortunately, NVIDIA did not disclose what it rates the Xavier at in TFLOPS, so it is difficult to compare, and it may not match GP104 at higher precision workloads; it could be wholly optimized for INT8 operations rather than floating point performance. Beyond that, I will let Scott dive into those particulars once we have more information!
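Running the quick math on those figures shows why the claim raises eyebrows (simple arithmetic on the peak numbers quoted above, not measured performance):

```python
# Performance-per-watt comparison using the peak deep learning ratings and
# TDPs quoted above (peak figures, not measured sustained throughput).
chips = {
    "Drive PX 2": (24, 250),   # (DL TOPS, TDP in watts)
    "Xavier":     (20, 20),
    "Tesla P4":   (22, 75),
}
for name, (tops, watts) in chips.items():
    print("%-10s %.2f TOPS/W" % (name, tops / watts))
# Xavier works out to 1.00 TOPS/W vs. ~0.29 for Tesla P4 and ~0.10 for Drive PX 2.
```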

Xavier is more of a teaser than anything, and the chip could very well change dramatically and/or not hit the claimed performance targets. Still, it sounds promising, and it is always nice to speculate over road maps. It is an intriguing chip and I am ready for more details, especially on the Volta GPU and just what exactly that Computer Vision Accelerator is (and will it be easy to program for?). I am a big fan of the "self-driving car" and I hope that it succeeds. The trend certainly looks set to continue as Tesla, VW, BMW, and other automakers push the envelope of what is possible and plan future cars that will include smart driving assists and even cars that can drive themselves. The more local computing power we can throw at automobiles the better; while massive datacenters can be used to train the neural networks, local hardware to run them and make decisions is necessary (you don't want internet latency contributing to the decision of whether to brake or not!).

I hope that NVIDIA's self-proclaimed "AI Supercomputer" turns out to be at least close to the performance they claim! Stay tuned for more information as it gets closer to launch (hopefully more details will emerge at GTC 2017 in the US).

What are your thoughts on Xavier and the whole self-driving car future?

Source: NVIDIA