AMD, a little too far ahead of the curve again?

Subject: General Tech | December 27, 2017 - 11:42 AM |
Tagged: nvidia, Intel, HBM2, deep learning

AMD has never been afraid to try new things, from hitting 1GHz first, to creating a true multicore processor, to most recently adopting HBM and HBM2 in its graphics cards.  That move contributed to some of the company's difficulties with the current generation of GPUs; HBM is more expensive to produce and more of a challenge to implement.  While AMD was the first to implement HBM, it is NVIDIA and Intel that are benefiting from AMD's experimental nature.  Their new generations of HPC solutions, the Tesla P100, Quadro GP100, and Lake Crest, all use HBM2 and benefit from the experience Hynix, Samsung, and TSMC gained fabbing the first generation.  Vega products offer slightly less memory bandwidth as well as lagging overall performance, a drawback of being first.

On a positive note, AMD has now had more experience designing chips which make use of HBM, and this could offer a new hope for the next generation of cards, both gaming and HPC flavours.  DigiTimes briefly covers the two processes manufacturers use in the production of HBM here.

_id1460366655_343178_1.jpg

"However, Intel's release of its deep-learning chip, Lake Crest, which came following its acquisition of Nervana, has come with HBM2. This indicates that HBM-based architecture will be the main development direction of memory solutions for HPC solutions by GPU vendors."

Here is some more Tech News from around the web:

Tech Talk

 

Source: DigiTimes

How deep is your learning?

Recently, we've had some hands-on time with NVIDIA's new TITAN V graphics card. Equipped with the GV100 GPU, the TITAN V has shown us some impressive results in both gaming and GPGPU compute workloads.

However, one of the most interesting areas that NVIDIA has been touting for GV100 has been deep learning. With a 1.33x increase in single-precision FP32 compute over the Titan Xp, and the addition of specialized Tensor Cores for deep learning, the TITAN V is well positioned for deep learning workflows.

In mathematics, a tensor is a multi-dimensional array of numerical values with respect to a given basis. While we won't go deep into the math behind it, Tensors are a crucial data structure for deep learning applications.
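For the curious, a tensor in code is nothing exotic. Here is a minimal sketch using NumPy (our choice for illustration here, not part of the test setup below): a tensor is simply an n-dimensional array, and deep learning frameworks shuttle data between layers in exactly this form.

```python
import numpy as np

# A rank-3 tensor: think of a batch of 2 grayscale images, each 4x4 pixels.
batch = np.zeros((2, 4, 4), dtype=np.float32)

# A rank-4 tensor (batch, channels, height, width) is the typical layout
# for image data flowing through a convolutional neural network.
images = np.random.rand(8, 3, 224, 224).astype(np.float32)

print(batch.ndim)    # 3 -- the rank (number of dimensions)
print(images.shape)  # (8, 3, 224, 224)
```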

07.jpg

NVIDIA's Tensor Cores aim to accelerate Tensor-based math by performing small matrix multiply-accumulate operations with half-precision FP16 inputs and FP32 accumulation. The GV100 GPU contains 640 of these Tensor Cores to accelerate FP16 neural network training.
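As a rough sketch of the operation itself (a NumPy illustration of the published D = A×B + C primitive, not NVIDIA's implementation), each Tensor Core multiplies two 4x4 FP16 matrices and accumulates the result in FP32 to preserve precision:

```python
import numpy as np

# Two 4x4 half-precision input matrices, as a Tensor Core would consume.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)  # FP32 accumulator

# FP16 inputs, FP32 multiply-accumulate: D = A x B + C.
D = A.astype(np.float32) @ B.astype(np.float32) + C

print(D.dtype)  # float32
```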

It's worth noting that this is not the first hardware dedicated to Tensor operations; others, such as Google, have developed hardware for these specific functions.

Test Setup

PC Perspective Deep Learning Testbed
Processor: AMD Ryzen Threadripper 1920X
Motherboard: GIGABYTE X399 AORUS Gaming 7
Memory: 64GB Corsair Vengeance RGB DDR4-3000
Storage: Samsung SSD 960 Pro 2TB
Power Supply: Corsair AX1500i 1500 watt
OS: Ubuntu 16.04.3 LTS
Drivers: AMD GPU Pro 17.50 (AMD); 387.34 (NVIDIA)

For our NVIDIA testing, we used the NVIDIA GPU Cloud 17.12 Docker containers for both TensorFlow and Caffe2 inside of our Ubuntu 16.04.3 host operating system.

AMD testing was done using the hiptensorflow port from the AMD ROCm GitHub repositories.

For all tests, we are using the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) data set.

Continue reading our look at deep learning performance with the NVIDIA TITAN V!

Intel Sheds More Light On Benefits of Nervana Neural Network Processor

Subject: General Tech, Processors | December 12, 2017 - 04:52 PM |
Tagged: training, nnp, nervana, Intel, flexpoint, deep learning, asic, artificial intelligence

Intel recently provided a few insights into its upcoming Nervana Neural Network Processor (NNP) on its blog. Developed in partnership with deep learning startup Nervana Systems, which Intel acquired last year for over $400 million, the AI-focused chip, previously codenamed Lake Crest, is built on a new architecture designed from the ground up to accelerate neural network training and AI modeling.

new_nervana_chip-fb.jpg

The full details of the Intel NNP are still unknown, but it is a custom ASIC with a Tensor-based architecture placed on a multi-chip module (MCM) along with 32GB of HBM2 memory. The Nervana NNP supports optimized and power efficient Flexpoint math, and interconnectivity is a major focus of this scalable platform. Each AI accelerator features 12 processing clusters (with an as-yet-unannounced number of "cores" or processing elements) paired with 12 proprietary inter-chip links that are 20 times faster than PCI-E, four HBM2 memory controllers, a management-controller CPU, as well as standard SPI, I2C, GPIO, PCI-E x16, and DMA I/O. The processor is designed to be highly configurable and to meet both model and data parallelism goals.

The processing elements are all software controlled and can communicate with each other over high speed bi-directional links at up to a terabit per second. Each processing element has more than 2MB of local memory, for a total of 30MB across the Nervana NNP. Memory accesses and data sharing are managed by QoS software, which controls adjustable bandwidth over multiple virtual channels with multiple priorities per channel. Processing elements can send and receive data between each other and the HBM2 stacks locally, as well as off die to processing elements and HBM2 on other NNP chips. The idea is to keep as much data stored and transformed in local memory as possible: this saves precious HBM2 bandwidth (1TB/s) for pre-fetching upcoming tensors, cuts the hops and latency of round trips out to HBM2 when transferring data between cores and/or processors, and saves power. This setup also helps Intel achieve an extremely parallel and scalable platform where multiple Nervana NNP co-processors, on the same board or on remote boards, effectively act as a massive singular compute unit!

Intel Lake Crest Block Diagram.jpg
 

Intel's Flexpoint is also at the heart of the Nervana NNP and allegedly allows Intel to achieve results similar to FP32 with twice the memory bandwidth, while being more power efficient than FP16. Flexpoint is used for the scalar math required for deep learning and uses fixed point 16-bit multiply and addition operations with a shared 5-bit exponent. Unlike FP16, Flexpoint uses all 16 bits for the mantissa and passes the exponent in the instruction. The NNP architecture also features zero cycle transpose operations and optimizations for matrix multiplication and convolutions to optimize silicon usage.
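To make the idea concrete, here is a hypothetical Python sketch of a Flexpoint-style encoding. The function names and edge-case handling are our own illustration, not Intel's: every element of a tensor is stored as a 16-bit integer mantissa, and a single power-of-two exponent is shared by the whole tensor.

```python
import numpy as np

def flex_encode(x):
    # Pick the smallest power-of-two scale that fits the largest magnitude
    # in the tensor into a signed 16-bit mantissa (max 32767).
    max_mag = float(np.abs(x).max())
    exp = int(np.ceil(np.log2(max_mag / 32767.0))) if max_mag > 0 else 0
    mant = np.round(x / 2.0**exp).astype(np.int16)
    return mant, exp  # one shared exponent for the entire tensor

def flex_decode(mant, exp):
    # Reconstruct: mantissa times the shared power-of-two scale.
    return mant.astype(np.float32) * np.float32(2.0**exp)

x = np.array([0.5, -1.25, 3.75], dtype=np.float32)
mant, exp = flex_encode(x)
y = flex_decode(mant, exp)  # recovers the originals (exact for these values)
```

Note the trade-off this sketch exposes: all elements share one scale, so a tensor with a wide dynamic range loses precision in its small values, which is why the format suits deep learning tensors whose values tend to cluster.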

Software control allows users to dial in the performance for their specific workloads, and since many of the math operations and data movement are known or expected in advance, users can keep data as close to the compute units working on that data as possible while minimizing HBM2 memory accesses and data movements across the die to prevent congestion and optimize power usage.

Intel is currently working with Facebook and hopes to have its deep learning products out early next year. The company may have axed Knights Hill, but it is far from giving up on this extremely lucrative market as it continues to push towards exascale computing and AI. Intel is targeting a 100x increase in neural network performance by 2020, which is a tall order, but Intel throwing its weight around in this ring should give GPU makers pause: such an achievement could cut heavily into their GPGPU-powered entries in a market that is only just starting to heat up.

You won't be running Crysis or even Minecraft on this thing, but you might be using software on your phone for augmented reality or in your autonomous car that is running inference routines on a neural network that was trained on one of these chips soon enough! It's specialized and niche, but still very interesting.

Also read:

Source: Intel

OTOY Discussed AI Denoising at Unite Austin

Subject: General Tech | October 4, 2017 - 08:59 PM |
Tagged: 3D rendering, otoy, Unity, deep learning

When raytracing images, sample count has a massive impact on both quality and rendering performance. This corresponds to the number of rays cast within a pixel, which, when averaged out over many, many rays, eventually converges on what the pixel should be. Think of it this way: if your first ray bounces directly into a bright light, and the second ray bounces into the vacuum of space, should the color be white? Black? Half-grey? Who knows! However, if you send 1000 rays with some randomized pattern, then the average is probably a lot closer to what it should be (which depends on how big the light is, what it bounces off of, etc.).
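To put numbers to that intuition, here is a toy Monte Carlo sketch in Python (our own illustration, nothing to do with OTOY's renderer): each sample either hits the light or misses, and the pixel value is the average. With a handful of samples the estimate is noisy; with many, it converges on the true value, here 0.25.

```python
import random

def shade_pixel(samples, p_hit=0.25, rng=random.Random(42)):
    # Each "ray" hits the light with probability p_hit; the pixel value
    # is the fraction of rays that hit, i.e. the Monte Carlo average.
    hits = sum(1 for _ in range(samples) if rng.random() < p_hit)
    return hits / samples

noisy = shade_pixel(4)        # few samples: could be 0.0, 0.25, 0.5...
smooth = shade_pixel(100000)  # many samples: very close to 0.25
```

The error of the estimate shrinks with the square root of the sample count, which is exactly why a denoiser that tolerates low sample rates saves so much render time.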

At Unite Austin, which started today, OTOY showed off an “AI temporal denoiser” algorithm for raytraced footage. Typically, an artist chooses a sample rate that looks good enough to the end viewer. In this case, the artist only needs to choose enough samples that an AI can create a good-enough video for the end user. While I’m curious how much performance is required in the inferencing stage, I do know how much a drop in sample rate can affect render times, and it’s a lot.

Check out OTOY's video, embedded above.

Unity Labs Announces Global Research Fellowship

Subject: General Tech | June 28, 2017 - 11:17 PM |
Tagged: Unity, machine learning, deep learning

Unity, who makes the popular 3D game engine of the same name, has announced a research fellowship for integrating machine learning into game development. Two students, who must have been enrolled in a Masters or a PhD program on June 26th, will be selected and provided with $30,000 for a 6-month fellowship. The deadline is midnight (PDT) on September 9th.

unity-logo-rgb.png

We’re beginning to see a lot of machine-learning applications being discussed for gaming. There are some cases, like global illumination and fluid simulations, where it could be faster for a deep-learning algorithm to hallucinate a convincing result than for a physical solver to produce a correct one. In cases like these, it makes sense to post-process each frame, so, naturally, game engine developers are paying attention.

If eligible, you can apply on their website.

Source: Unity

Fluid Simulations via Machine Learning Demo for SIGGRAPH

Subject: General Tech | May 29, 2017 - 08:46 PM |
Tagged: machine learning, fluid, deep neural network, deep learning

SIGGRAPH 2017 is still a few months away, but we’re already starting to see demos get published as groups try to get them accepted to various parts of the trade show. In this case, Physics Forests published a two-minute video where they perform fluid simulations without actually simulating fluid dynamics. Instead, they used a deep-learning AI to hallucinate a convincing fluid dynamics result given their inputs.

We’re seeing a lot of research into deep-learning AIs for complex graphics effects lately. The goal of most of these simulations, whether they are for movies or video games, is to create an effect that convinces the viewer that what they see is realistic. The goal is not to create an actually realistic effect. The question then becomes: is it easier to actually solve the problem, or to have an AI, trained on a pile of data sorted into successes and failures, come up with an answer that merely looks correct to the viewer?

In a lot of cases, like global illumination and even possibly anti-aliasing, it might be faster to have an AI trick you. Fluid dynamics is just one example.

Alphabet Inc.'s DeepMind Working on StarCraft II AI

Subject: General Tech | November 4, 2016 - 02:55 PM |
Tagged: blizzard, google, ai, deep learning, Starcraft II

Blizzard and DeepMind, which was acquired by Google in 2014 and is now a subsidiary of Alphabet Inc., have just announced opening up StarCraft II for AI research. DeepMind is the company that made AlphaGo, which beat Lee Sedol, a grandmaster of Go, in a best-of-five showmatch by a score of four to one. They hinted that, some year, a BlizzCon champion might play a showmatch as well, which would be entertaining.

blizzard-2016-blizzcon-deepmindgoogle.jpg

StarCraft II is different from Go in three important ways. First, it is a game of imperfect information: a player only knows what they have scouted, a constraint these AIs will apparently be required to honor. Second, there are three possible match-ups for any choice of race, except random, which has nine. Third, it's real-time, which can be good for AI, because they're not limited by human input rates, but which is also difficult from a performance standpoint.

From Blizzard's perspective, better AI can be useful, because humans need to be challenged to learn. Novices won't be embarrassed to lose to a computer over and over, so they can have a human-like opponent to experiment with. Likewise, grandmasters will want to have someone better than them to keep advancing, especially if it allows them to keep new strategies hidden. From DeepMind's perspective, this is another step in AI research, which could be applied to science, medicine, and so forth in the coming years and decades.

Unfortunately, this is an early announcement. We don't know any more details, although they will have a BlizzCon panel on Saturday at 1pm EDT (10am PDT).

Source: Blizzard

NVIDIA Teases Low Power, High Performance Xavier SoC That Will Power Future Autonomous Vehicles

Subject: Processors | October 1, 2016 - 06:11 PM |
Tagged: xavier, Volta, tegra, SoC, nvidia, machine learning, gpu, drive px 2, deep neural network, deep learning

Earlier this week at its first GTC Europe event in Amsterdam, NVIDIA CEO Jen-Hsun Huang teased a new SoC code-named Xavier that will be used in self-driving cars and feature the company's newest custom ARM CPU cores and Volta GPU. The new chip will begin sampling at the end of 2017 with product releases using the future Tegra (if they keep that name) processor as soon as 2018.

NVIDIA_Xavier_SOC.jpg

NVIDIA's Xavier is promised to be the successor to the company's Drive PX 2 system, which uses two Tegra X2 SoCs and two discrete Pascal MXM GPUs on a single water-cooled platform. The claims are even more impressive considering that NVIDIA is promising not only to replace those four processors but reportedly to do so at 20W – less than a tenth of the TDP!

The company has not revealed all the nitty-gritty details, but it did tease out a few bits of information. The new processor will feature 7 billion transistors, will be based on a refined 16nm FinFET process, and will consume a mere 20W. It can process two 8K HDR video streams and can hit 20 TOPS (NVIDIA's own rating for deep learning INT8 operations).

Specifically, NVIDIA claims that the Xavier SoC will use eight custom ARMv8 (64-bit) CPU cores (it is unclear whether these will be a refined Denver architecture or something else) and a GPU based on its upcoming Volta architecture with 512 CUDA cores. Also, in an interesting twist, NVIDIA is including a "Computer Vision Accelerator" on the SoC, though the company did not go into many details. This bit of silicon, along with the incremental improvements from moving to Volta and a new ARMv8 CPU architecture on a refined 16nm FF+ process, may explain how the ~300mm2 die with 7 billion transistors is able to match the 7.2-billion-transistor Pascal-based Tesla P4 (2560 CUDA cores) graphics card at deep learning (tera-operations per second) tasks.

             | Drive PX                      | Drive PX 2                                | NVIDIA Xavier                             | Tesla P4
CPU          | 2 x Tegra X1 (8 x A57 total)  | 2 x Tegra X2 (8 x A57 + 4 x Denver total) | 1 x Xavier SoC (8 x Custom ARM + 1 x CVA) | N/A
GPU          | 2 x Tegra X1 (Maxwell) (512 CUDA cores total) | 2 x Tegra X2 GPUs + 2 x Pascal GPUs | 1 x Xavier SoC GPU (Volta) (512 CUDA cores) | 2560 CUDA cores (Pascal)
TFLOPS       | 2.3 TFLOPS                    | 8 TFLOPS                                  | ?                                         | 5.5 TFLOPS
DL TOPS      | ?                             | 24 TOPS                                   | 20 TOPS                                   | 22 TOPS
TDP          | ~30W (2 x 15W)                | 250W                                      | 20W                                       | up to 75W
Process Tech | 20nm                          | 16nm FinFET                               | 16nm FinFET+                              | 16nm FinFET
Transistors  | ?                             | ?                                         | 7 billion                                 | 7.2 billion

For comparison, the currently available Tesla P4, based on the Pascal architecture, has a TDP of up to 75W and is rated at 22 TOPS. This would suggest that Volta is a much more efficient architecture (at least for deep learning and half precision)! I am not sure how NVIDIA is able to match its GP104 with only 512 Volta CUDA cores, though their definition of a "core" may have changed and/or the CVA processor may be responsible for closing that gap. Unfortunately, NVIDIA did not disclose a TFLOPS rating for Xavier, so it is difficult to compare, and it may not match GP104 at higher-precision workloads; it could be wholly optimized for INT8 operations rather than floating point performance. Beyond that I will let Scott dive into those particulars once we have more information!
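Running the quoted figures through a quick performance-per-watt comparison makes the efficiency claim concrete (simple arithmetic on NVIDIA's own numbers, not independent measurements):

```python
# Deep learning throughput per watt, from the figures NVIDIA quoted.
xavier_tops, xavier_tdp = 20, 20      # Xavier: 20 TOPS at 20 W
tesla_p4_tops, tesla_p4_tdp = 22, 75  # Tesla P4: 22 TOPS at up to 75 W

xavier_eff = xavier_tops / xavier_tdp  # 1.0 TOPS per watt
p4_eff = tesla_p4_tops / tesla_p4_tdp  # roughly 0.29 TOPS per watt

print(round(xavier_eff / p4_eff, 1))  # 3.4 -- Xavier's claimed advantage
```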

Xavier is more of a teaser than anything, and the chip could very well change dramatically and/or miss its claimed performance targets. Still, it sounds promising, and it is always nice to speculate over road maps. It is an intriguing chip, and I am ready for more details, especially on the Volta GPU and just what exactly that Computer Vision Accelerator is (and will it be easy to program for?). I am a big fan of the "self-driving car" and I hope that it succeeds, and the momentum certainly looks set to continue as Tesla, VW, BMW, and other automakers push the envelope of what is possible and plan future cars with smart driving assists, and even cars that can drive themselves. The more local computing power we can throw at automobiles the better; massive datacenters can be used to train the neural networks, but local hardware is needed to run them and make decisions (you don't want internet latency factoring into the decision of whether or not to brake!).

I hope that NVIDIA's self-proclaimed "AI Supercomputer" turns out to be at least close to the performance they claim! Stay tuned for more information as it gets closer to launch (hopefully more details will emerge at GTC 2017 in the US).

What are your thoughts on Xavier and the whole self-driving car future?

Also read:

Source: NVIDIA

AMD Brings Dual Fiji and HBM Memory To Server Room With FirePro S9300 x2

Subject: Graphics Cards | April 5, 2016 - 02:13 AM |
Tagged: HPC, hbm, gpgpu, firepro s9300x2, firepro, dual fiji, deep learning, big data, amd

Earlier this month AMD launched a dual Fiji powerhouse for VR gamers it is calling the Radeon Pro Duo. Now, AMD is bringing its latest GCN architecture and HBM memory to servers with the dual GPU FirePro S9300 x2.

AMD Firepro S9300x2 Server HPC Card.jpg

The new server-bound professional graphics card packs an impressive amount of computing hardware into a dual-slot card with passive cooling. The FirePro S9300 x2 combines two full Fiji GPUs clocked at 850 MHz, for a total of 8,192 cores, 512 TUs, and 128 ROPs. Each GPU is paired with 4GB of non-ECC HBM memory on package with 512GB/s of memory bandwidth, which AMD adds together to advertise this as the first professional graphics card with 1TB/s of memory bandwidth.
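For those wondering where that advertised figure comes from, the arithmetic is straightforward (a quick sketch; double-data-rate signaling at a 500 MHz clock is how first-generation HBM reaches 1 Gbps per pin):

```python
# Per-GPU HBM bandwidth on Fiji: a 4096-bit bus at 500 MHz, with data
# moving on both clock edges (DDR), converted from bits/s to GB/s.
bus_bits = 4096
clock_hz = 500e6
ddr = 2  # two transfers per clock cycle

per_gpu_gbs = bus_bits * clock_hz * ddr / 8 / 1e9  # bytes per second -> GB/s
card_total = 2 * per_gpu_gbs  # two Fiji GPUs on the S9300 x2

print(per_gpu_gbs)  # 512.0
print(card_total)   # 1024.0, marketed as 1TB/s
```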

Due to its lower clockspeeds, the S9300 x2 has lower peak single precision compute performance than the consumer Radeon Pro Duo: 13.9 TFLOPS versus 16 TFLOPS for the desktop card. Businesses will be able to cram more cards into their rack-mounted servers, though, since they do not need to worry about mounting locations for the Radeon card's sealed-loop water cooling.

                 | FirePro S9300 x2       | Radeon Pro Duo         | R9 Fury X      | FirePro S9170
GPU              | Dual Fiji              | Dual Fiji              | Fiji           | Hawaii
GPU Cores        | 8192 (2 x 4096)        | 8192 (2 x 4096)        | 4096           | 2816
Rated Clock      | 850 MHz                | 1050 MHz               | 1050 MHz       | 930 MHz
Texture Units    | 2 x 256                | 2 x 256                | 256            | 176
ROP Units        | 2 x 64                 | 2 x 64                 | 64             | 64
Memory           | 8GB (2 x 4GB)          | 8GB (2 x 4GB)          | 4GB            | 32GB ECC
Memory Clock     | 500 MHz                | 500 MHz                | 500 MHz        | 5000 MHz
Memory Interface | 4096-bit (HBM) per GPU | 4096-bit (HBM) per GPU | 4096-bit (HBM) | 512-bit
Memory Bandwidth | 1TB/s (2 x 512GB/s)    | 1TB/s (2 x 512GB/s)    | 512 GB/s       | 320 GB/s
TDP              | 300 watts              | ?                      | 275 watts      | 275 watts
Peak Compute     | 13.9 TFLOPS            | 16 TFLOPS              | 8.60 TFLOPS    | 5.24 TFLOPS
Transistor Count | 17.8B                  | 17.8B                  | 8.9B           | 8.0B
Process Tech     | 28nm                   | 28nm                   | 28nm           | 28nm
Cooling          | Passive                | Liquid                 | Liquid         | Passive
MSRP             | $6000                  | $1499                  | $649           | $4000

AMD is aiming this card at datacenter and HPC users working on "big data" tasks that do not require the accuracy of double precision floating point calculations. Deep learning tasks, seismic processing, and data analytics are all examples AMD says the dual GPU card will excel at. These are all tasks that can be greatly accelerated by the massively parallel nature of a GPU but do not need the precision of the stricter mathematics, modeling, and simulation work that depends on FP64 performance. In that respect, the FirePro S9300 x2 offers only 870 GFLOPS of double precision compute performance.

Further, this card supports a GPGPU-optimized Linux driver stack under AMD's GPUOpen initiative, and developers can program for it using either OpenCL (it supports OpenCL 1.2) or C++. AMD PowerTune and the return of FP16 support are also features. AMD claims that its new dual GPU card is twice as fast as the NVIDIA Tesla M40 (and 1.6x the K80) and 12 times as fast as the latest Intel Xeon E5 in peak single precision floating point performance.

The double slot card is powered by two PCI-E power connectors and is rated at 300 watts. This is a bit more palatable than the triple 8-pin needed for the Radeon Pro Duo!

The FirePro S9300 x2 comes with a 3-year warranty and will be available in the second half of this year for $6000 USD. You are definitely paying a premium for the professional certifications and support. Here's hoping developers come up with some cool uses for the dual 8.9-billion-transistor GPUs and their included HBM memory!

Source: AMD

Microsoft Lets Anyone "Git" Their Deep Learning On With Open Source CNTK

Subject: General Tech | February 4, 2016 - 01:18 PM |
Tagged: open source, microsoft, machine learning, deep neural network, deep learning, cntk, azure

Microsoft has been using deep neural networks for a while now to power the speech recognition technologies bundled into Windows and Skype, which identify and follow commands and translate speech, respectively. This technology is part of Microsoft's Computational Network Toolkit. Last April, the company made the toolkit available to academic researchers on CodePlex, and it is now opening it up even more by moving the project to GitHub and placing it under an open source license.

Led by chief speech and computer scientist Xuedong Huang, a team of Microsoft researchers built the Computational Network Toolkit (CNTK) to power all of their speech-related projects. CNTK is a deep neural network toolkit for machine learning that is built to be fast and scalable across multiple systems and, more importantly, multiple GPUs, which excel at these kinds of parallel processing workloads and algorithms.

Microsoft focused heavily on scalability with CNTK. According to the company's own benchmarks (which is to say, take them with a healthy dose of salt), the major competing neural network toolkits offer similar performance running on a single GPU, but when adding more than one graphics card CNTK is vastly more efficient: almost four times the performance of Google's TensorFlow and a bit more than 1.5 times that of Torch 7 and Caffe. Where CNTK gets a bit deep learning crazy is its ability to scale beyond a single system and easily tap into Microsoft's Azure GPU Lab for access to numerous GPUs in remote datacenters. It's not free, but you don't need to purchase, store, and power the hardware locally, and you can ramp the GPU count up and down based on how much muscle you need. The example Microsoft provided showed two similarly spec'd four-GPU Linux systems running on Azure cloud hosting getting close to twice the performance of a single 4 GPU system (a 75% increase). Microsoft claims that "CNTK can easily scale beyond 8 GPUs across multiple machines with superior distributed system performance."
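As a quick sanity check on that Azure example, the scaling efficiency works out like this (simple arithmetic on Microsoft's quoted 75% figure, not our own benchmark):

```python
# Doubling from one 4-GPU machine to two gave a 75% speedup, i.e. 1.75x
# throughput on 2x the hardware.
single_node = 1.0
dual_node = 1.75  # "close to twice the performance (75% increase)"

# Fraction of perfect linear scaling actually achieved.
scaling_efficiency = dual_node / (2 * single_node)
print(scaling_efficiency)  # 0.875, i.e. 87.5% of ideal
```

Losing only 12.5% to inter-node communication is a respectable result for distributed training, which is presumably the point Microsoft's benchmark was making.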

cntk-speed-comparison.png

Using GPU-based Azure machines, Microsoft was able to increase the performance of Cortana's speech recognition by 10 times compared to the local systems it was previously using.

It is always cool to see GPU compute in practice, and now that CNTK is available to everyone, I expect to see a lot of new uses for the toolkit beyond speech recognition. Moving to an open source license is certainly good PR, but I think it was actually done more for Microsoft's own benefit than for users', which isn't necessarily a bad thing, since both get to benefit from it. I am really interested to see what researchers are able to do with a deep neural network toolkit that reportedly offers so much performance thanks to GPUs, and I'm curious what new kinds of machine learning opportunities the extra speed will enable.

If you are interested, you can check out CNTK on GitHub!

Source: Microsoft