Subject: General Tech | October 4, 2018 - 09:58 PM | Tim Verry
Tagged: Xilinx, FPGA, hardware acceleration, big data, HPC, neural network, ai inference, inference
During the Xilinx Developer Forum in San Jose earlier this week, Xilinx showed off a server built in partnership with AMD that uses FPGA-based hardware acceleration cards to set a GoogLeNet inference record, hitting up to 30,000 images per second in total high-performance AI inference throughput. GoogLeNet is a 22-layer-deep convolutional neural network that began as a project for the ImageNet Large Scale Visual Recognition Challenge in 2014.
Xilinx achieved this throughput while keeping latency low by using eight of its Alveo U250 acceleration add-in cards, which use FPGAs based on its 16nm UltraScale architecture. The cards are hosted by a dual-socket AMD server motherboard with two Epyc 7551 processors and eight channels of DDR4 memory.
The AMD-based system has two 32-core (64-thread) Zen architecture processors (180W each), clocked at 2 GHz (2.55 GHz all-core turbo and 3 GHz maximum turbo) with 64 MB of L3 cache, memory controllers supporting up to 2TB of DDR4 memory per socket (341 GB/s of bandwidth in a two-socket configuration), and 128 PCI-Express lanes.
The Xilinx Alveo U250 cards offer up to 33.3 INT8 TOPS and feature 54MB of SRAM (38TB/s) and 64GB of off-chip memory (77GB/s). Interfaces include a PCI-E 3.0 x16 connection as well as two QSFP28 (100GbE) connections. The cards are rated at a 225W TDP and cost a whopping $12,995 MSRP each. The eight FPGA cards alone push the system into the six-figure range before including the Epyc server CPUs, all that system memory, and the other base components.
You are not likely to see this system in your next Tesla any time soon, but it is a nice proof of concept of what future technology generations may be able to achieve at much more economical price points when used for AI inference tasks in everyday life (driver assistance, medical imaging, big data analytics driving market research that influences consumer pricing, etc.).
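The system-level figures quoted above can be tallied with a few lines of arithmetic. The following is a minimal sketch that uses only the per-card and per-system numbers from the article. Two things in it are assumptions rather than statements from the article: the per-card throughput assumes the 30,000 images per second divides evenly across the eight cards, and the memory bandwidth check assumes DDR4-2666 on eight 64-bit channels per socket. The function names are made up for illustration.

```python
# Figures quoted in the article for the record-setting system.
NUM_CARDS = 8
CARD_MSRP_USD = 12_995
CARD_INT8_TOPS = 33.3
CARD_TDP_WATTS = 225
SYSTEM_IMAGES_PER_SEC = 30_000

# Assumptions (NOT stated in the article): DDR4-2666 memory on
# eight 64-bit (8-byte) channels per socket, two sockets.
SOCKETS = 2
CHANNELS_PER_SOCKET = 8
DDR4_MEGATRANSFERS_PER_SEC = 2666
BYTES_PER_TRANSFER = 8


def accelerator_totals():
    """Sum the per-card figures across all eight Alveo U250 cards."""
    return {
        "cards_cost_usd": NUM_CARDS * CARD_MSRP_USD,
        "peak_int8_tops": NUM_CARDS * CARD_INT8_TOPS,
        "cards_tdp_watts": NUM_CARDS * CARD_TDP_WATTS,
        # Assumes throughput scales linearly with card count.
        "images_per_sec_per_card": SYSTEM_IMAGES_PER_SEC / NUM_CARDS,
    }


def peak_memory_bandwidth_gb_s():
    """Theoretical peak DDR4 bandwidth for the two-socket host, in GB/s."""
    channels = SOCKETS * CHANNELS_PER_SOCKET
    mb_per_sec = channels * DDR4_MEGATRANSFERS_PER_SEC * BYTES_PER_TRANSFER
    return mb_per_sec / 1000


if __name__ == "__main__":
    for name, value in accelerator_totals().items():
        print(f"{name}: {value}")
    print(f"peak_memory_bandwidth_gb_s: {peak_memory_bandwidth_gb_s():.1f}")
```

Under these assumptions the cards alone come to $103,960 and 1,800W of rated TDP, with about 3,750 images per second per card, and the memory bandwidth estimate lands close to the 341 GB/s figure quoted for the two-socket configuration.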
Interestingly, this system may hold the current record, but the record is not likely to last very long, even against Xilinx’s own hardware. Specifically, Xilinx’s Versal ACAP cards (set to release in the second half of next year) are slated to hit up to 150W TDPs (in the add-in-card models) while being up to eight times faster than Xilinx’s previous FPGAs. The Versal ACAPs will use TSMC's 7nm FinFET node and will combine scalar processing engines (ARM CPUs), adaptable hardware engines (FPGAs with a new full software stack and much faster on-the-fly dynamic reconfiguration), and AI engines (DSPs, SIMD vector cores, and dedicated fixed-function units for inference tasks) with a Network on Chip (NoC) and customizable memory hierarchy.
Xilinx also has fierce competition on its hands in this huge AI/machine learning/deep neural network market from Intel/Altera and its Stratix FPGAs, AMD and NVIDIA with their GPUs and new AI-focused cores, and other specialty hardware accelerator manufacturers, including Google with its TPUs. (There's also ARM's Project Trillium for mobile.) I am interested to see where the AI inference performance bar will be set by this time next year!
Intro and NNEF 1.0 Finalization
SIGGRAPH is a huge computer graphics expo held each year in a seemingly random host city around North America. (Asia has a sister event, called SIGGRAPH Asia, which likewise shuffles around.) Over the last twenty years, the North American SIGGRAPH has favored Los Angeles, which hosted the event nine times in that period, but Vancouver won out this year. As you would expect, the maintainers of OpenGL and Vulkan are there, and they have a lot to talk about.
- NNEF 1.0 has been finalized and released!
- The first public demo of OpenXR is available and on the show floor.
- glTF Texture Transmission Extension is being discussed.
- OpenCL Ecosystem Roadmap is being discussed.
- Khronos Educators Program has launched.
I will go through each of these points. Feel free to skip around between the sections that interest you!
How deep is your learning?
Recently, we've had some hands-on time with NVIDIA's new TITAN V graphics card. Equipped with the GV100 GPU, the TITAN V has shown us some impressive results in both gaming and GPGPU compute workloads.
However, one of the most interesting areas that NVIDIA has been touting for GV100 has been deep learning. With a 1.33x increase in single-precision FP32 compute over the Titan Xp, and the addition of specialized Tensor Cores for deep learning, the TITAN V is well positioned for deep learning workflows.
In mathematics, a tensor is a multi-dimensional array of numerical values with respect to a given basis. While we won't go deep into the math behind it, tensors are a crucial data structure for deep learning applications.
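To make the "multi-dimensional array" view concrete, here is a minimal sketch using plain nested Python lists. The helper `shape` is a hypothetical name introduced for illustration, and real deep learning frameworks use their own optimized tensor types rather than nested lists.

```python
def shape(t):
    """Return the dimensions of a regularly nested list as a tuple.

    Assumes every sub-list at a given depth has the same length,
    so only the first element at each level is inspected.
    """
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0] if t else None
    return tuple(dims)


scalar = 3.0                          # rank 0: a single value
vector = [1.0, 2.0, 3.0]              # rank 1: e.g. a bias term
matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]            # rank 2: e.g. a layer's weights

# Rank 4: a batch of 2 images, each 4x4 pixels with 3 color channels.
image_batch = [[[[0.0 for _ in range(3)]
                 for _ in range(4)]
                for _ in range(4)]
               for _ in range(2)]

if __name__ == "__main__":
    for name, value in [("scalar", scalar), ("vector", vector),
                        ("matrix", matrix), ("image_batch", image_batch)]:
        dims = shape(value)
        print(f"{name}: rank {len(dims)}, shape {dims}")
```

The number of dimensions (the rank) is simply the length of the shape tuple, which is why a batch of color images is usually described as a rank-4 tensor.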
NVIDIA's Tensor Cores aim to accelerate tensor math by using half-precision FP16 values and operating on small matrices in a single step, rather than one element at a time. The GV100 GPU contains 640 of these Tensor Cores to accelerate FP16 neural network training.
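As a rough illustration of the kind of operation involved, the sketch below emulates a mixed-precision matrix multiply-accumulate, D = A × B + C, in plain Python. It is not NVIDIA's implementation or API. It relies on NVIDIA's public description of Tensor Cores as performing small matrix multiply-accumulates with FP16 inputs and higher-precision accumulation, which goes beyond what this article states. The function names `to_fp16` and `mixed_precision_fma` are hypothetical, and the accumulation here uses ordinary Python floats.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value.

    Uses the standard library's 'e' (binary16) struct format.
    Values beyond the FP16 range raise OverflowError.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mixed_precision_fma(a, b, c):
    """Compute D = A x B + C for square matrices given as nested lists.

    A and B are first rounded to FP16, mimicking half-precision inputs.
    Products are accumulated at Python float precision, standing in for
    the wider accumulator.
    """
    n = len(a)
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    return [[c[i][j] + sum(a16[i][k] * b16[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(mixed_precision_fma(a, b, c))
    # FP16 cannot represent 0.1 exactly, so rounding changes the value.
    print(to_fp16(0.1))
```

Small integers survive the FP16 rounding exactly, while values such as 0.1 do not, which is one reason keeping a wider accumulator matters when many products are summed.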
It's worth noting that this is not the first hardware built specifically for tensor operations; others, such as Google, have developed hardware for these functions.
|PC Perspective Deep Learning Testbed|
|Processor|AMD Ryzen Threadripper 1920X|
|Motherboard|GIGABYTE X399 AORUS Gaming 7|
|Memory|64GB Corsair Vengeance RGB DDR4-3000|
|Storage|Samsung SSD 960 Pro 2TB|
|Power Supply|Corsair AX1500i 1500 watt|
|OS|Ubuntu 16.04.3 LTS|
|Drivers|AMD: AMD GPU Pro 17.50|
For our NVIDIA testing, we used the NVIDIA GPU Cloud 17.12 Docker containers for both TensorFlow and Caffe2, running on our Ubuntu 16.04.3 host operating system.
For all tests, we are using the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) data set.