Subject: General Tech | October 4, 2018 - 09:58 PM | Tim Verry
Tagged: Xilinx, FPGA, hardware acceleration, big data, HPC, neural network, ai inference, inference
During the Xilinx Developer Forum in San Jose earlier this week, Xilinx showed off a server built in partnership with AMD that uses FPGA-based hardware acceleration cards to break an inference record in GoogLeNet by hitting up to 30,000 images per second in total high-performance AI inference throughput. GoogLeNet is a 22 layer deep convolutional neural network (PDF) that was started as a project for the ImageNet Large Scale Visual Recognition Challenge in 2014.
Xilinx was able to achieve such high performance while maintaining low latency windows by using eight of its Alveo U250 acceleration add-in cards, which use FPGAs based on its 16nm UltraScale architecture. The cards are hosted by a dual socket AMD server motherboard with two Epyc 7551 processors and eight channels of DDR4 memory. The AMD-based system has two 32 core (64 thread) Zen architecture processors (180W), each clocked at 2 GHz (2.55 GHz all core turbo and 3 GHz maximum turbo) with 64 MB of L3 cache, memory controllers supporting up to 2TB of DDR4 memory per socket (341 GB/s of bandwidth in a two socket configuration), and 128 PCI-Express lanes.
The Xilinx Alveo U250 cards offer up to 33.3 INT8 TOPS and feature 54MB of SRAM (38TB/s) and 64GB of off-chip memory (77GB/s). Interfaces include the PCI-E 3.0 x16 connection as well as two QSFP28 (100GbE) connections. The cards are rated at 225W TDPs and cost a whopping $12,995 MSRP each. The FPGA cards alone push the system well into the six-figure range before including the Epyc server CPUs, all that system memory, and the other base components.
You are not likely to see this system in your next Tesla any time soon, but it is a nice proof of concept of what future technology generations may be able to achieve at much more economical price points when used for AI inference tasks in everyday life (driver assistance, medical imaging, big data analytics driving market research that influences consumer pricing, etc).
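To put the "six-figure" claim in perspective, here is a rough back-of-the-envelope sketch using only the figures quoted above. The even per-card split of the 30,000 images per second is an assumption made for illustration (the article only gives the system total), and the names are hypothetical.

```python
# Figures quoted in the article for the Xilinx/AMD demo system.
NUM_CARDS = 8
CARD_MSRP_USD = 12_995
CARD_INT8_TOPS = 33.3
CARD_TDP_W = 225
SYSTEM_IMAGES_PER_SEC = 30_000


def system_totals(num_cards=NUM_CARDS):
    """Aggregate the per-card figures across all accelerator cards."""
    return {
        "card_cost_usd": num_cards * CARD_MSRP_USD,
        "peak_int8_tops": num_cards * CARD_INT8_TOPS,
        "card_power_w": num_cards * CARD_TDP_W,
        # Assumption: throughput is spread evenly across the cards.
        "images_per_sec_per_card": SYSTEM_IMAGES_PER_SEC / num_cards,
    }


totals = system_totals()
for name, value in totals.items():
    print(f"{name}: {value:,.1f}")
```

The eight cards alone come to $103,960 at MSRP, which is where the six-figure estimate comes from before any CPUs or memory are counted.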
Interestingly, this system may hold the current record, but it is not likely to last very long even against Xilinx’s own hardware. Specifically, Xilinx’s Versal ACAP cards (set to release in the second half of next year) are slated to hit up to 150W TDPs (in the add-in-card models) while being up to eight times faster than Xilinx’s previous FPGAs. The Versal ACAPs will use TSMC's 7nm FinFET node and will combine scalar processing engines (ARM CPUs), adaptable hardware engines (FPGAs with a new full software stack and much faster on-the-fly dynamic reconfiguration), and AI engines (DSPs, SIMD vector cores, and dedicated fixed function units for inference tasks) with a Network on Chip (NoC) and customizable memory hierarchy. Xilinx also has fierce competition on its hands in this huge AI/machine learning/deep neural network market from Intel/Altera and its Stratix FPGAs, AMD and NVIDIA with their GPUs and new AI focused cores, and other specialty hardware accelerator manufacturers including Google with its TPUs. (There's also ARM's Project Trillium for mobile.) I am interested to see where the AI inference performance bar will be set by this time next year!
Subject: Storage | October 5, 2017 - 01:37 AM | Tim Verry
Tagged: western digital, SMR, hgst, HelioSeal, big data, 14tb
Western Digital is raising the enterprise hard drive stakes once again with the announcement of a 14 TB 3.5” hard drive. The HGST branded Ultrastar Hs14 uses fourth generation HelioSeal and second generation host-managed SMR (shingled magnetic recording) to enable a 14 TB drive that is just as fast as its smaller capacity enterprise predecessors despite the impressive 1034 Gb/sq in areal density. Western Digital claims the new hard drive offers 40% more capacity and twice the sequential write performance of its previous SMR drives.
The 3.5” SMR hard drive comes in SATA 6 Gbps and SAS 12 Gbps flavors. Both are equipped with 512 MB of cache, operate at 7200 RPM, and support maximum sustained transfer speeds of 233 MB/s. The enterprise drive is geared towards sequential writes and is intended to be the storage target for big data applications like Facebook, video streaming services, and research and financial workloads that generate absolutely massive amounts of raw data that need to sit in archival storage but remain easily accessible (where tape is not as desirable). According to the data sheet (PDF), it is also aimed at bulk cloud storage and online backup as well as businesses storing compliance, audit, and regulatory records.
For those curious about Shingled Magnetic Recording (SMR), Allyn shared some thoughts on the technology here.
Western Digital rates the drive at 550 TB/year and supports the Hs14 with a five year warranty. The drive is currently being sampled to a small number of OEMs with wider availability to follow.
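For a sense of scale, the quoted capacity, transfer rate, and workload rating can be related with some simple arithmetic. This is a rough sketch that assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and that the drive sustains its maximum rate the whole time, which a real drive would not; the function names are made up for illustration.

```python
# Figures quoted in the article for the Ultrastar Hs14.
CAPACITY_TB = 14
MAX_SUSTAINED_MB_S = 233
WORKLOAD_TB_PER_YEAR = 550

SECONDS_PER_YEAR = 365 * 24 * 3600


def fill_time_hours(capacity_tb=CAPACITY_TB, rate_mb_s=MAX_SUSTAINED_MB_S):
    """Best-case hours to write the whole drive sequentially."""
    seconds = (capacity_tb * 1e12) / (rate_mb_s * 1e6)
    return seconds / 3600


def capacity_turnovers_per_year(workload_tb=WORKLOAD_TB_PER_YEAR,
                                capacity_tb=CAPACITY_TB):
    """How many times the rated yearly workload covers the full capacity."""
    return workload_tb / capacity_tb


def average_rate_mb_s(workload_tb=WORKLOAD_TB_PER_YEAR):
    """Average transfer rate implied by the yearly workload rating."""
    return workload_tb * 1e12 / SECONDS_PER_YEAR / 1e6


print(f"Best-case fill time: {fill_time_hours():.1f} hours")
print(f"Capacity turnovers per year: {capacity_turnovers_per_year():.1f}")
print(f"Average rate at rated workload: {average_rate_mb_s():.1f} MB/s")
```

Under these assumptions the drive takes roughly 16.7 hours to fill end to end, and the 550 TB/year rating works out to a little over 39 times the drive's capacity per year.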
Subject: Memory | February 3, 2017 - 08:42 PM | Tim Verry
Tagged: XPoint, server, Optane, Intel Optane, Intel, big data
Last week Hexus reported that Intel has begun shipping Optane memory modules to its partners for testing. This year should see the launch of both these enterprise products designed for servers as well as tiny application accelerator M.2 solid state drives based on the Intel and Micron joint 3D memory venture. The modules that Intel is shipping are the former type of Optane memory and will be able to replace DDR4 DIMMs (RAM) with a memory solution that is not as fast but is cheaper and has much larger storage capacities. The Optane modules are designed to slot into DDR4 type memory slots on server boards.
The benefit of such a product lies in big data and scientific workloads where massive datasets can be held in primary memory and the processor(s) can access them at much lower latencies than if they had to reach out to mass storage on spinning rust or even SAS or PCI-E solid state drives. Holding all the data being worked on in one pool of memory will also be cheaper with Optane, as it is allegedly priced closer to NAND than RAM and the cost of RAM adds up extremely quickly when you need many terabytes of it (or more!). Various technologies attempting to bring higher capacity non-volatile and/or flash-based storage to memory module form have been theorized or in the works for years now, but it appears that Intel will be the first to roll out actual products.
It will likely be years before the technology trickles down to consumer desktops and notebooks, so slapping what would effectively be a cheap RAM disk into your PC is still a ways out. Consumers will get a small taste of Optane memory in the form of tiny storage drives that were rumored for a first quarter 2017 release following the launch of Intel's Kaby Lake Z270 motherboards. Previous leaks suggest that the Intel Optane Memory 8000P would come in 16 GB and 32 GB capacities in an M.2 form factor. With a single 128 Gb (16 GB) die, Intel is able to hit speeds that current NAND flash based SSDs can only hit with multiple dies. Specifically, the 16GB Optane application accelerator drive is allegedly capable of 285,000 random read 4K IOPS, 70,000 random write 4K IOPS, sequential 128K reads of 1400 MB/s, and sequential 128K writes of 300 MB/s. The 32GB Optane drive is a bit faster at 300,000 random read 4K IOPS, 120,000 random write 4K IOPS, 1600 MB/s sequential reads, and 500 MB/s sequential writes.
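The leaked IOPS figures are easier to compare with the sequential numbers once they are converted to throughput. The sketch below assumes a "4K" operation means 4,096 bytes and that MB means 10^6 bytes; neither is stated in the leak, and the dictionary layout is purely illustrative.

```python
BLOCK_BYTES_4K = 4096  # assumption: "4K" means 4 KiB per operation


def iops_to_mb_s(iops, block_bytes=BLOCK_BYTES_4K):
    """Throughput in MB/s implied by an IOPS figure at a fixed block size."""
    return iops * block_bytes / 1e6


# Rumored figures quoted in the article: random 4K IOPS, sequential MB/s.
drives = {
    "16GB": {"rand_read_iops": 285_000, "rand_write_iops": 70_000,
             "seq_read_mb_s": 1400, "seq_write_mb_s": 300},
    "32GB": {"rand_read_iops": 300_000, "rand_write_iops": 120_000,
             "seq_read_mb_s": 1600, "seq_write_mb_s": 500},
}

for name, d in drives.items():
    rr = iops_to_mb_s(d["rand_read_iops"])
    rw = iops_to_mb_s(d["rand_write_iops"])
    print(f"{name}: random read ~{rr:.0f} MB/s vs {d['seq_read_mb_s']} MB/s sequential, "
          f"random write ~{rw:.0f} MB/s vs {d['seq_write_mb_s']} MB/s sequential")
```

On these assumptions the random 4K figures land within striking distance of the sequential ones, which is unusual for NAND-based drives of this size and is much of what makes the leak interesting.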
Unfortunately, I do not have any numbers on how fast the Optane memory that will slot into the DDR4 slots will be, but seeing as two dies already max out the x2 PCI-E link they use in the M.2 Optane SSD, a dual sided memory module packed with rows of Optane dies on the significantly wider memory bus is very promising. Performance should land closer to (but below) DDR4 while being much faster than NAND flash, and the memory is still non-volatile (it doesn't need constant power to retain the data).
I am interested to see what the final numbers are for Intel's Optane RAM and Optane storage drives. The company has certainly dialed down the hype for the technology as it approached fruition, though that may have more to do with what Intel is able to do right now than with what the 3D XPoint memory technology itself is potentially capable of enabling. I look forward to what it will enable in the HPC market and eventually what will be possible for the desktop and gaming markets.
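As a rough check on the link ceiling mentioned above, the raw bandwidth of an x2 link can be estimated as follows. The article does not state the PCI-E generation, so this sketch assumes PCI-E 3.0 (8 GT/s per lane with 128b/130b encoding) and ignores packet and protocol overhead, which lowers the usable figure further.

```python
def pcie_link_gb_s(lanes, gt_per_s=8.0, encoding=128 / 130):
    """Raw one-direction link bandwidth in GB/s.

    Defaults assume PCI-E 3.0: 8 GT/s per lane and 128b/130b line encoding.
    Protocol overhead (headers, flow control) is NOT included.
    """
    return lanes * gt_per_s * encoding / 8  # divide by 8: bits -> bytes


X2_LINK_GB_S = pcie_link_gb_s(lanes=2)

# Rumored sequential read speed of the two-die 32GB drive, from the article.
SEQ_READ_32GB_GB_S = 1.6

utilization = SEQ_READ_32GB_GB_S / X2_LINK_GB_S
print(f"x2 link raw bandwidth: {X2_LINK_GB_S:.2f} GB/s")
print(f"32GB drive sequential read uses {utilization:.0%} of the raw link")
```

With those assumptions the raw link tops out just under 2 GB/s, so the rumored 1600 MB/s sequential read is already a large share of what an x2 connection can carry.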
What are your thoughts on Intel and Micron's 3D XPoint memory and Intel's Optane implementation (Micron's implementation is QuantX)?
- IDF 2016: Intel To Demo Optane XPoint, Announces Optane Testbed for Enterprise Customers
- Intel Optane (XPoint) First Gen Product Specifications Leaked
- Intel Z270 Express and H270 Express Chipsets Support Kaby Lake, More PCI-E 3.0 Lanes
Subject: Graphics Cards | April 5, 2016 - 02:13 AM | Tim Verry
Tagged: HPC, hbm, gpgpu, firepro s9300x2, firepro, dual fiji, deep learning, big data, amd
Earlier this month AMD launched a dual Fiji powerhouse for VR gamers it is calling the Radeon Pro Duo. Now, AMD is bringing its latest GCN architecture and HBM memory to servers with the dual GPU FirePro S9300 x2.
The new server-bound professional graphics card packs an impressive amount of computing hardware into a dual-slot card with passive cooling. The FirePro S9300 x2 combines two full Fiji GPUs clocked at 850 MHz for a total of 8,192 cores, 512 texture units, and 128 ROPs. Each GPU is paired with 4GB of on-package, non-ECC HBM memory with 512GB/s of memory bandwidth, and AMD adds the two together to advertise this as the first professional graphics card with 1TB/s of memory bandwidth.
Due to lower clockspeeds, the S9300 x2 has less peak single precision compute performance than the consumer Radeon Pro Duo, at 13.9 TFLOPS versus 16 TFLOPS on the desktop card. Businesses will be able to cram more cards into their rack mounted servers, though, since they do not need to worry about mounting locations for the sealed loop water cooling of the Radeon card.
| |FirePro S9300 x2|Radeon Pro Duo|R9 Fury X|FirePro S9170|
|---|---|---|---|---|
|GPU|Dual Fiji|Dual Fiji|Fiji|Hawaii|
|GPU Cores|8192 (2 x 4096)|8192 (2 x 4096)|4096|2816|
|Rated Clock|850 MHz|1050 MHz|1050 MHz|930 MHz|
|Texture Units|2 x 256|2 x 256|256|176|
|ROP Units|2 x 64|2 x 64|64|64|
|Memory|8GB (2 x 4GB)|8GB (2 x 4GB)|4GB|32GB ECC|
|Memory Clock|500 MHz|500 MHz|500 MHz|5000 MHz|
|Memory Interface|4096-bit (HBM) per GPU|4096-bit (HBM) per GPU|4096-bit (HBM)|512-bit|
|Memory Bandwidth|1TB/s (2 x 512GB/s)|1TB/s (2 x 512GB/s)|512 GB/s|320 GB/s|
|TDP|300 watts|?|275 watts|275 watts|
|Peak Compute (single precision)|13.9 TFLOPS|16 TFLOPS|8.60 TFLOPS|5.24 TFLOPS|
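The peak compute and memory bandwidth rows in the table can be reproduced from the core counts, clocks, and bus widths. The sketch below uses the usual rule of thumb of two floating point operations per core per clock (one fused multiply-add), treats the 500 MHz HBM clock as double data rate, and treats the S9170's 5000 MHz as an effective data rate; those conventions are assumptions on my part and are not spelled out in the article.

```python
def peak_fp32_tflops(cores, clock_mhz, flops_per_core_per_clock=2):
    """Peak single precision TFLOPS, assuming one FMA (2 FLOPs) per core per clock."""
    return cores * clock_mhz * 1e6 * flops_per_core_per_clock / 1e12


def memory_bandwidth_gb_s(bus_bits, effective_mt_s):
    """Memory bandwidth in GB/s from bus width and effective transfer rate (MT/s)."""
    return bus_bits / 8 * effective_mt_s * 1e6 / 1e9


# (cores, rated clock in MHz, listed peak TFLOPS) taken from the table.
cards = {
    "FirePro S9300 x2": (8192, 850, 13.9),
    "Radeon Pro Duo": (8192, 1050, 16.0),
    "R9 Fury X": (4096, 1050, 8.60),
    "FirePro S9170": (2816, 930, 5.24),
}

for name, (cores, clock, listed) in cards.items():
    print(f"{name}: computed {peak_fp32_tflops(cores, clock):.2f} TFLOPS, listed {listed}")

# HBM: 4096-bit bus per GPU at 500 MHz, double data rate -> 1000 MT/s (assumed).
print(f"Fiji HBM per GPU: {memory_bandwidth_gb_s(4096, 1000):.0f} GB/s")
# S9170: 512-bit bus at 5000 MT/s effective (assumed reading of '5000 MHz').
print(f"FirePro S9170: {memory_bandwidth_gb_s(512, 5000):.0f} GB/s")
```

The formula matches the listed figures for the S9300 x2, Fury X, and S9170. The Radeon Pro Duo is the odd one out: at the 1050 MHz listed in the table it would work out to roughly 17.2 TFLOPS, so the 16 TFLOPS figure implies a clock closer to 1000 MHz.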
AMD is aiming this card at datacenter and HPC users working on "big data" tasks that do not require the accuracy of double precision floating point calculations. Deep learning tasks, seismic processing, and data analytics are all examples AMD says the dual GPU card will excel at. These are all tasks that can be greatly accelerated by the massively parallel nature of a GPU but do not need to be as precise as stricter mathematics, modeling, and simulation work that depends on FP64 performance. In that respect, the FirePro S9300 x2 has only 870 GFLOPS of double precision compute performance.
Further, this card supports a GPGPU optimized Linux driver stack called GPUOpen, and developers can program for it using either OpenCL (it supports OpenCL 1.2) or C++. AMD PowerTune and the return of FP16 support are also features. AMD claims that its new dual GPU card is twice as fast as the NVIDIA Tesla M40 (1.6x the K80) and 12 times as fast as the latest Intel Xeon E5 in peak single precision floating point performance.
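The double precision figure follows from the single precision one if Fiji runs FP64 at one sixteenth of its FP32 rate. That ratio is not given in the article, so treat it as an assumption in this small sketch.

```python
PEAK_FP32_TFLOPS = 13.9      # single precision figure quoted in the article
FP64_RATE = 1 / 16           # assumption: FP64 runs at 1/16 the FP32 rate on Fiji


def peak_fp64_gflops(fp32_tflops=PEAK_FP32_TFLOPS, fp64_rate=FP64_RATE):
    """Peak double precision GFLOPS derived from the FP32 peak and the FP64 ratio."""
    return fp32_tflops * 1000 * fp64_rate


print(f"Estimated FP64 peak: {peak_fp64_gflops():.0f} GFLOPS")
```

Under that assumption the estimate comes out at roughly 870 GFLOPS, in line with the figure quoted for the card.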
The double slot card is powered by two PCI-E power connectors and is rated at 300 watts. This is a bit more palatable than the triple 8-pin needed for the Radeon Pro Duo!
The FirePro S9300 x2 comes with a 3 year warranty and will be available in the second half of this year for $6000 USD. You are definitely paying a premium for the professional certifications and support. Here's hoping developers come up with some cool uses for the dual 8.9 billion transistor GPUs and their included HBM memory!