Subject: Graphics Cards | November 13, 2018 - 02:14 PM | Jeremy Hellstrom
Tagged: turing, RTX 4000, nvidia, HPC, autodesk
NVIDIA's newest Turing-based HPC card, the RTX 4000, has arrived with 2304 CUDA cores, 288 Tensor Cores, 36 RT Cores, and 8GB of GDDR6 on-board GPU memory. NVIDIA hasn't released any benchmarks as of yet, but does state that the new memory will offer a 40% increase in bandwidth compared to the previous P4000 and that the card can produce up to 57 TFLOPS of performance; one assumes this refers to INT8 throughput.
They are showing the card off at Autodesk; if you visit, they have set up a demo which uses the Enscape3D plugin to let you put on a VR headset, step inside a full-scale Autodesk Revit model, and make changes in real time, which would be an interesting way to work. The card will sell for ~$900, which puts it in reach of quite a few possible users and might encourage AMD to sell its Instinct MI60 and MI50 cards for a price in that ballpark.
Subject: General Tech | November 6, 2018 - 03:42 PM | Jeremy Hellstrom
Tagged: AMD Radeon Instinct, MI60, MI50, 7nm, ROCm 2.0, HPC, amd
If you haven't been watching AMD's launch of the 7nm Vega based MI60 and MI50 then you can catch up right here.
You won't be gaming with these beasts, but if you are working on deep learning, HPC, cloud computing, or rendering apps, you might want to take a deeper look. The new PCIe 4.0 cards use HBM2 ECC memory and Infinity Fabric interconnects, offering up to 1 TB/s of memory bandwidth.
The MI60 features 32GB of HBM2 and 64 Compute Units containing 4096 Stream Processors, which translates into 59 TOPS INT8, up to 29.5 TFLOPS FP16, 14.7 TFLOPS FP32, and 7.4 TFLOPS FP64. AMD claims it is currently the fastest double precision PCIe card on the market, with the 16GB Tesla V100 offering 7 TFLOPS of FP64 performance.
The MI50 is a little less powerful, though with 16GB of HBM2, 53.6 TOPS of INT8, up to 26.8 TFLOPS FP16, 13.4 TFLOPS FP32, and 6.7 TFLOPS FP64, it is no slouch.
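Those peak numbers hang together if you work backwards from the stream processor count. Here's a quick back-of-envelope sketch (assuming the usual 2 FLOPs per stream processor per clock via fused multiply-add, and the 2:1 FP16 and 1:2 FP64 rate ratios, neither of which is spelled out in AMD's announcement):

```python
# Rough sanity check on AMD's quoted MI60 peak throughput figures.
# Assumes 2 FLOPs per stream processor per clock (FMA) and the usual
# 2:1 FP16:FP32 and 1:2 FP64:FP32 rate ratios.

STREAM_PROCESSORS = 4096      # MI60: 64 CUs x 64 SPs each
FP32_TFLOPS_QUOTED = 14.7     # AMD's quoted peak FP32 figure

# Boost clock implied by the quoted FP32 peak
implied_clock_ghz = FP32_TFLOPS_QUOTED * 1e12 / (STREAM_PROCESSORS * 2) / 1e9
print(f"Implied boost clock: {implied_clock_ghz:.2f} GHz")    # ~1.79 GHz

# Peak rates at other precisions derived from the FP32 figure
print(f"FP16 peak: {FP32_TFLOPS_QUOTED * 2:.1f} TFLOPS")      # 29.4 (AMD quotes 29.5)
print(f"FP64 peak: {FP32_TFLOPS_QUOTED / 2:.2f} TFLOPS")      # 7.35 (AMD quotes 7.4)
```

The small gaps against AMD's quoted 29.5 and 7.4 TFLOPS come down to rounding.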
With two Infinity Fabric links per GPU, they can deliver up to 200 GB/s of peer-to-peer bandwidth, and you can configure up to four GPUs in a hive ring configuration (two hives in eight-GPU servers) with the help of the new ROCm 2.0 software.
Expect to see AMD in more HPC servers starting at the beginning of the new year, when they start shipping.
Subject: General Tech | October 4, 2018 - 09:58 PM | Tim Verry
Tagged: Xilinx, FPGA, hardware acceleration, big data, HPC, neural network, ai inference, inference
During the Xilinx Developer Forum in San Jose earlier this week, Xilinx showed off a server built in partnership with AMD that uses FPGA-based hardware acceleration cards to break an inference record in GoogLeNet by hitting up to 30,000 images per second in total high-performance AI inference throughput. GoogLeNet is a 22 layer deep convolutional neural network (PDF) that was started as a project for the ImageNet Large Scale Visual Recognition Challenge in 2014.
Xilinx was able to achieve such high performance while maintaining low latency by using eight of its Alveo U250 accelerator add-in cards, which use FPGAs based on its 16nm UltraScale architecture. The cards are hosted by a dual socket AMD server motherboard with two Epyc 7551 processors and eight channels of DDR4 memory. The AMD-based system has two 32-core (64-thread) Zen architecture processors (180W each) clocked at 2 GHz (2.55 GHz all-core turbo and 3 GHz maximum turbo) with 64 MB of L3 cache, memory controllers supporting up to 2TB of DDR4 per socket (341 GB/s of bandwidth in a two socket configuration), and 128 PCI-Express lanes. The Xilinx Alveo U250 cards offer up to 33.3 INT8 TOPS and feature 54MB of SRAM (38TB/s) and 64GB of off-chip memory (77GB/s). Interfaces include the PCI-E 3.0 x16 connection as well as two QSFP28 (100GbE) connections.

The cards are rated at 225W TDPs and carry a whopping $12,995 MSRP each, so the FPGA cards alone push the system well into the six-figure range before including the Epyc server CPUs, all that system memory, and the other base components. It is not likely you will see this system in your next Tesla any time soon, but it is a nice proof of concept of what future technology generations may be able to achieve at much more economical price points for AI inference tasks in everyday life (driver assistance, medical imaging, big data analytics driving market research that influences consumer pricing, etc.).
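To put that headline throughput and price tag in perspective, here is a quick back-of-envelope split across the eight cards (assuming the load scales evenly, which Xilinx did not break down):

```python
# Rough per-card breakdown of the GoogLeNet inference demo figures.
# Assumes throughput is spread evenly across the eight Alveo U250 cards.

total_images_per_sec = 30_000
num_cards = 8
card_msrp_usd = 12_995
card_tdp_watts = 225

print(f"~{total_images_per_sec / num_cards:,.0f} images/s per card")       # ~3,750
print(f"${num_cards * card_msrp_usd:,} in accelerator cards alone")        # $103,960
print(f"{num_cards * card_tdp_watts:,} W of accelerator TDP")              # 1,800 W
print(f"~{total_images_per_sec / (num_cards * card_tdp_watts):.0f} images/s per accelerator watt")
```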
Interestingly, this system may hold the current record, but it is not likely to last very long, even against Xilinx’s own hardware. Specifically, Xilinx’s Versal ACAP cards (set to release in the second half of next year) are slated to hit up to 150W TDPs (in the add-in-card models) while being up to eight times faster than Xilinx’s previous FPGAs. The Versal ACAPs will use TSMC’s 7nm FinFET node and will combine scalar processing engines (ARM CPUs), adaptable hardware engines (FPGAs with a new full software stack and much faster on-the-fly dynamic reconfiguration), and AI engines (DSPs, SIMD vector cores, and dedicated fixed function units for inference tasks) with a Network on Chip (NoC) and customizable memory hierarchy. Xilinx also has fierce competition on its hands in this huge AI/machine learning/deep neural network market, with Intel/Altera and its Stratix FPGAs, AMD and NVIDIA with their GPUs and new AI focused cores, and other specialty hardware accelerator manufacturers including Google with its TPUs. (There's also ARM's Project Trillium for mobile.) I am interested to see where the AI inference performance bar will be set by this time next year!
Subject: General Tech | May 31, 2018 - 01:41 PM | Jeremy Hellstrom
Tagged: jen-hsun huang, GTC, HPC, nvswitch, tesla v100
Jen-Hsun Huang has a busy dance card right now, with several interesting tidbits hitting the news recently, including his statement in this DigiTimes post that GPU development is outstripping Moore's Law. The GPU Technology Conference Taiwan 2018 kicked off yesterday, with NVIDIA showing off their brand new HGX-2 platform, which contains both AIs and HPCs, with Deep Learnings a sure bet as well. Buzzwords aside, the new accelerator is made up of 16 Tesla V100 GPUs, a mere half terabyte of memory, and NVIDIA's NVSwitch. Specialized products from Lenovo and Supermicro, to name a few, as well as cloud providers will also be picking up this newest piece of kit from NVIDIA.
For those less interested in HPC, there is an interesting tidbit of information about an event at Hot Chips: on August 20th, Stuart Oberman will be talking about NVIDIA’s Next Generation Mainstream GPU, with other sessions dealing with their IoT and fabric connections.
"But demand for that power is "growing, not slowing," thanks to AI, Huang said. "Before this time, software was written by humans and software engineers can only write so much software, but machines don't get tired," he said, adding that every single company in the world that develops software will need an AI supercomputer."
Here is some more Tech News from around the web:
- Asus' new motherboard can hold 20 GPUs for crypto-mining @ The Inquirer
- Internet engineers tear into United Nations' plan to move us all to IPv6 @ The Register
- Microsoft improves Xbox support staffing by not having any @ The Inquirer
Subject: Storage | March 29, 2018 - 10:43 PM | Tim Verry
Tagged: z-ssd, Z-NAND, workstation, Samsung, NVMe, M.2, HPC, enterprise
Samsung is expanding its Z-NAND based "Z-SSD" products with a new M.2 solid state drive for workstations and high-performance compute servers. Previously only available in a half-height AIC (add-in-card) form factor, the new SZ983 sports an M.2 22110 form factor and an NVMe compatible PCI-E 3.0 x4 interface. The new drive was shown off at Samsung's booth during the Open Compute Project Summit in San Jose and was spotted by Anandtech, who managed to snap a couple of photos of it.
Image credit: Anandtech spotted Samsung's M.2 Z-SSD at OCP Summit 2018.
The new M.2 Z-SSD will come in 240GB and 480GB capacities and sports an 8-channel Phoenix controller. The drive on display at OCP Summit 2018 had a part number of MZ1JB240HMGG-000FB-001. Comparing it to the SZ985 PCI-E SSD, this new M.2 drive appears to also have a DRAM cache as well as capacitors to protect data in the event of power loss (writes would be able to completely flush from the cache to the flash before safe shutdown), though we don't know if this drive has the same 1.5GB of LPDDR4 cache or not. Note that the sticker on the M.2 drive reads SZ983 while Samsung elsewhere had the M.2 labeled as the SZ985 (M.2), so it's unclear which name will stick when this actually launches, though hopefully it's the former just to avoid confusion. The Phoenix (formerly Polaris V2) controller is allegedly going to also be used on some of the higher end V-NAND drives, though we'll have to wait and see if that happens or not.
Anyway, back to performance numbers: Samsung rates the M.2 Z-SSD at 3200 MB/s sequential reads and 2800 MB/s sequential writes (so a bit slower than the SZ985 at writes). Samsung did not share random IOPS numbers. The drive is rated at the same 30 DWPD (drive writes per day) endurance as the SZ985 and will have the same 5-year warranty. I am curious whether the M.2 NVMe drive is able to hit the same (or close to the same) random IOPS numbers as the PCI-E card, which is rated at up to 750,000 read and 170,000 write IOPS.
Z-NAND is interesting as it represents a middle ground between V-NAND (and other 3D NAND flash) and 3D XPoint memory in terms of both cost and latency, with Z-NAND being closer in latency to XPoint than to V-NAND. Where it gets interesting is that Z-NAND is essentially V-NAND just run in a different mode, and yet Samsung is able to reduce write latency by 5-times (to around 16 microseconds) and cell read latency by up to 10-times (12-to-20 microseconds). While Samsung is already working on second generation Z-NAND, these drives are using first generation Z-NAND, which is the higher performance (lowest latency) type but costs quite a bit more than the 2nd generation, which is only a bit slower (more read latency). Judging by the 110mm form factor, this M.2 drive is aimed squarely at datacenter and workstation usage and is not likely to lead to a consumer Optane 800P (et al) competitor, but if it does well enough we may see some prosumer and consumer Z-NAND based options in the future with newer generations of Z-NAND as they get the right balance of cost and latency for the desktop gaming and enthusiast market.
- Samsung Introducing Z-NAND Based 800GB Z-SSD For Enterprise HPC
- FMS 2017: Samsung Announces QLC V-NAND, 16TB NGSFF SSD, Z-SSD V2, Key Value
- Samsung SZ985 Z-NAND SSD - Upcoming Competition for Intel's P4800X?
- Intel Optane SSD 800P 58GB, 118GB, and RAID Review - 3D XPoint Goes Mainstream
Subject: Storage | January 31, 2018 - 08:39 PM | Tim Verry
Tagged: z-ssd, Z-NAND, Samsung, HPC, enterprise, ai
Samsung will be introducing a new high performance solid state drive using new Z-NAND flash at ISSCC next month. The new Samsung SZ985 Z-SSD is aimed squarely at the high-performance computing (HPC) market for big data number crunching, supercomputing, AI research, and IoT application development. The new drive will come in two capacities, 800GB and 240GB, and combines low latency Z-NAND flash with a 1.5GB LPDDR4 DRAM cache and an unspecified "high performance" Samsung controller.
The Z-NAND drive is interesting because it represents an extremely fast storage solution that offers up to 10-times the cell read performance and 5-times lower write latency than 3-bit V-NAND based drives such as Samsung's own PM963 NVMe SSD. The Z-NAND technology represents a middle ground (though closer to Optane than not) between NAND and 3D XPoint memory without the expense and complexity of 3D XPoint (at least, in theory). The single port, 4-lane drive (PCI-E x4) is reportedly able to hit random read performance of 750,000 IOPS and random write performance of 170,000 IOPS. The drive is able to do this with very little latency, at around 16µs (microseconds). To put that in perspective, a traditional NVMe SSD can exhibit write latencies of around 90+ microseconds while Optane sits at around half the latency of Z-NAND (~8-10µs). You can find a comparison chart of latency percentiles of various storage technologies here.

While the press release did not go into transfer speeds or read latencies, Samsung talked about those late last year when it revealed the drive's existence. The SZ985 Z-SSD maxes out its x4 interface at 3.2 GB/s for both sequential reads and sequential writes. Further, read latencies are rated at between 12µs and 20µs. At the time, Allyn noted that the 30 drive writes per day (DWPD) rating matched that of Intel's P4800X and stated that it was an impressive feat considering Samsung is essentially running its V-NAND flash in a different mode with Z-NAND. Looking at the specs, the Samsung SZ985 Z-SSD has the same 2 million hours MTBF but is actually rated higher for endurance at 42 Petabytes over five years (versus 41 PB). Both drives appear to offer the same 5-year warranty, though we may have to wait for the ISSCC announcement for confirmation on that.
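The endurance figures check out if you run the DWPD conversion yourself; here's a quick sketch for the 800GB model (the press release doesn't show the math, and the small gap versus the rated 42 PB is presumably rounding):

```python
# Convert the 30 DWPD rating into total data written over the 5-year warranty.
# Assumes the 800GB model and 365-day years.

capacity_tb = 0.8        # 800GB drive
dwpd = 30                # drive writes per day
warranty_years = 5

total_written_pb = capacity_tb * dwpd * 365 * warranty_years / 1000
print(f"Implied endurance: ~{total_written_pb:.1f} PB")   # ~43.8 PB vs. the rated 42 PB
```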
It appears that the SZ985 offers a bit more capacity, higher random read IOPS, and better sequential performance, but with slightly more latency and lower random write IOPS than the 3D XPoint based Intel Optane P4800X drive.
In all, Samsung has an interesting drive, and if they can price it right I can see them selling a ton of these drives to the enterprise market for big data analytics tasks as well as to researchers in need of high-speed storage. I am looking forward to more information being released about the Z-SSD and its Z-NAND flash technology at the ISSCC (International Solid-State Circuits Conference) in mid-February.
Subject: Memory | January 12, 2018 - 05:46 PM | Tim Verry
Tagged: supercomputing, Samsung, HPC, HBM2, graphics cards, aquabolt
Samsung recently announced that it has begun mass production of its second generation HBM2 memory, which it is calling “Aquabolt”. Samsung has refined the design of its 8GB HBM2 packages, allowing them to achieve an impressive 2.4 Gbps per-pin data transfer rate without needing more power than its first generation 1.2V HBM2.
Reportedly, Samsung is using new TSV (through-silicon-via) design techniques and adding additional thermal bumps between dies to improve clocks and thermal control. Each 8GB HBM2 “Aquabolt” package is composed of eight 8Gb dies, each of which is vertically interconnected using 5,000 TSVs, a huge number considering how small and tightly packed these dies are. Further, Samsung has added a new protective layer at the bottom of the stack to reinforce the package’s physical strength. While the press release did not go into detail, it does mention that Samsung had to overcome challenges relating to “collateral clock skewing” as a result of the sheer number of TSVs.
On the performance front, Samsung claims that Aquabolt offers a 50% increase in per-package performance versus its first generation “Flarebolt” memory, which ran at 1.6 Gbps per pin at 1.2V. Interestingly, Aquabolt is also faster than Samsung’s 2.0 Gbps per pin HBM2 product (which needed 1.35V) without needing additional power. Samsung also compares Aquabolt to GDDR5, stating that it offers 9.6-times the bandwidth, with a single package of HBM2 at 307 GB/s versus a GDDR5 chip at 32 GB/s. Thanks to the 2.4 Gbps per pin speed, Aquabolt offers 307 GB/s of bandwidth per package, and with four packages, products such as graphics cards can take advantage of 1.2 TB/s of bandwidth.
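Those per-package and four-stack figures fall straight out of the per-pin rate and the 1,024-bit interface each HBM2 stack presents; here's a quick sketch of the arithmetic:

```python
# Per-package and four-stack bandwidth from the quoted per-pin data rates.
# Assumes the standard 1,024-bit interface per HBM2 package/stack.

bus_width_bits = 1024

def package_bandwidth_gbs(pin_rate_gbps):
    return pin_rate_gbps * bus_width_bits / 8

aquabolt = package_bandwidth_gbs(2.4)
print(f"Aquabolt: {aquabolt:.1f} GB/s per package")               # 307.2 GB/s
print(f"Four Aquabolt packages: {aquabolt * 4 / 1000:.2f} TB/s")  # ~1.23 TB/s

# Earlier 1.6 Gbps and 2.0 Gbps HBM2 parts for comparison
for rate in (1.6, 2.0):
    print(f"{rate} Gbps/pin -> {package_bandwidth_gbs(rate):.0f} GB/s per package")
```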
This second generation HBM2 memory is a decent step up in performance (original HBM hit 128 GB/s per package and first generation HBM2 hit 256 GB/s per package, or 512 GB/s and 1 TB/s respectively with four packages), but the interesting bit is that it is faster without needing more power. The increased bandwidth and data transfer speeds will be a boon to the HPC and supercomputing market and useful for working with massive databases, simulations, neural networks and AI training, and other “big data” tasks.
Aquabolt looks particularly promising for the mobile market as well, with future products succeeding the current mobile Vega GPU in Kaby Lake-G processors, Ryzen Mobile APUs, and eventually discrete Vega mobile graphics cards standing to get a nice performance boost (it’s likely too late for AMD to go with this new HBM2 on these specific products, but future refreshes or generations may be able to take advantage of it). I’m sure it will also see usage in the SoCs used in Intel’s and NVIDIA’s driverless car projects as well.
Subject: General Tech | November 30, 2017 - 12:48 AM | Tim Verry
Tagged: HPC, supercomputer, Raspberry Pi 3, cluster, research, LANL
The Raspberry Pi has been used to build cheap servers and small clusters before, but BitScope is taking the idea to the extreme with a professional enterprise solution. On display at SC17, the BitScope Raspberry Pi Cluster Module is a 6U rackable drawer that holds 144 Raspberry Pi 3 single board computers along with all of the power, networking, and air cooling needed to keep things running smoothly.
Each Cluster Module holds two and a half BitScope Blades, with each Blade holding up to 60 Raspberry Pi PCs (or other SBCs like the ODROID C2). Enthusiasts can already purchase their own Quattro Pi boards as well as the cluster plate to assemble their own small clusters, though the 6U Cluster Module drawer doesn’t appear to be for sale yet (heh). Specifically, each Cluster Module has room for 144 active nodes, six spare nodes, and one cluster manager node.
For reference, the Raspberry Pi 3 features the Broadcom BCM2837 SoC with 4 ARM Cortex-A53 cores at 1.2 GHz and a VideoCore IV GPU, paired with 1 GB of LPDDR2 memory at 900 MHz, 100 Mbps Ethernet, 802.11n Wi-Fi, and Bluetooth. The ODROID C2 has 4 Amlogic cores at 1.5 GHz, a Mali-450 GPU, 2 GB of DDR3 SDRAM, and Gigabit Ethernet. Interestingly, BitScope claims the Cluster Module uses a 10 Gigabit Ethernet SFP+ backbone, which will help when communicating between Cluster Modules, but speeds between individual nodes will be limited to gigabit at best (less in the real world, and in the case of the Pi, much less than the 100 Mbps port rating due to how the Ethernet is wired to the SoC).
BitScope is currently building a platform for Los Alamos National Laboratory that will feature five Cluster Modules for a whopping 2,880 64-bit ARM cores, 720GB of RAM, and a 10GbE SFP+ fabric backbone. Fully expanded, a 42U server cabinet holds 7 modules (1008 active nodes / 4,032 active cores) and would consume up to 6KW of power. LANL expects their 5 module setup to use around 3000 W on average though.
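The core and memory totals follow directly from the node counts; here's a quick sketch of how the LANL build and a fully expanded cabinet add up (counting only active nodes, as BitScope does, with each Pi 3 contributing 4 cores and 1 GB of RAM):

```python
# Node/core/RAM totals for BitScope Raspberry Pi Cluster Module configurations.
# Counts only the 144 active nodes per module (spares and manager excluded).

ACTIVE_NODES_PER_MODULE = 144
CORES_PER_NODE = 4        # Raspberry Pi 3: quad-core Cortex-A53
RAM_GB_PER_NODE = 1

for label, modules in (("LANL build (5 modules)", 5), ("Full 42U cabinet (7 modules)", 7)):
    nodes = modules * ACTIVE_NODES_PER_MODULE
    print(f"{label}: {nodes} nodes, {nodes * CORES_PER_NODE:,} cores, "
          f"{nodes * RAM_GB_PER_NODE} GB RAM")
# LANL build (5 modules): 720 nodes, 2,880 cores, 720 GB RAM
# Full 42U cabinet (7 modules): 1008 nodes, 4,032 cores, 1008 GB RAM
```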
What are the New Mexico Consortium and LANL planning to do with all these cores? Well, playing Crysis would prove tough even if they could SLI all those GPUs, so instead they plan to use the Raspberry Pi-powered system to model much larger and prohibitively expensive supercomputers for R&D and software development. Building out a relatively low cost and low power system enables it to be powered on and accessed by more people, including students, researchers, and programmers, who can learn and design software that runs as efficiently as possible on massive multi-core, multi-node systems. Getting software to scale out to hundreds and thousands of different nodes is tricky, especially if you want all the nodes working on the same problem(s) at once. Keeping each node fed with data, communicating amongst themselves, and returning accurate results while keeping latency low and utilization high is a huge undertaking. LANL is hoping that the Raspberry Pi based system will be the perfect testing ground for software and techniques they can then use on the big gun supercomputers like Trinity, Titan, Summit (ORNL, slated for 2018), and other smaller HPC clusters.
It is cool to see how far the Raspberry Pi has come, and while I wish the GPU was more open so that the researchers could more easily work with heterogeneous HPC coding rather than just working with the thousands of ARM cores, it is still impressive to see what is essentially a small supercomputer with a 1008-node cluster for under $25,000!
I am interested to see how the researchers at Los Alamos put it to work and the eventual improvements to HPC and supercomputing software that come from this budget cluster project!
- Intel Hopes For Exaflop Capable Supercomputers Within 10 Years
- The Next Most Powerful Supercomputer in the U.S. Is Almost Complete
- NVIDIA Launches Tesla K20X Accelerator Card, Powers Titan Supercomputer
- GTC 2013: Pedraforca Is A Power Efficient ARM + GPU Cluster For Homogeneous (GPU) Workloads
Subject: Cases and Cooling | November 20, 2017 - 10:09 PM | Tim Verry
Tagged: Supercomputing Conference, supercomputing, liquid cooling, immersion cooling, HPC, allied control, 3M
PC Gamer Hardware (formerly Maximum PC) spotted a cool immersion cooling system being shown off at the Supercomputing conference in Denver, Colorado earlier this month. Allied Control, which was recently acquired by BitFury (popular for its Bitcoin mining ASICs), was at the show with a two-phase immersion cooling system that takes advantage of 3M's Novec fluid and a water cooled condenser coil to submerge and cool high end and densely packed hardware with no moving parts and no pesky oil residue.
Nick Knupffer (@Nick_Knupffer) posted a video (embedded below) of the cooling system in action, cooling a high end processor and five graphics cards. The components are submerged in a non-flammable, non-conductive fluid that has a very low boiling point of 41°C. Interestingly, the heatsinks and fans are removed, allowing direct contact between the fluid and the chips (in this case there is a copper baseplate on the CPU, but bare ASICs can also be cooled). When the hardware is in use, heat is transferred to the liquid, which begins to boil off from a liquid to a vapor / gaseous state. The vapor rises to the surface and hits a condenser coil (which can be water cooled) that cools the gas until it turns back into a liquid and falls back into the tank. The company has previously shown off an overclocked 20 GPU (250W) plus dual Xeon system that was able to run flat out (the GPUs at 120% TDP) running deep learning as well as mining Zcash when not working on HPC projects, all while keeping the hardware well under thermal limits and not throttling. Cnet also spotted a 10 GPU system being shown off at Computex (warning: autoplay video ad!).
According to 3M, two-phase immersion cooling is extremely efficient (many times more so than air or even water) and can enable up to 95% lower cooling energy costs versus conventional air cooling. Further, hardware can be packed much more tightly, with up to 100kW per square meter versus 10kW/sq. m with air, meaning immersion cooled hardware can take up as little as a tenth of the floor space, and the heat produced can be reclaimed for datacenter building heating or other processes.
— Nick Knupffer (@Nick_Knupffer) November 14, 2017
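Back to 3M's density claim: for a fixed IT load, the quoted 100kW/m² versus 10kW/m² works out to roughly a tenth of the floor space, as a quick sketch shows (the 1 MW load here is just a hypothetical example):

```python
# Floor-area comparison for a fixed IT load under the two quoted power densities.
# The 1 MW load is a hypothetical example; assumes floor space is limited only
# by cooling density.

it_load_kw = 1000
air_kw_per_m2 = 10
immersion_kw_per_m2 = 100

air_area = it_load_kw / air_kw_per_m2              # 100 m^2
immersion_area = it_load_kw / immersion_kw_per_m2  # 10 m^2
print(f"Air cooled: {air_area:.0f} m^2, immersion cooled: {immersion_area:.0f} m^2")
print(f"Immersion needs {immersion_area / air_area:.0%} of the air-cooled floor space")
```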
Neat stuff for sure, even if it is still out of the range of home gaming PCs and mining rigs for now! Speaking of mining, BitFury plans to cool a massive 40+ MW ASIC mining farm in the Republic of Georgia using an Allied Control designed immersion cooling system (see links below)!
- Two-Phase Immersion Cooling A revolution in data center efficiency @ 3M [PDF]
- 3M, Orange Silicon Valley, Allied Control and U.S. Naval Research Laboratory Demonstrate High-Density Supercomputing at SC'17 @ 3M
- Revolutionary project built by BitFury and Allied Control to cool 40+ MW of ASIC clusters [PDF]
- Oil cooling: Deep fried, or deep energy savings? @ ExtremeTech
Subject: General Tech | August 9, 2017 - 12:43 PM | Jeremy Hellstrom
Tagged: nvidia, autonomous vehicles, HPC
NVIDIA has previously shown their interest in providing the brains for autonomous vehicles; their Xavier chip is scheduled for release some time towards the end of the year. They are continuing their efforts to break into this market by investing in startups through a program called GPU Ventures. Today DigiTimes reports that NVIDIA purchased a stake in a Chinese company called Tusimple which is developing autonomous trucks. The transportation of goods may not be as interesting to the average consumer as self-driving cars, but the market could be more lucrative; there are a lot of trucks on the roads of the world and they are unlikely to be replaced any time soon.
"Tusimple, a Beijing-based startup focused on developing autonomous trucks, has disclosed that Nvidia will make a strategic investment to take a 3% stake in the company. Nvidia's investment is part of a a Series B financing round, Tusimple indicated."
Here is some more Tech News from around the web:
- Microsoft launches Outlook.com beta because it's not Gmail or Yahoo @ The Inquirer
- Intel will unveil 8th-gen 'Coffee Lake' processors on 21 August @ The Inquirer
- Microsoft Dumps Notorious Chinese Secure Certificate Vendor @ Slashdot
- It's 2017 and Hyper-V can be pwned by a guest app, Windows by a search query, Office by... @ The Register
- Core-blimey! Intel's Core i9 18-core monster – the numbers @ The Register