Subject: Cases and Cooling | November 20, 2017 - 10:09 PM | Tim Verry
Tagged: Supercomputing Conference, supercomputing, liquid cooling, immersion cooling, HPC, allied control, 3M
PC Gamer Hardware (formerly Maximum PC) spotted a cool immersion cooling system being shown off at the SuperComputing conference in Denver, Colorado earlier this month. Allied Control who was recently acquired by BitFury (popular for its Bitcoin mining ASICs) was at the show with a two phase immersion cooling system that takes advantage of 3M's Novec fluid and a water cooled condesor coil to submerge and cool high end and densely packed hardware with no moving parts and no pesky oil residue.
Nick Knupffer (@Nick_Knupffer) posted a video (embedded below) of the cooling system in action cooling a high end processor and five graphics cards. The components are submerged in a non-flamable, non-conductive fluid that has a very low boiling point of 41°C. Interestingly, the heatsinks and fans are removed allowing for direct contact between the fluid and the chips (in this case there is a copper baseplate on the CPU but bare ASICs can also be cooled). When the hardware is in use, heat is transfered to the liquid which begins to boil off from a liquid to a vapor / gaseous state. The vapor rises to the surface and hits a condensor coil (which can be water cooled) that cools the gas until it turns back into a liquid and falls back into the tank. The company has previously shown off an overclocked 20 GPU (250W) plus dual Xeon system that was able to run flat out (The GPUs at 120% TDP) running deep learning as well as mining Z-Cash when not working on HPC projects while keeping all the hardware well under thermal limits and not throttling. Cnet also spotted a 10 GPU system being shown off at Computex (warning autoplay video ad!).
According to 3M, two phase immersion cooling is extremely efficient (many times more than air or even water) and can enable up to 95% lower energy cooling costs versus conventional air cooling. Further, hardware can be packed much more tightly with up to 100kW/square meter versus 10kW/sq. m with air meaning immersion cooled hardware can take up to 10% less floor space and the heat produced can be reclaimed for datacenter building heating or other processes.
— Nick Knupffer (@Nick_Knupffer) November 14, 2017
Neat stuff for sure even if it is still out of the range of home gaming PCs and mining rigs for now! Speaking of mining BitFury plans to cool a massive 40+ MW ASIC mining farm in the Republic of Georgia using an Allied Control designed immersion cooling system (see links below)!
- Two-Phase Immersion Cooling A revolution in data center efficiency @ 3M [PDF]
- 3M, Orange Silicon Valley, Allied Control and U.S. Naval Research Laboratory Demonstrate High-Density Supercomputing at SC'17 @ 3M
- Revolutionary project built by BitFury and Allied Control to cool 40+ MW of ASIC clusters [PDF]
- Oil cooling: Deep fried, or deep energy savings? @ ExtremeTech
Subject: General Tech | August 9, 2017 - 12:43 PM | Jeremy Hellstrom
Tagged: nvidia, autonomous vehicles, HPC
NVIDIA has previously shown their interest in providing the brains for autonomous vehicles, their Xavier chip is scheduled for release some time towards the end of the year. They are continuing their efforts to break into this market by investing in start ups in a program called GPU Ventures. Today DigiTimes reports that NVIDIA purchased a stake in a Chinese company called Tusimple which is developing autonomous trucks. The transportation of goods may not be as interesting to the average consumer as self driving cars but the market could be more lucrative; there are a lot of trucks on the roads of the world and they are unlikely to be replaced any time soon.
"Tusimple, a Beijing-based startup focused on developing autonomous trucks, has disclosed that Nvidia will make a strategic investment to take a 3% stake in the company. Nvidia's investment is part of a a Series B financing round, Tusimple indicated."
Here is some more Tech News from around the web:
- Microsoft launches Outlook.com beta because it's not Gmail or Yahoo @ The Inquirer
- Intel will unveil 8th-gen 'Coffee Lake' processors on 21 August @ The Inquirer
- Microsoft Dumps Notorious Chinese Secure Certificate Vendor @ Slashdot
- It's 2017 and Hyper-V can be pwned by a guest app, Windows by a search query, Office by... @ The Register
- Core-blimey! Intel's Core i9 18-core monster – the numbers @ The Register
Subject: Graphics Cards | June 27, 2017 - 06:51 PM | Jeremy Hellstrom
Tagged: Vega FE, Vega, HPC, amd
AMD have released their new HPC card, the Radeon Vega Frontier Edition, which Jim told you about earlier this week. The air cooled version is available now, with an MSRP of $999USD followed by a water-cooled edition arriving in Q3 with price tag of $1499.
The specs they list for the cards are impressive and compare favourably to NVIDIA's P100 which is the card AMD tested against, offering higher TFLOPS for both FP32 and FP16 operations though the memory bandwidth lags a little behind.
|Peak/Boost Clock||1600 MHz||1442 MHz|
|FP32 TFLOPS (SP)||13.1||10.3|
|FP64 TFLOPS (DP)||
|Memory Interface||1.89 Gb/s
|Memory Bandwidth||483 GB/s||716 GB/s|
|Memory Size||16GB HBC*||16GB|
|TDP||300 W air, 375 W water||235 W|
The memory size for the Vega is interesting, HBC is AMDs High Bandwidth Cache Controller which not only uses the memory cache more effectively but is able to reach out to other high performance system memory for help. AMD states that the Radeon Vega Frontier Edition has the capability of expanding traditional GPU memory to 256TB; perhaps allowing new texture mods for Skyrim or Fallout! Expect to see more detail on this feature once we can get our hands on a card to abuse, nicely of course.
AMD used the DeepBench Benchmark to provide comparative results, the AMD Vega FE system used a dual socketed system with Xeon E5 2640v4s @ 2.4Ghz 10C/20T, 32GB DDR4 per socket, on Ubuntu 16.04 LTS with ROCm 1.5, and OpenCL 1.2, the NVIDIA Tesla P100 system used the same hardware with the CuDNN 5.1, Driver 375.39 and Cuda version 8.0.61 drivers. Those tests showed the AMD system completing the benchmark in 88.7ms, the Tesla P100 completed in 133.1 ms, quite an impressive lead for AMD. Again, there will be much more information on performance once the Vega FE can be tested.
Read on to hear about the new card in AMD's own words, with links to their sites.
Subject: General Tech, Graphics Cards | May 27, 2017 - 12:18 AM | Tim Verry
Tagged: vision fund, softbank, nvidia, iot, HPC, ai
SoftBank, the Tokyo, Japan based Japanese telecom and internet technology company has reportedly quietly amassed a 4.9% stake in graphics chip giant NVIDIA. Bloomberg reports that SoftBank has carefully invested $4 billion into NVIDIA avoiding the need to get regulatory approval in the US by keeping its investment under 5% of the company. SoftBank has promised the current administration that it will invest $50 billion into US tech companies and it seems that NVIDIA is the first major part of that plan.
NVIDIA's Tesla V100 GPU.
Led by Chairman and CEO Masayoshi Son, SoftBank is not afraid to invest in technology companies it believes in with major past acquisitions and investments in companies like ARM Holdings, Sprint, Alibaba, and game company Supercell.
The $4 billion-dollar investment makes SoftBank the fourth largest shareholder in NVIDIA, which has seen the company’s stock rally from SoftBank’s purchases and vote of confidence. The (currently $93) $100 billion Vision Fund may also follow SoftBank’s lead in acquiring a stake in NVIDIA which is involved in graphics, HPC, AI, deep learning, and gaming.
Overall, this is good news for NVIDIA and its shareholders. I am curious what other plays SoftBank will make for US tech companies.
What are your thoughts on SoftBank investing heavily in NVIDIA?
Subject: General Tech, Memory, Storage | May 26, 2017 - 10:14 PM | Tim Verry
Tagged: XPoint, Intel, HPC, DIMM, 3D XPoint
Intel recently teased a bit of new information on its 3D XPoint DIMMs and launched its first public demonstration of the technology at the SAP Sapphire conference where SAP’s HANA in-memory data analytics software was shown working with the new “Intel persistent memory.” Slated to arrive in 2018, the new Intel DIMMs based on the 3D XPoint technology developed by Intel and Micron will work in systems alongside traditional DRAM to provide a pool of fast, low latency, and high density nonvolatile storage that is a middle ground between expensive DDR4 and cheaper NVMe SSDs and hard drives. When looking at the storage stack, the storage density increases along with latency as it gets further away from the CPU. The opposite is also true, as storage and memory gets closer to the processor, bandwidth increases, latency decreases, and costs increase per unit of storage. Intel is hoping to bridge the gap between system DRAM and PCI-E and SATA storage.
According to Intel, system RAM offers up 10 GB/s per channel and approximately 100 nanoseconds of latency. 3D XPoint DIMMs will offer 6 GB/s per channel and about 250 nanoseconds of latency. Below that is the 3D XPoint-based NVMe SSDs (e.g. Optane) on a PCI-E x4 bus where they max out the bandwidth of the bus at ~3.2 GB/s and 10 microseconds of latency. Intel claims that non XPoint NVMe NAND solid state drives have around 100 microsecomds of latency, and of course, it gets worse from there when you go to NAND-based SSDs or even hard drives hanging of the SATA bus.
Intel’s new XPoint DIMMs have persistent storage and will offer more capacity that will be possible and/or cost effective with DDR4 DRAM. In giving up some bandwidth and latency, enterprise users will be able to have a large pool of very fast storage for storing their databases and other latency and bandwidth sensitive workloads. Intel does note that there are security concerns with the XPoint DIMMs being nonvolatile in that an attacker with physical access could easily pull the DIMM and walk away with the data (it is at least theoretically possible to grab some data from RAM as well, but it will be much easier to grab the data from the XPoint sticks. Encryption and other security measures will need to be implemented to secure the data, both in use and at rest.
Interestingly, Intel is not positioning the XPoint DIMMs as a replacement for RAM, but instead as a supplement. RAM and XPoint DIMMs will be installed in different slots of the same system and the DDR4 RAM will be used for the OS and system critical applications while the XPoint pool of storage will be used for storing data that applications will work on much like a traditional RAM disk but without needing to load and save the data to a different medium for persistent storage and offering a lot more GBs for the money.
While XPoint is set to arrive next year along with Cascade Lake Xeons, it will likely be a couple of years before the technology takes off. Supporting it is going to require hardware and software support for the workstations and servers as well as developers willing to take advantage of it when writing their specialized applications. Fortunately, Intel started shipping the memory modules to its partners for testing earlier this year. It is an interesting technology and the DIMM solution and direct CPU interface will really let the 3D XPoint memory shine and reach its full potential. It will primarily be useful for the enterprise, scientific, and financial industries where there is a huge need for faster and lower latency storage that can accommodate massive (multiple terabyte+) data sets that continue to get larger and more complex. It is a technology that likely will not trickle down to consumers for a long time, but I will be ready when it does. In the meantime, I am eager to see what kinds of things it will enable the big data companies and researchers to do! Intel claims it will not only be useful at supporting massive in-memory databases and accelerating HPC workloads but for things like virtualization, private clouds, and software defined storage.
What are your thoughts on this new memory tier and the future of XPoint?
- Intel Has Started Shipping Optane Memory Modules
- Intel Optane Memory 32GB Review - Faster Than Lightning
- A Closer Look at Intel's Optane SSD DC P4800X Enterprise SSD Performance
Subject: Processors | March 7, 2017 - 09:02 AM | Tim Verry
Tagged: SoC, server, ryzen, opteron, Naples, HPC, amd
Over the summer, AMD introduced its Naples platform which is the server-focused implementation of the Zen microarchitecture in a SoC (System On a Chip) package. The company showed off a prototype dual socket Naples system and bits of information leaked onto the Internet, but for the most part news has been quiet on this front (whereas there were quite a few leaks of Ryzen which is AMD's desktop implementation of Zen).
The wait seems to be finally over, and AMD appears ready to talk more about Naples which will reportedly launch in the second quarter of this year (Q2'17) with full availability of processors and motherboards from OEMs and channel partners (e.g. system integrators) happening in the second half of 2017. Per AMD, "Naples" processors are SoCs with 32 cores and 64 threads that support 8 memory channels and a (theoretical) maximum of 2TB DDR4-2667. (Using the 16GB DIMMs available today, Naples support 256GB of DDR4 per socket.) Further, the Naples SoC features 64 PCI-E 3.0 lanes. Rumors also indicated that the SoC included support for sixteen 10GbE interfaces, but AMD has yet to confirm this or the number of SATA/SAS ports offered. AMD did say that Naples has an optimized cache structure for HPC compute and "dedicated security hardware" though it did not go into specifics. (The security hardware may be similar to the ARM TrustZone technology it has used in the past.)
Naples will be offered in single and dual socket designs with dual socket systems offering up 64 cores, 128 threads, 32 DDR4 DIMMs (512 GB using 16 GB modules) on 16 total memory channels with 21.3 GB/s per channel bandwidth (170.7 GB/s per SoC), 128 PCI-E 3.0 lanes, and an AMD Infinity Fabric interconnect between the two processor sockets.
AMD claims that its Naples platform offers up to 45% more cores, 122% more memory bandwidth, and 60% more I/O than its competition. For its internal comparison, AMD chose the Intel Xeon E5-2699A V4 which is the processor with highest core count that is intended for dual socket systems (there are E7s with more cores but those are in 4P systems). The Intel Xeon E5-2699A V4 system is a 14nm 22 core (44 thread) processor clocked at 2.4 GHz base to 3.6 GHz turbo with 55MB cache. It supports four channels of DDR4-2400 for a maximum bandwidth of 76.8 GB/s (19.2 GB/s per channel) as well as 40 PCI-E 3.0 lanes. A dual socket system with two of those Xeons features 44 cores, 88 threads, and a theoretical maximum of 1.54 TB of ECC RAM.
AMD's reference platform with two 32 core Naples SoCs and 512 GB DDR4 2400 MHz was purportedly 2.5x faster at the seismic analysis workload than the dual Xeon E5-2699A V4 OEM system with 1866 MHz DDR4. Curiously, when AMD compared a Naples reference platform with 44 cores enabled and running 1866 MHz memory to a similarly configured Intel system the Naples platform was twice as fast. It seems that the increased number of memory channels and memory bandwidth are really helping the Naples platform pull ahead in this workload.
AMD further claims that its Naples platform is more balanced and suited to cloud computing and scientific and HPC workloads than the competition. Specifically, Forrest Norrod the Senior Vice president and General Manager of AMD's Enterprise, Embedded, and Semi-Custom Business Unit stated:
“’Naples’ represents a completely new approach to supporting the massive processing requirements of the modern datacenter. This groundbreaking system-on-chip delivers the unique high-performance features required to address highly virtualized environments, massive data sets and new, emerging workloads.”
There is no word on pricing yet, but it should be competitive with Intel's offerings (the E5-2699A V4 is $4,938). AMD will reportedly be talking data center strategy and its upcoming products during the Open Compute Summit later this week, so hopefully there will be more information released at those presentations.
(My opinions follow)
This is one area where AMD needs to come out strong with support from motherboard manufacturers, system integrators, OEM partners, and OS and software validation to succeed. Intel is not likely to take AMD encroaching on its lucrative server market share lightly, and AMD is going to have a long road ahead of it to regain the market share it once had in this area, but it does have a decent architecture on its hands to build off of with Zen and if it can secure partner support Intel is certainly going to have competition here that it has not had to face in a long time. Intel and AMD competing over the data center market is a good thing, and as both companies bring new technology to market it will trickle down into the consumer level hardware. Naples' success in the data center could mean a profitable AMD with R&D money to push Zen as far as it can – so hopefully they can pull it off.
What are your thoughts on the Naples SoC and AMD's push into the server market?
- Zen and the Art of CPU Design
- AMD Zen Architecture Overview: Focus on Ryzen
- Dissecting AMD Zen Architecture - Interview with David Kanter
Subject: General Tech | November 13, 2016 - 06:24 PM | Scott Michaud
Tagged: optical computing, HPC
We occasionally discuss photonic computers as news is announced, because we're starting to reach “can count the number of atoms with fingers and toes” sizes of features. For instance, we reported on a chip made by University of Colorado Boulder and UC Berkeley that had both electric and photonic integrated circuits on it.
This announcement from Optalysys is completely different.
The Optalysys GENESYS is a PCIe add-in board that is designed to accelerate certain tasks. For instance, light is fourier transformed when it passes through a lens, and reverse fourier transformed when it is refocused by a second lens. When I was taking fourth-year optics back in 2009, our professor mentioned that scientists used this trick to solve fourier transforms by flashing light through a 2D pattern, passing through a lens, and being projected upon film. This image was measured pixel by pixel, with each intensity corresponding to the 2D fourier transform's value of the original pattern. Fourier transforms are long processes to solve algebraically, especially without modern computers, so this was a huge win; you're solving a 2D grid of values in a single step.
These are the sort of tricks that the Optalysys GENESYS claims to use. They claim that this will speed up matrix multiplications, convolutions (fourier transforms -- see previous paragraph), and pattern recognition (such as for DNA sequencing). Matrix multiplications is a bit surprising to me, because it's not immediately clear how you can abuse light dynamics to calculate this, but someone who has more experience in this field will probably say “Scott, you dummy, we've been doing this since the 1800s” or something.
Image Credit: Tom Roelandts
The circles of the filter (center) correspond to the frequencies it blocks or permits.
The frequencies correspond to how quick an image changes.
This is often used for noise reduction or edge detection, but it's just a filter in fourier space.
You could place it between two lenses to modify the image in that way.
From a performance standpoint, their “first demonstrator system” operated at 20Hz with 500x500 resolution. However, their video claims they expect to have a “PetaFLOP-equivalent co-processor” by the end of the 2017. For comparison, modern GPUs are just barely in the 10s of TeraFLOPs, but that's about as useful as comparing a CPU core to a digital signal processor (DSP). (I'm not saying this is analogous to a DSP, but performance comparisons are about as useful.)
Optalysys expects to have a 1 PetaFLOP co-processor available by the end of the year.
Subject: General Tech | September 14, 2016 - 01:06 PM | Jeremy Hellstrom
Tagged: pascal, tesla, p40, p4, nvidia, neural net, m40, M4, HPC
The Register have package a nice explanation of the basics of how neural nets work in their quick look at NVIDIA's new Pascal based HPC cards, the P4 and P40. The tired joke about Zilog or Dick Van Patten stems from the research which has shown that 8-bit precision is most effective when feeding data into a neural net. Using 16 or 32-bit values slows the processing down significantly while adding little precision to the results produced. NVIDIA is also perfecting a hybrid mode, where you can opt for a less precise answer produced by your local, presumably limited, hardware or you can upload the data to the cloud for the full treatment. This is great for those with security concerns or when a quicker answer is more valuable than a more accurate one.
As for the hardware, NVIDIA claims the optimizations on the P40 will make it "40 times more efficient" than an Intel Xeon E5 CPU and it will also provide slightly more throughput than the currently available Titan X. You can expect to see these arrive in the market sometime over then next two months.
"Nvidia has designed a couple of new Tesla processors for AI applications – the P4 and the P40 – and is talking up their 8-bit math performance. The 16nm FinFET GPUs use Nv's Pascal architecture and follow on from the P100 launched in June. The P4 fits on a half-height, half-length PCIe card for scale-out servers, while the beefier P40 has its eyes set on scale-up boxes."
Here is some more Tech News from around the web:
- Windows 10 Anniversary Update might not arrive on your PC until November @ The Inquirer
- iOS 10 reviewed: There’s no reason not to update @ Ars Technica
- iOS 10 rollout goes titsup as update 'bricks' iPhones and iPads @ The Inqurier
- DevOps and the Art of Secure Application Deployment @ Linux.com
- HTC to unveil new Desire smartphones on September 20 @ DigiTimes
- Using a thing made by Microsoft, Apple or Adobe? It probably needs a patch today @ The Register
- New York Fines Viacom, Mattel and Hasbro For Tracking Kids Online @ Slashdot
- Microsoft's Service Fabric for Linux hits public preview @ The Register
Subject: General Tech | August 17, 2016 - 12:41 PM | Jeremy Hellstrom
Tagged: nvidia, Intel, HPC, Xeon Phi, maxwell, pascal, dirty pool
There is a spat going on between Intel and NVIDIA over the slide below, as you can read about over at Ars Technica. It seems that Intel have reached into the industries bag of dirty tricks and polished off an old standby, testing new hardware and software against older products from their competitors. In this case it was high performance computing products which were tested, Intel's new Xeon Phi against NVIDIA's Maxwell, tested on an older version of the Caffe AlexNet benchmark.
NVIDIA points out that not only would they have done better than Intel if an up to date version of the benchmarking software was used, but that the comparison should have been against their current architecture, Pascal. This is not quite as bad as putting undocumented flags into compilers to reduce the performance of competitors chips or predatory discount programs but it shows that the computer industry continues to have only a passing acquaintance with fair play and honest competition.
"At this juncture I should point out that juicing benchmarks is, rather sadly, par for the course. Whenever a chip maker provides its own performance figures, they are almost always tailored to the strength of a specific chip—or alternatively, structured in such a way as to exacerbate the weakness of a competitor's product."
Here is some more Tech News from around the web:
- USB Implementers Forum introduces branding for safe USB-C charging @ The Inquirer
- Some Windows 10 Anniversary Update: SSD freeze @ The Register
- Intel Project Alloy: all-in-one VR headset takes aim at Google's Project Daydream @ The Inquirer
- Wanna build your own drone? Intel emits Linux-powered x86 brains for DIY flying gizmos @ The Register
- Intel's Optane XPoint DIMMs pushed back – source @ The Register
93% of a GP100 at least...
NVIDIA has announced the Tesla P100, the company's newest (and most powerful) accelerator for HPC. Based on the Pascal GP100 GPU, the Tesla P100 is built on 16nm FinFET and uses HBM2.
NVIDIA provided a comparison table, which we added what we know about a full GP100 to:
|Tesla K40||Tesla M40||Tesla P100||Full GP100|
|GPU||GK110 (Kepler)||GM200 (Maxwell)||GP100 (Pascal)||GP100 (Pascal)|
|FP32 CUDA Cores / SM||192||128||64||64|
|FP32 CUDA Cores / GPU||2880||3072||3584||3840|
|FP64 CUDA Cores / SM||64||4||32||32|
|FP64 CUDA Cores / GPU||960||96||1792||1920|
|Base Clock||745 MHz||948 MHz||1328 MHz||TBD|
|GPU Boost Clock||810/875 MHz||1114 MHz||1480 MHz||TBD|
|Memory Interface||384-bit GDDR5||384-bit GDDR5||4096-bit HBM2||4096-bit HBM2|
|Memory Size||Up to 12 GB||Up to 24 GB||16 GB||TBD|
|L2 Cache Size||1536 KB||3072 KB||4096 KB||TBD|
|Register File Size / SM||256 KB||256 KB||256 KB||256 KB|
|Register File Size / GPU||3840 KB||6144 KB||14336 KB||15360 KB|
|TDP||235 W||250 W||300 W||TBD|
|Transistors||7.1 billion||8 billion||15.3 billion||15.3 billion|
|GPU Die Size||551 mm2||601 mm2||610 mm2||610mm2|
|Manufacturing Process||28 nm||28 nm||16 nm||16nm|
This table is designed for developers that are interested in GPU compute, so a few variables (like ROPs) are still unknown, but it still gives us a huge insight into the “big Pascal” architecture. The jump to 16nm allows for about twice the number of transistors, 15.3 billion, up from 8 billion with GM200, with roughly the same die area, 610 mm2, up from 601 mm2.
A full GP100 processor will have 60 shader modules, compared to GM200's 24, although Pascal stores half of the shaders per SM. The GP100 part that is listed in the table above is actually partially disabled, cutting off four of the sixty total. This leads to 3584 single-precision (32-bit) CUDA cores, which is up from 3072 in GM200. (The full GP100 architecture will have 3840 of these FP32 CUDA cores -- but we don't know when or where we'll see that.) The base clock is also significantly higher than Maxwell, 1328 MHz versus ~1000 MHz for the Titan X and 980 Ti, although Ryan has overclocked those GPUs to ~1390 MHz with relative ease. This is interesting, because even though 10.6 TeraFLOPs is amazing, it's only about 20% more than what GM200 could pull off with an overclock.