Subject: General Tech | November 30, 2017 - 12:48 AM | Tim Verry
Tagged: HPC, supercomputer, Raspberry Pi 3, cluster, research, LANL
The Raspberry Pi has been used to build cheap servers and small clusters before, but BitScope is taking the idea to the extreme with a professional enterprise solution. On display at SC17, the BitScope Raspberry Pi Cluster Module is a 6U rackable drawer that holds 144 Raspberry Pi 3 single board computers along with all of the power, networking, and air cooling needed to keep things running smoothly.
Each cluster module holds two and a half BitScope Blades, with each BitScope Blade holding up to 60 Raspberry Pi PCs (or other SBCs like the ODROID C2). Enthusiasts can already purchase their own Quattro Pi boards as well as the cluster plate to assemble their own small clusters, though the 6U Cluster Module drawer doesn’t appear to be for sale yet (heh). Specifically, each Cluster Module has room for 144 active nodes, six spare nodes, and one cluster manager node.
For reference, the Raspberry Pi 3 features the Broadcom BCM2837 SoC with 4 ARM Cortex A53 cores at 1.2 GHz and a VideoCore IV GPU that is paired with 1 GB of LPDDR2 memory at 900 MHz, 100 Mbps Ethernet, 802.11n Wi-Fi and Bluetooth. The ODROID C2 has 4 Amlogic cores at 1.5 GHz, a Mali 450 GPU, 2 GB of DDR3 SDRAM, and Gigabit Ethernet. Interestingly, BitScope claims the Cluster Module uses a 10 Gigabit Ethernet SFP+ backbone, which will help when communicating between Cluster Modules, but speeds between individual nodes will be limited to one gigabit at best (less in the real world, and in the case of the Pi it is much less than the 100 Mbps port rating due to how the port is wired to the SoC).
BitScope is currently building a platform for Los Alamos National Laboratory that will feature five Cluster Modules for a whopping 2,880 64-bit ARM cores, 720GB of RAM, and a 10GbE SFP+ fabric backbone. Fully expanded, a 42U server cabinet holds 7 modules (1008 active nodes / 4,032 active cores) and would consume up to 6KW of power. LANL expects their 5 module setup to use around 3000 W on average though.
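The node, core, and memory totals quoted above follow directly from the per-module figures. Here is a minimal sketch of that arithmetic; the constant and function names are made up for illustration and are not anything BitScope publishes.

```python
# Figures quoted in the article; names are illustrative only.
NODES_PER_MODULE = 144   # active Raspberry Pi 3 nodes per Cluster Module
CORES_PER_NODE = 4       # Cortex-A53 cores per Raspberry Pi 3
RAM_GB_PER_NODE = 1      # LPDDR2 per Raspberry Pi 3


def cluster_totals(modules: int) -> dict:
    """Return active node, core, and RAM totals for a number of modules."""
    nodes = modules * NODES_PER_MODULE
    return {
        "nodes": nodes,
        "cores": nodes * CORES_PER_NODE,
        "ram_gb": nodes * RAM_GB_PER_NODE,
    }


lanl = cluster_totals(5)       # the Los Alamos pilot system
full_rack = cluster_totals(7)  # a fully populated 42U cabinet (7 x 6U)

print("LANL pilot:", lanl)
print("Full 42U rack:", full_rack)
```

Spare nodes and the cluster manager node are deliberately left out, since the article's totals count active nodes only.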
What are the New Mexico Consortium and LANL planning to do with all these cores? Well, playing Crysis would prove tough even if they could SLI all those GPUs, so instead they plan to use the Raspberry Pi-powered system to model much larger and prohibitively expensive supercomputers for R&D and software development. Building out a relatively low cost and low power system enables it to be powered on and accessed by more people, including students, researchers, and programmers, who can learn and design software that runs as efficiently as possible on massive multiple core and multiple node systems. Getting software to scale out to hundreds and thousands of different nodes is tricky, especially if you want all the nodes working on the same problem(s) at once. Keeping each node fed with data, communicating amongst themselves, and returning accurate results while keeping latency low and utilization high is a huge undertaking. LANL is hoping that the Raspberry Pi based system will be the perfect testing ground for software and techniques they can then use on the big gun supercomputers like Trinity, Titan, Summit (ORNL, slated for 2018), and other smaller HPC clusters.
It is cool to see how far the Raspberry Pi has come, and while I wish the GPU was more open so that the researchers could more easily work with heterogeneous HPC coding rather than just working with the thousands of ARM cores, it is still impressive to see what is essentially a small supercomputer with a 1008 node cluster for under $25,000!
I am interested to see how the researchers at Los Alamos put it to work and the eventual improvements to HPC and supercomputing software that come from this budget cluster project!
- Intel Hopes For Exaflop Capable Supercomputers Within 10 Years
- The Next Most Powerful Supercomputer in the U.S. Is Almost Complete
- NVIDIA Launches Tesla K20X Accelerator Card, Powers Titan Supercomputer
- GTC 2013: Pedraforca Is A Power Efficient ARM + GPU Cluster For Homogeneous (GPU) Workloads
Subject: General Tech | November 25, 2016 - 07:47 PM | Scott Michaud
Tagged: Japan, supercomputer
According to Reuters, Japan’s Ministry of Economy, Trade and Industry has set aside 19.5 billion yen to build a high-end supercomputer. This will translate into 130 PetaFLOPs, which would put it ahead of all other announced clusters. The article claims that the government will rent the computer out to Japanese corporations, many of which currently use American-based cloud services.
The supercomputer has been named ABCI: AI Bridging Cloud Infrastructure.
Image Credit: つ via Wikipedia
From a hardware standpoint? There’s not a whole lot else to say about it. The money has been set aside, but no-one has been selected to build it. Companies will submit their bids by December 8th, and we assume they’ll make an announcement at some point after.
This also means we don’t know what is planned to go into each node. Despite targeting ABCI at AI, Japan is sticking to the “FLOPs” rating, and thus will probably be focused on floating-point workloads. It would be weird to see such an expensive machine be focused on 8- or 16-bit instructions, but then we see Google creating custom ASICs, called TPUs, that seem to get huge performance boosts by sticking to low-precision workloads. Could that even scale to a competitive supercomputer? Or would it cut out too many potential customers that need 32- and 64-bit precision?
Either way, I would guess that this computer will use more conventional, GPU-style co-processors from someone like Intel (Xeon Phi) or NVIDIA. Really, we don’t know, though. No-one does at this point. It’s an interesting branding, though.
Subject: General Tech | October 6, 2016 - 11:37 PM | Tim Verry
Tagged: supercomputer, microsoft, deep neural network, azure, artificial intelligence, ai
Microsoft recently announced it would be restructuring 5,000 employees as it focuses its efforts on artificial intelligence with a new AI and Research Group. The Redmond giant is pulling computer scientists and engineers from Microsoft Research, the Information Platform, Bing, and Cortana groups, and the Ambient Computing and Robotics teams. Led by 20 year Microsoft veteran Harry Shum (who has worked in both research and engineering roles at Microsoft), the new AI team promises to "democratize AI" and be a leader in the field with intelligent products and services.
It seems that "democratizing AI" is less about free artificial intelligence and more about making the technology accessible to everyone. The AI and Research Group plans to develop artificial intelligence to the point where it will change how humans interact with their computers (read: Cortana 2.0) with services and commands being conversational rather than strict commands, new applications baked with AI such as office and photo editors that are able to proofread and suggest optimal edits respectively, and new vision, speech, and machine analytics APIs that other developers will be able to harness for their own applications. (Wow that's quite the long sentence - sorry!)
Further, Microsoft wants to build the world's fastest AI supercomputer using its Azure cloud computing service. The Azure-powered AI will be available to everyone for their applications and research needs (for a price, of course!). Microsoft certainly has the money, brain power, and computing power to throw at the problem, and this may be one of the major areas where looking to "the cloud" for a company's computing needs is a smart move, as the up front capital needed for hardware, engineers, and support staff to do something like this in-house would be extremely prohibitive. It remains to be seen whether Microsoft will win out over its competitors at being the first, but it is certainly staking its claim and does not want to be left out completely.
“Microsoft has been working in artificial intelligence since the beginning of Microsoft Research, and yet we’ve only begun to scratch the surface of what’s possible,” said Shum, executive vice president of the Microsoft AI and Research Group. “Today’s move signifies Microsoft’s commitment to deploying intelligent technology and democratizing AI in a way that changes our lives and the world around us for the better. We will significantly expand our efforts to empower people and organizations to achieve more with our tools, our software and services, and our powerful, global-scale cloud computing capabilities.”
Interestingly, this announcement comes shortly after a previous announcement that industry giants Amazon, Facebook, Google-backed DeepMind, IBM, and Microsoft founded the not-for-profit Partnership On AI organization that will collaborate and research best practices on AI development and exploitation (and hopefully how to teach them not to turn on us heh).
I am looking forward to the future of AI and the technologies it will enable!
Subject: General Tech | November 13, 2014 - 01:39 PM | Jeremy Hellstrom
Tagged: cycle computing, supercomputer, gojira, bargain
While the new Gojira supercomputer is not more powerful than the University of Tokyo's Oakleaf-FX at 1 petaflop of performance, if you look at it from a price to performance ratio the $5,500 Gojira is more than impressive. It has a peak theoretical performance of 729 teraflops, reached by using over 71,000 Ivy Bridge cores across several Amazon Web Service regions and providing the equivalent of 70.75 years of compute time. The cluster was built in an incredibly short time, going from zero to 50,000 cores in 23 minutes and hitting the peak after 60 minutes. You won't be playing AC Unity on it any time soon, but if you want to rapidly test virtual prototypes these guys can do it for an insanely low price. Catch more at The Register and ZDNet; the Cycle Computing page seems to be down for the moment.
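To put a rough number on that price to performance ratio, here is a small sketch using only the figures quoted above. It assumes 365-day years when converting "years of compute time" into core-hours, and the variable names are purely illustrative.

```python
# Figures quoted in the article.
COST_USD = 5_500           # total cost of the Gojira run
PEAK_TFLOPS = 729          # peak theoretical performance
COMPUTE_YEARS = 70.75      # equivalent compute time delivered

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap years

usd_per_peak_tflop = COST_USD / PEAK_TFLOPS
core_hours = COMPUTE_YEARS * HOURS_PER_YEAR
usd_per_core_hour = COST_USD / core_hours

print(f"Cost per peak teraflop: ${usd_per_peak_tflop:.2f}")
print(f"Equivalent core-hours:  {core_hours:,.0f}")
print(f"Cost per core-hour:     ${usd_per_core_hour:.4f}")
```

Peak teraflops are a theoretical ceiling, so the cost per teraflop actually sustained would be higher than this estimate.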
"Cycle Computing has helped hard drive giant Western Digital shove a month's worth of simulations into eight hours on Amazon cores."
Here is some more Tech News from around the web:
- Mac OS X Yosemite disables third-party SSD driver support @ The Inquirer
- Digitimes Research: Lenovo, Asustek to launch US$149 Chromebook
- DAY ZERO, and COUNTING: EVIL 'UNICORN' all-Windows vuln - are YOU patched? @ The Register
- FLASH better than DISK for archiving, say academics. Are they stark, raving mad? @ The Register
Subject: General Tech | March 16, 2014 - 10:27 PM | Sebastian Peak
Tagged: supercomputer, solid state drive, NSF, flash memory
We know that SSDs help any system perform better by reducing the storage bottlenecks we all experienced from hard disk drives. But how far can flash storage go in increasing performance if money is no object? Enter the multi-million dollar world of supercomputers. Historically supercomputers have relied on the addition of more CPU cores to increase performance, but two new system projects funded by the National Science Foundation (NSF) will try a different approach: obscene amounts of high-speed flash storage!
The news comes as the NSF is requesting a cool $7 billion in research money for 2015, and construction has apparently already begun on two new storage-centered supercomputers. Memory and high-speed flash storage arrays will be loaded on the Wrangler supercomputer at Texas Advanced Computing Center (TACC), and the Comet supercomputer at the San Diego Supercomputer Center (SDSC).
Check out the crazy numbers from the TACC's Wrangler: a combination of 120 servers, each with Haswell-based Xeon CPUs, and a total of 10 petabytes (10,000TB!) of high performance flash data storage. The NSF says the supercomputer will have 3,000 processing cores dedicated to data analysis, with flash storage layers for analytics. The Wrangler supercomputer's bandwidth is said to be 1TB/s, with 275 million IOPS! By comparison, the Comet supercomputer will have “only” 1,024 Xeon CPU cores, with a 7 petabyte high-speed flash storage array. (Come on, guys... That’s like, wayyy less bytes.)
Supercomputer under construction…probably (Image credit CBS/Paramount)
The supercomputers are part of the NSF's “Extreme Digital” (XD) research program, and their current priorities are “relevant to the problems faced in computing today”. Hmm, kind of makes you want to run a big multi-SSD deathwish RAID, huh?
Subject: General Tech, Processors, Systems | June 26, 2013 - 10:27 PM | Scott Michaud
Tagged: supercomputing, supercomputer, titan, Xeon Phi
The National Supercomputer Center in Guangzhou, China, will host the world's fastest supercomputer by the end of the year. The Tianhe-2, English: "Milky Way-2", is capable of nearly double the floating-point performance of Titan, albeit with slightly less performance per watt. The Tianhe-2 was developed by China's National University of Defense Technology.
Photo Credit: Top500.org
Comparing the new fastest computer with the former, China's Milky Way-2 is able to achieve 33.8627 PetaFLOPs of calculations from 17.808 MW of electricity. The Titan, on the other hand, is able to crunch 17.590 PetaFLOPs with a draw of just 8.209 MW. As such, the new Milky Way-2 uses 12.7% more power per FLOP than Titan.
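For anyone who wants to check the power-per-FLOP comparison, here is a quick sketch using only the Linpack and power figures quoted above. The function name is just for illustration.

```python
def mw_per_pflop(power_mw: float, pflops: float) -> float:
    """Megawatts drawn per PetaFLOP of Linpack performance."""
    return power_mw / pflops


tianhe2 = mw_per_pflop(power_mw=17.808, pflops=33.8627)
titan = mw_per_pflop(power_mw=8.209, pflops=17.590)

# How much more power Tianhe-2 needs per FLOP, relative to Titan.
extra_percent = (tianhe2 / titan - 1) * 100

print(f"Tianhe-2: {tianhe2:.4f} MW/PFLOP")
print(f"Titan:    {titan:.4f} MW/PFLOP")
print(f"Tianhe-2 uses {extra_percent:.1f}% more power per FLOP")
```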
Titan is famously based on the Kepler GPU architecture from NVIDIA, coupled with several 16-core AMD Opteron server processors clocked at 2.2 GHz. This concept of using accelerated hardware carried over into the design of Tianhe-2, which is based around Intel's Xeon Phi coprocessor. If you include the simplified co-processor cores of the Xeon Phi, the new champion is the sum of 3.12 million x86 cores and 1024 terabytes of memory.
... but will it run Crysis?
... if someone gets around to emulating DirectX in software, it very well could.
Subject: Systems | June 3, 2013 - 09:27 PM | Tim Verry
Tagged: Xeon Phi, tianhe-2, supercomputer, Ivy Bridge, HPC, China
A powerful new supercomputer constructed by Chinese company Inspur is currently in testing at the National University of Defense Technology. Called the Tianhe-2, the new supercomputer has 16,000 compute nodes and approximately 54 Petaflops of peak theoretical compute performance.
Destined for the National Supercomputer Center in Guangzhou, China, the open HPC platform will be used for education and research projects. The Tianhe-2 is composed of 125 racks with 128 compute nodes in each rack.
The compute nodes are broken down into two types: CPM and APU modules. One of each node type makes up a single compute board. The CPM module hosts four Intel Ivy Bridge processors, 128GB system memory, and a single Intel Xeon Phi accelerator card with 8GB of its own memory. Each APU module adds five Xeon Phi cards to every compute board. The compute boards (a CPM module + a APU module) contain two NICs that connect the various compute boards with Inspur's custom THExpress2 high bandwidth interconnects. Finally, the Tianhe-2 supercomputer will have access to 12.4 Petabytes of storage that is shared across all of the compute boards.
In all, the Tianhe-2 is powered by 32,000 Intel Ivy Bridge processors, 1.024 Petabytes of system memory (not counting Phi dedicated memory--which would make the total 1.404 PB), and 48,000 Intel Xeon Phi MIC (Many Integrated Cores) cards. That is a total of 3,120,000 processor cores (though keep in mind that number is primarily made up of the relatively simple individual Phi cores as there are 57 cores to each Phi card).
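The component totals can be rebuilt from the per-rack and per-board figures given above. The sketch below assumes each compute board (one CPM plus one APU module) counts as two logical nodes, which is consistent with the quoted totals but is my reading rather than something Inspur states in this form. The 12 cores per Ivy Bridge chip is likewise derived from the quoted core total, not given directly.

```python
# Figures quoted in the article.
RACKS = 125
NODES_PER_RACK = 128
CPUS_PER_BOARD = 4            # Ivy Bridge chips on the CPM module
PHIS_PER_BOARD = 1 + 5        # one Phi on the CPM module, five on the APU module
RAM_GB_PER_BOARD = 128
PHI_RAM_GB = 8
CORES_PER_PHI = 57
QUOTED_TOTAL_CORES = 3_120_000

NODES_PER_BOARD = 2           # assumption: one board = two logical nodes

nodes = RACKS * NODES_PER_RACK
boards = nodes // NODES_PER_BOARD
cpus = boards * CPUS_PER_BOARD
phis = boards * PHIS_PER_BOARD

system_ram_pb = boards * RAM_GB_PER_BOARD / 1_000_000
phi_ram_pb = phis * PHI_RAM_GB / 1_000_000

phi_cores = phis * CORES_PER_PHI
# Cores per Ivy Bridge chip implied by the quoted grand total.
implied_cores_per_cpu = (QUOTED_TOTAL_CORES - phi_cores) // cpus

print(f"{nodes:,} nodes, {cpus:,} CPUs, {phis:,} Phi cards")
print(f"System RAM {system_ram_pb:.3f} PB + Phi RAM {phi_ram_pb:.3f} PB")
print(f"Phi cores {phi_cores:,}, implied cores per CPU {implied_cores_per_cpu}")
```

Note that the straight sum of system and Phi memory from these inputs comes to 1.408 PB, slightly above the 1.404 PB quoted, so the published total may use a different accounting.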
Inspur claims up to 3.432 TFlops of peak compute performance per compute node (which, for simplicity, they break down as one node being 2 Ivy Bridge chips, 64GB memory, and 3 Xeon Phi cards, although the two compute modules that make up a node are not physically laid out that way) for a total theoretical potential compute power of 54,912 TFlops (or 54.912 Petaflops) across the entire supercomputer. In the latest Linpack benchmark run, researchers saw up to 62.3% efficiency in attaining peak performance -- 30.65 PFlops out of 49.19 PFlops peak/theoretical performance -- when only using 14,336 nodes with 50GB RAM each. Further testing and optimization should improve that number, and when all nodes are brought online the real world performance will naturally be higher than the current benchmarks. With that said, the Tianhe-2 is already besting Cray's TITAN, which is promising (though I hope Cray comes back next year and takes the crown again, heh).
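The peak and efficiency figures above can be sanity-checked with a few lines. This is only a sketch built from the quoted numbers, and the names are illustrative.

```python
# Figures quoted in the article.
PEAK_TFLOPS_PER_NODE = 3.432
TOTAL_NODES = 16_000
BENCHMARK_NODES = 14_336
MEASURED_PFLOPS = 30.65        # Linpack result on the partial system
QUOTED_PARTIAL_PEAK_PFLOPS = 49.19

full_peak_pflops = TOTAL_NODES * PEAK_TFLOPS_PER_NODE / 1000
partial_peak_pflops = BENCHMARK_NODES * PEAK_TFLOPS_PER_NODE / 1000

# Linpack efficiency: measured performance as a share of theoretical peak.
efficiency_percent = MEASURED_PFLOPS / QUOTED_PARTIAL_PEAK_PFLOPS * 100

print(f"Full-system peak:    {full_peak_pflops:.3f} PFlops")
print(f"Peak on 14,336 nodes: {partial_peak_pflops:.2f} PFlops")
print(f"Linpack efficiency:  {efficiency_percent:.1f}%")
```

The peak computed for 14,336 nodes lands within about 0.01 PFlops of the quoted 49.19, which suggests the published figure was rounded slightly differently.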
In order to keep all of this hardware cool, Inspur is planning a custom liquid cooling system using chilled water. The Tianhe-2 will draw up to 17.6 MW of power under load. Once the liquid cooling system is implemented the supercomputer will draw 24MW while under load.
This is an impressive system, and an interesting take on a supercomputer architecture considering the rise in popularity of heterogeneous architectures that pair massive numbers of CPUs with graphics processing units (GPUs).
The Tianhe-2 supercomputer will be reconstructed at its permanent home at the National Supercomputer Center in Guangzhou, China once the testing phase is finished. It will be one of the top supercomputers in the world once it is fully online! HPC Wire has a nice article with slides and further details on the upcoming processing powerhouse that is worth a read if you are into this sort of HPC stuff.
Also read: Cray unveils the TITAN supercomputer.
Subject: General Tech, Graphics Cards | March 20, 2013 - 01:47 PM | Tim Verry
Tagged: tesla, tegra 3, supercomputer, pedraforca, nvidia, GTC 2013, GTC, graphics cards, data centers
There is a lot of talk about heterogeneous computing at GTC, in the sense of adding graphics cards to servers. If you have HPC workloads that can benefit from GPU parallelism, adding GPUs gives you computing performance in less physical space, and using less power, than a CPU only cluster (for equivalent TFLOPS).
However, there was a session at GTC that actually took things to the opposite extreme. Instead of a CPU only cluster or a mixed cluster, Alex Ramirez (leader of Heterogeneous Architectures Group at Barcelona Supercomputing Center) is proposing a homogeneous GPU cluster called Pedraforca.
Pedraforca V2 combines NVIDIA Tesla GPUs with low power ARM processors. Each node is comprised of the following components:
- 1 x Mini-ITX carrier board
- 1 x Q7 module (which hosts the ARM SoC and memory)
  - Current config is one Tegra 3 @ 1.3GHz and 2GB DDR2
- 1 x NVIDIA Tesla K20 accelerator card (1170 GFLOPS)
- 1 x InfiniBand 40Gb/s card (via Mellanox ConnectX-3 slot)
- 1 x 2.5" SSD (SATA 3 MLC, 250GB)
The ARM processor is used solely for booting the system and facilitating GPU communication between nodes. It is not intended to be used for computing. According to Dr. Ramirez, in situations where running code on a CPU would be faster, it would be best to have a small number of Intel Xeon powered nodes to do the CPU-favorable computing, and then offload the parallel workloads to the GPU cluster over the InfiniBand connection (though this is less than ideal, Pedraforca would be most-efficient with data-sets that can be processed solely on the Tesla cards).
While Pedraforca is not necessarily locked to NVIDIA's Tegra hardware, it is currently the only SoC that meets their needs. The system requires the ARM chip to have PCI-E support. The Tegra 3 SoC has four PCI-E lanes, so the carrier board is using two PLX chips to allow the Tesla and InfiniBand cards to both be connected.
The researcher stated that he is also looking forward to using NVIDIA's upcoming Logan processor in the Pedraforca cluster. It will reportedly be possible to upgrade existing Pedraforca clusters with the new chips by replacing the existing (Tegra 3) Q7 module with one that has the Logan SoC when it is released.
Pedraforca V2 has an initial cluster size of 64 nodes. While the speaker was reluctant to provide TFLOPS performance numbers, as it would depend on the workload, with 64 Tesla K20 cards, it should provide respectable performance. The intent of the cluster is to save power costs by using a low power CPU. If your server kernel and applications can run on GPUs alone, there are noticeable power savings to be had by switching from a ~100W Intel Xeon chip to a lower-power (approximately 2-3W) Tegra 3 processor. If you have a kernel that needs to run on a CPU, it is recommended to run the OS on an Intel server and transfer just the GPU work to the Pedraforca cluster. Each Pedraforca node is reportedly under 300W, with the Tesla card being the majority of that figure. Despite the limitations, and niche nature of the workloads and software necessary to get the full power-saving benefits, Pedraforca is certainly an interesting take on a homogeneous server cluster!
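Although no cluster-wide numbers were given in the session, a rough upper bound can be sketched from the per-node figures above. This assumes every node carries one K20 at its rated 1170 GFLOPS peak, takes the top of the 2-3W Tegra range, and treats 300W as a per-node ceiling. Real workloads will land well below the peak figure.

```python
# Per-node figures quoted in the article.
NODES = 64
K20_PEAK_GFLOPS = 1170    # rated peak per Tesla K20 card
NODE_POWER_CEILING_W = 300
XEON_POWER_W = 100        # approximate figure for a Xeon chip
TEGRA3_POWER_W = 3        # assumption: upper end of the 2-3W range

cluster_peak_tflops = NODES * K20_PEAK_GFLOPS / 1000
cluster_power_ceiling_kw = NODES * NODE_POWER_CEILING_W / 1000

# Host CPU power saved by swapping a Xeon for a Tegra 3 in every node.
cpu_saving_per_node_w = XEON_POWER_W - TEGRA3_POWER_W
cpu_saving_cluster_kw = NODES * cpu_saving_per_node_w / 1000

print(f"Theoretical GPU peak: {cluster_peak_tflops:.2f} TFLOPS")
print(f"Power ceiling:        {cluster_power_ceiling_kw:.1f} kW")
print(f"Host CPU savings:     {cpu_saving_cluster_kw:.2f} kW across the cluster")
```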
In another session relating to the path to exascale computing, power use in data centers was listed as one of the biggest hurdles to getting to Exaflop-levels of performance, and while Pedraforca is not the answer to Exascale, it should at least be a useful learning experience at wringing the most parallelism out of code and pushing GPGPU to the limits. And that research will help other clusters use the GPUs more efficiently as researchers explore the future of computing.
The Pedraforca project built upon research conducted on Tibidabo, a multi-core ARM CPU cluster, and CARMA (CUDA on ARM development kit) which is a Tegra SoC paired with an NVIDIA Quadro card. The two slides below show CARMA benchmarks and a Tibidabo cluster (click on image for larger version).
Stay tuned to PC Perspective for more GTC 2013 coverage!
Subject: General Tech | November 12, 2012 - 06:29 AM | Tim Verry
Tagged: tesla, supercomputer, nvidia, k20x, HPC, CUDA, computing
Graphics card manufacturer NVIDIA launched a new Tesla K20X accelerator card today that supplants the existing K20 as the top of the line model. The new card cranks up the double and single precision floating point performance, beefs up the memory capacity and bandwidth, and brings some efficiency improvements to the supercomputer space.
While it is not yet clear how many CUDA cores the K20X has, NVIDIA has stated that it is using the GK110 GPU, and is running with 6GB of memory with 250 GB/s of bandwidth – a nice improvement over the K20’s 5GB at 208 GB/s. Both the new K20X and K20 accelerator cards are based on the company’s Kepler architecture, but NVIDIA has managed to wring out more performance from the K20X. The K20 is rated at 1.17 TFlops peak double precision and 3.52 TFlops peak single precision while the K20X is rated at 1.31 TFlops and 3.95 TFlops.
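To see how big the generational bump is in relative terms, here is a short sketch that turns the spec-sheet figures above into percentage gains. The dictionary keys are illustrative labels only.

```python
# Spec-sheet figures quoted above, as (K20, K20X) pairs.
SPECS = {
    "memory_gb": (5, 6),
    "bandwidth_gb_s": (208, 250),
    "peak_dp_tflops": (1.17, 1.31),
    "peak_sp_tflops": (3.52, 3.95),
}


def percent_gain(old: float, new: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new / old - 1) * 100


gains = {name: percent_gain(k20, k20x) for name, (k20, k20x) in SPECS.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
```

Memory capacity and bandwidth see the larger relative jump, while the peak floating point ratings rise by a smaller margin.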
The K20X manages to score 1.22 TFlops in DGEMM, which puts it at almost three times faster than the previous generation Tesla M2090 accelerator based on the Fermi architecture.
Aside from pure performance, NVIDIA is also touting efficiency gains with the new K20X accelerator card. When two K20X cards are paired with a 2P Sandy Bridge server, NVIDIA claims to achieve 76% efficiency versus 61% efficiency with a 2P Sandy Bridge server equipped with two previous generation M2090 accelerator cards. Additionally, NVIDIA claims to have enabled the Titan supercomputer to reach the #1 spot on the top 500 green supercomputers thanks to its new cards with a rating of 2,120.16 MFLOPS/W (million floating point operations per second per watt).
NVIDIA claims to have already shipped 30 PFLOPS worth of GPU accelerated computing power. Interestingly, most of that computing power is housed in the recently unveiled Titan supercomputer. This supercomputer contains 18,688 Tesla K20X (Kepler GK110) GPUs and 18,688 16-core AMD Opteron 6274 processors (299,008 CPU cores in total). It will consume 9 megawatts of power and is rated at a peak of 27 Petaflops and 17.59 Petaflops during a sustained Linpack benchmark. Further, when compared to Sandy Bridge processors, the K20 series offers up between 8.2 and 18.1 times more performance at several scientific applications.
While the Tesla cards undoubtedly use more power than CPUs, you need far fewer accelerator cards than processors to hit the same performance numbers. That is where NVIDIA is getting its power efficiency numbers from.
NVIDIA is aiming the accelerator cards at researchers and businesses doing 3D graphics, visual effects, high performance computing, climate modeling, molecular dynamics, earth science, simulations, fluid dynamics, and other such computationally intensive tasks. Using CUDA and the parallel nature of the GPU, the Tesla cards can achieve performance much higher than a CPU-only system can. NVIDIA has also engineered software to better parallelize workloads and keep the GPU accelerators fed with data that the company calls Hyper-Q and Dynamic Parallelism respectively.
It is interesting to see NVIDIA bring out a new flagship, especially another GK110 card. Systems using the K20 and the new K20X are available now with cards shipping this week and general availability later this month.
You can find the full press release below and a look at the GK110 GPU in our preview.
Anandtech also managed to get a look inside the Titan supercomputer at Oak Ridge National Laboratory, where you can see the Tesla K20X cards in action.
Subject: Systems | May 24, 2011 - 09:07 PM | Tim Verry
Tagged: tesla, supercomputer, petaflop, HPC, bulldozer
Cray has been a huge name in the supercomputer market for years, and with the new XK6 they are promising to deliver a supercomputer capable of 50 Thousand Trillion operations per second. Powered by AMD Opteron CPUs and NVIDIA GPUs, each XK6 blade is composed of 2 Gemini interconnects pairing four AMD Opteron CPUs with four NVIDIA Tesla X2090 embedded graphics cards. The graphics cards in each blade have access to 6GB of GDDR5 memory, and are connected via PCI-E 2.0 links to the Opteron processors. The CPUs have access to four DDR3 memory slots “running at 1.6GHz for every G34 socket,” according to The Register. This amounts to 32GB per two-socket node when using 4GB sticks.
Cray plans to wait until AMD releases the 16 core 32nm Opteron CPUs, dubbed the Opteron 6200s, in Q3. The Register quotes AMD’s CEO Thomas Seifert as promising that the processors, which are based on the new Bulldozer cores and would be compatible with the current G34 sockets, “would ship by summer.”
Further, they claim that Cray’s goal with the XK6 was to keep the new blades within the same thermal boundaries as its predecessor, despite the inclusion of GPUs into the mix. Cray has indicated that, due to their success in remaining within the thermal envelope, their customers will be able to use XE6 and XK6 blades interchangeably and will allow them to customize their supercomputer load-out to meet the demands of their specific computing workloads.
Each cabinet is capable of storing up to 24 blades, and can deliver up to 50 kilowatts of power. Each of the Tesla X2090 GPUs is capable of 665 gigaflops during double-precision floating point operations, something that GPUs excel at. As each XK6 blade contains 4 GPUs, and each cabinet can hold 24 blades, customers are looking at 63.8 teraflops of computing power solely from the graphics cards. On the CPU side of things, Cray is not able to release specifications on the processors as AMD has yet to deliver the chips in question. The Register estimates that each XK6 blade will provide 3.5 teraflops of floating point computing power, which amounts to approximately 84 teraflops per cabinet.
With a claimed capability to utilize up to 300 cabinets full of XK6 blades, customers are looking at approximately 44 petaflops of computing horsepower, with GPUs delivering 19.14 petaflops, and the CPUs estimated to provide 25.2 petaflops of floating point computational power.
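The cabinet and full-system estimates above are straightforward multiplication, sketched below from the quoted figures. The 3.5 teraflops per blade is The Register's estimate rather than a Cray specification, and the names are illustrative.

```python
# Figures quoted in the article.
GPU_GFLOPS = 665            # double precision, per Tesla X2090
GPUS_PER_BLADE = 4
BLADES_PER_CABINET = 24
CPU_TFLOPS_PER_BLADE = 3.5  # The Register's estimate, not a Cray spec
MAX_CABINETS = 300

gpu_tflops_per_cabinet = GPU_GFLOPS * GPUS_PER_BLADE * BLADES_PER_CABINET / 1000
cpu_tflops_per_cabinet = CPU_TFLOPS_PER_BLADE * BLADES_PER_CABINET

gpu_pflops = gpu_tflops_per_cabinet * MAX_CABINETS / 1000
cpu_pflops = cpu_tflops_per_cabinet * MAX_CABINETS / 1000
total_pflops = gpu_pflops + cpu_pflops

print(f"Per cabinet: {gpu_tflops_per_cabinet:.2f} TF (GPU) + "
      f"{cpu_tflops_per_cabinet:.1f} TF (CPU)")
print(f"300 cabinets: {gpu_pflops:.2f} PF (GPU) + {cpu_pflops:.1f} PF (CPU) "
      f"= {total_pflops:.2f} PF")
```

Carrying the unrounded 63.84 teraflops per cabinet through gives a GPU total a hair above the 19.14 petaflops quoted, which starts from the rounded 63.8 figure. Either way the sum lands at roughly 44 petaflops.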
The first customer of this system will be the Swiss National Supercomputing Centre. According to the Seattle Times, the center’s director Professor Thomas Schulthess stated that they chose the Cray XK6 based supercomputer not for its raw performance, but because “the Cray XK6 promises to be the first general-purpose supercomputer based on GPU technology, and we are very much looking forward to exploring its performance and productivity on real applications relevant to our scientists.”