Subject: General Tech, Processors, Systems | June 26, 2013 - 10:27 PM | Scott Michaud
Tagged: supercomputing, supercomputer, titan, Xeon Phi
The National Supercomputer Center in Guangzhou, China, will host the world's fastest supercomputer by the end of the year. The Tianhe-2 (English: "Milky Way-2") is capable of nearly double the floating-point performance of Titan, albeit with slightly less performance per watt. The Tianhe-2 was developed by China's National University of Defense Technology.
Photo Credit: Top500.org
Comparing the new fastest computer with the former champion, China's Milky Way-2 is able to achieve 33.8627 PetaFLOPs of calculations from 17.808 MW of electricity. The Titan, on the other hand, is able to crunch 17.590 PetaFLOPs with a draw of just 8.209 MW. As such, the new Milky Way-2 uses 12.7% more power per FLOP than Titan.
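For anyone who wants to check the efficiency comparison, here is a minimal Python sketch using only the performance and power figures quoted above. The dictionary layout and function names are made up for illustration.

```python
# Figures quoted in the post: sustained PFLOPS and power draw in MW.
SYSTEMS = {
    "Tianhe-2": {"pflops": 33.8627, "megawatts": 17.808},
    "Titan": {"pflops": 17.590, "megawatts": 8.209},
}


def megawatts_per_pflop(system):
    """Power drawn per PFLOPS of sustained performance."""
    return system["megawatts"] / system["pflops"]


def gflops_per_watt(system):
    # 1 PFLOPS = 1e6 GFLOPS and 1 MW = 1e6 W, so the ratio carries over directly.
    return system["pflops"] / system["megawatts"]


tianhe2_cost = megawatts_per_pflop(SYSTEMS["Tianhe-2"])
titan_cost = megawatts_per_pflop(SYSTEMS["Titan"])
extra_power_pct = (tianhe2_cost / titan_cost - 1) * 100

for name, system in SYSTEMS.items():
    print(f"{name}: {gflops_per_watt(system):.2f} GFLOPS/W")
print(f"Tianhe-2 uses {extra_power_pct:.1f}% more power per FLOP than Titan")
```

The percentage is the ratio of the two power-per-FLOP values, so it does not depend on which units are used as long as both systems use the same ones.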
Titan is famously based on the Kepler GPU architecture from NVIDIA, coupled with several 16-core AMD Opteron server processors clocked at 2.2 GHz. This concept of using accelerated hardware carried over into the design of Tianhe-2, which is based around Intel's Xeon Phi coprocessor. If you include the simplified co-processor cores of the Xeon Phi, the new champion is the sum of 3.12 million x86 cores and 1024 terabytes of memory.
... but will it run Crysis?
... if someone gets around to emulating DirectX in software, it very well could.
Subject: Systems | June 3, 2013 - 09:27 PM | Tim Verry
Tagged: Xeon Phi, tianhe-2, supercomputer, Ivy Bridge, HPC, China
A powerful new supercomputer constructed by Chinese company Inspur is currently in testing at the National University of Defense Technology. Called the Tianhe-2, the new supercomputer has 16,000 compute nodes and approximately 54 Petaflops of peak theoretical compute performance.
Destined for the National Supercomputer Center in Guangzhou, China, the open HPC platform will be used for education and research projects. The Tianhe-2 is composed of 125 racks with 128 compute nodes in each rack.
The compute nodes are broken down into two types: CPM and APU modules. One of each module type makes up a single compute board. The CPM module hosts four Intel Ivy Bridge processors, 128GB system memory, and a single Intel Xeon Phi accelerator card with 8GB of its own memory. Each APU module adds five Xeon Phi cards to every compute board. The compute boards (a CPM module + an APU module) contain two NICs that connect the various compute boards with Inspur's custom THExpress2 high bandwidth interconnects. Finally, the Tianhe-2 supercomputer will have access to 12.4 Petabytes of storage that is shared across all of the compute boards.
In all, the Tianhe-2 is powered by 32,000 Intel Ivy Bridge processors, 1.024 Petabytes of system memory (not counting the Phi cards' dedicated memory, which at 8GB per card would make the total 1.408 PB), and 48,000 Intel Xeon Phi MIC (Many Integrated Cores) cards. That is a total of 3,120,000 processor cores (though keep in mind that number is primarily made up of the relatively simple individual Phi cores, as there are 57 cores to each Phi card).
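The totals can be reproduced with a few lines of Python. This is a rough sketch: the chip counts, Phi core count, and memory figures come from the post, but the 12 cores per Ivy Bridge processor is an assumption not stated above, and memory is summed in decimal units.

```python
# Counts quoted in the post.
IVY_BRIDGE_CPUS = 32_000
XEON_PHI_CARDS = 48_000
PHI_CORES_PER_CARD = 57
SYSTEM_MEMORY_TB = 1_024        # 1.024 PB of host memory
PHI_MEMORY_GB_PER_CARD = 8

# Assumption (not stated in the post): 12 cores per Ivy Bridge Xeon.
IVY_BRIDGE_CORES_PER_CPU = 12

cpu_cores = IVY_BRIDGE_CPUS * IVY_BRIDGE_CORES_PER_CPU
phi_cores = XEON_PHI_CARDS * PHI_CORES_PER_CARD
total_cores = cpu_cores + phi_cores
phi_core_share = phi_cores / total_cores

# Decimal units: 1 TB = 1,000 GB and 1 PB = 1,000 TB.
phi_memory_tb = XEON_PHI_CARDS * PHI_MEMORY_GB_PER_CARD / 1_000
total_memory_pb = (SYSTEM_MEMORY_TB + phi_memory_tb) / 1_000

print(f"Total cores: {total_cores:,} ({phi_core_share:.1%} on Xeon Phi)")
print(f"Memory including Phi cards: {total_memory_pb:.3f} PB")
```

With the assumed 12-core CPUs, the core count lands exactly on the 3,120,000 figure, and the Phi cards account for the large majority of it.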
Inspur claims up to 3.432 TFlops of peak compute performance per compute node, for a total theoretical compute power of 54,912 TFlops (or 54.912 Petaflops) across the entire supercomputer. For simplicity, the company counts one node as 2 Ivy Bridge chips, 64GB of memory, and 3 Xeon Phi cards, although the two compute modules that make up a node are not physically laid out that way. In the latest Linpack benchmark run, researchers saw roughly 62% efficiency in attaining peak performance (30.65 PFlops out of 49.19 PFlops peak/theoretical performance) when using only 14,336 nodes with 50GB RAM each. Further testing and optimization should improve that number, and when all nodes are brought online the real world performance will naturally be higher than the current benchmarks. With that said, the Tianhe-2 is already besting Cray's TITAN, which is promising (though I hope Cray comes back next year and takes the crown again, heh).
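As a quick sanity check on those numbers, the sketch below recomputes the node count, the theoretical peak, and the Linpack efficiency from the figures quoted in this post. The variable names are illustrative only.

```python
# Figures quoted in the post.
RACKS = 125
NODES_PER_RACK = 128
TFLOPS_PER_NODE = 3.432
BENCHMARK_NODES = 14_336
LINPACK_PFLOPS = 30.65
QUOTED_BENCHMARK_PEAK_PFLOPS = 49.19

total_nodes = RACKS * NODES_PER_RACK
full_peak_pflops = total_nodes * TFLOPS_PER_NODE / 1_000

# Peak for the partial machine used in the benchmark run, derived from the
# per-node figure; it comes out very close to the quoted 49.19 PFlops.
benchmark_peak_pflops = BENCHMARK_NODES * TFLOPS_PER_NODE / 1_000

# Efficiency = sustained Linpack result / theoretical peak of the nodes used.
efficiency = LINPACK_PFLOPS / QUOTED_BENCHMARK_PEAK_PFLOPS

print(f"Nodes: {total_nodes:,}")
print(f"Full-system peak: {full_peak_pflops:.3f} PFlops")
print(f"Benchmark-run peak: {benchmark_peak_pflops:.2f} PFlops")
print(f"Linpack efficiency: {efficiency:.1%}")
```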
In order to keep all of this hardware cool, Inspur is planning a custom liquid cooling system using chilled water. The Tianhe-2 will draw up to 17.6 MW of power under load. Once the liquid cooling system is included, the total draw under load rises to 24 MW.
This is an impressive system, and an interesting take on a supercomputer architecture considering the rise in popularity of heterogeneous architectures that pair massive numbers of CPUs with graphics processing units (GPUs).
The Tianhe-2 supercomputer will be reconstructed at its permanent home at the National Supercomputer Center in Guangzhou, China once the testing phase is finished. It will be one of the top supercomputers in the world once it is fully online! HPC Wire has a nice article with slides and further details on the upcoming processing powerhouse that is worth a read if you are into this sort of HPC stuff.
Also read: Cray unveils the TITAN supercomputer.
Subject: General Tech, Graphics Cards | March 20, 2013 - 01:47 PM | Tim Verry
Tagged: tesla, tegra 3, supercomputer, pedraforca, nvidia, GTC 2013, GTC, graphics cards, data centers
There is a lot of talk about heterogeneous computing at GTC, in the sense of adding graphics cards to servers. If you have HPC workloads that can benefit from GPU parallelism, adding GPUs gives you computing performance in less physical space, and using less power, than a CPU only cluster (for equivalent TFLOPS).
However, there was a session at GTC that actually took things to the opposite extreme. Instead of a CPU only cluster or a mixed cluster, Alex Ramirez (leader of Heterogeneous Architectures Group at Barcelona Supercomputing Center) is proposing a homogeneous GPU cluster called Pedraforca.
Pedraforca V2 combines NVIDIA Tesla GPUs with low power ARM processors. Each node is composed of the following components:
- 1 x Mini-ITX carrier board
- 1 x Q7 module (which hosts the ARM SoC and memory)
  - Current config is one Tegra 3 @ 1.3GHz and 2GB DDR2
- 1 x NVIDIA Tesla K20 accelerator card (1170 GFLOPS)
- 1 x InfiniBand 40Gb/s card (via Mellanox ConnectX-3 slot)
- 1 x 2.5" SSD (SATA 3 MLC, 250GB)
The ARM processor is used solely for booting the system and facilitating GPU communication between nodes. It is not intended to be used for computing. According to Dr. Ramirez, in situations where running code on a CPU would be faster, it would be best to have a small number of Intel Xeon powered nodes to do the CPU-favorable computing, and then offload the parallel workloads to the GPU cluster over the InfiniBand connection (though this is less than ideal; Pedraforca would be most efficient with data sets that can be processed solely on the Tesla cards).
While Pedraforca is not necessarily locked to NVIDIA's Tegra hardware, it is currently the only SoC that meets their needs. The system requires the ARM chip to have PCI-E support. The Tegra 3 SoC has four PCI-E lanes, so the carrier board is using two PLX chips to allow the Tesla and InfiniBand cards to both be connected.
The researcher stated that he is also looking forward to using NVIDIA's upcoming Logan processor in the Pedraforca cluster. It will reportedly be possible to upgrade existing Pedraforca clusters with the new chips by replacing the existing (Tegra 3) Q7 module with one that has the Logan SoC when it is released.
Pedraforca V2 has an initial cluster size of 64 nodes. While the speaker was reluctant to provide TFLOPS performance numbers, as it would depend on the workload, with 64 Tesla K20 cards, it should provide respectable performance. The intent of the cluster is to save power costs by using a low power CPU. If your server kernel and applications can run on GPUs alone, there are noticeable power savings to be had by switching from a ~100W Intel Xeon chip to a lower-power (approximately 2-3W) Tegra 3 processor. If you have a kernel that needs to run on a CPU, it is recommended to run the OS on an Intel server and transfer just the GPU work to the Pedraforca cluster. Each Pedraforca node is reportedly under 300W, with the Tesla card being the majority of that figure. Despite the limitations, and niche nature of the workloads and software necessary to get the full power-saving benefits, Pedraforca is certainly an interesting take on a homogeneous server cluster!
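Although no measured performance numbers were given, a back-of-the-envelope ceiling can be worked out from the per-node specs listed above. The sketch below is only that: it multiplies the K20's quoted peak by the node count and treats the "under 300W" figure as an upper bound, so real workloads will land below the peak and the cluster will likely draw less than the power figure shown.

```python
# Figures quoted in the post.
NODES = 64
K20_PEAK_GFLOPS = 1170      # per-card peak from the node component list
MAX_NODE_WATTS = 300        # "under 300W" per node, treated as an upper bound

# Theoretical peak only; sustained performance depends on the workload.
peak_tflops = NODES * K20_PEAK_GFLOPS / 1_000

# Upper bound on cluster power, ignoring switches, cooling, and storage.
max_cluster_kw = NODES * MAX_NODE_WATTS / 1_000

# Peak efficiency at the power upper bound, so a lower bound on peak GFLOPS/W.
min_peak_gflops_per_watt = K20_PEAK_GFLOPS / MAX_NODE_WATTS

print(f"Theoretical peak: {peak_tflops:.2f} TFLOPS")
print(f"Power upper bound: {max_cluster_kw:.1f} kW")
print(f"Peak efficiency: at least {min_peak_gflops_per_watt:.1f} GFLOPS/W")
```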
In another session relating to the path to exascale computing, power use in data centers was listed as one of the biggest hurdles to getting to Exaflop-levels of performance, and while Pedraforca is not the answer to Exascale, it should at least be a useful learning experience at wringing the most parallelism out of code and pushing GPGPU to the limits. And that research will help other clusters use the GPUs more efficiently as researchers explore the future of computing.
The Pedraforca project built upon research conducted on Tibidabo, a multi-core ARM CPU cluster, and CARMA (CUDA on ARM development kit) which is a Tegra SoC paired with an NVIDIA Quadro card. The two slides below show CARMA benchmarks and a Tibidabo cluster (click on image for larger version).
Stay tuned to PC Perspective for more GTC 2013 coverage!
Subject: General Tech | November 12, 2012 - 06:29 AM | Tim Verry
Tagged: tesla, supercomputer, nvidia, k20x, HPC, CUDA, computing
Graphics card manufacturer NVIDIA launched a new Tesla K20X accelerator card today that supplants the existing K20 as the top of the line model. The new card cranks up the double and single precision floating point performance, beefs up the memory capacity and bandwidth, and brings some efficiency improvements to the supercomputer space.
While it is not yet clear how many CUDA cores the K20X has, NVIDIA has stated that it is using the GK110 GPU, and is running with 6GB of memory with 250 GB/s of bandwidth – a nice improvement over the K20’s 5GB at 208 GB/s. Both the new K20X and K20 accelerator cards are based on the company’s Kepler architecture, but NVIDIA has managed to wring out more performance from the K20X. The K20 is rated at 1.17 TFlops peak double precision and 3.52 TFlops peak single precision while the K20X is rated at 1.31 TFlops and 3.95 TFlops.
The K20X manages to score 1.22 TFlops in DGEMM, which puts it at almost three times faster than the previous generation Tesla M2090 accelerator based on the Fermi architecture.
Aside from pure performance, NVIDIA is also touting efficiency gains with the new K20X accelerator card. When two K20X cards are paired with a 2P Sandy Bridge server, NVIDIA claims to achieve 76% efficiency versus 61% efficiency with a 2P Sandy Bridge server equipped with two previous generation M2090 accelerator cards. Additionally, NVIDIA claims to have enabled the Titan supercomputer to reach the #1 spot on the top 500 green supercomputers thanks to its new cards with a rating of 2,120.16 MFLOPS/W (million floating point operations per second per watt).
NVIDIA claims to have already shipped 30 PFLOPS worth of GPU accelerated computing power. Interestingly, most of that computing power is housed in the recently unveiled Titan supercomputer. This supercomputer contains 18,688 Tesla K20X (Kepler GK110) GPUs and 18,688 16-core AMD Opteron 6274 processors, for a total of 299,008 CPU cores. It will consume 9 megawatts of power and is rated at a peak of 27 Petaflops and 17.59 Petaflops during a sustained Linpack benchmark. Further, when compared to Sandy Bridge processors, the K20 series offers up between 8.2 and 18.1 times more performance at several scientific applications.
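To see how much of Titan's rated peak comes from the GPUs, here is a small Python sketch that uses the K20X count and its 1.31 TFlops peak double precision rating from this post. The variable names are illustrative, and the result is a theoretical figure rather than anything measured.

```python
# Figures quoted in the post.
K20X_COUNT = 18_688
K20X_PEAK_DP_TFLOPS = 1.31
TITAN_PEAK_PFLOPS = 27.0
TITAN_LINPACK_PFLOPS = 17.59

# Theoretical double precision peak contributed by the GPUs alone.
gpu_peak_pflops = K20X_COUNT * K20X_PEAK_DP_TFLOPS / 1_000
gpu_share_of_peak = gpu_peak_pflops / TITAN_PEAK_PFLOPS

# Fraction of the rated peak achieved in the sustained Linpack run.
linpack_efficiency = TITAN_LINPACK_PFLOPS / TITAN_PEAK_PFLOPS

print(f"GPU peak: {gpu_peak_pflops:.2f} PFlops "
      f"({gpu_share_of_peak:.0%} of rated peak)")
print(f"Linpack efficiency: {linpack_efficiency:.1%}")
```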
While the Tesla cards undoubtedly use more power than CPUs, you need far fewer accelerator cards than processors to hit the same performance numbers. That is where NVIDIA is getting its power efficiency numbers from.
NVIDIA is aiming the accelerator cards at researchers and businesses doing 3D graphics, visual effects, high performance computing, climate modeling, molecular dynamics, earth science, simulations, fluid dynamics, and other such computationally intensive tasks. Using CUDA and the parallel nature of the GPU, the Tesla cards can achieve performance much higher than a CPU-only system can. NVIDIA has also engineered features to better parallelize workloads and keep the GPU accelerators fed with data, which the company calls Dynamic Parallelism and Hyper-Q respectively.
It is interesting to see NVIDIA bring out a new flagship, especially another GK110 card. Systems using the K20 and the new K20X are available now with cards shipping this week and general availability later this month.
You can find the full press release below and a look at the GK110 GPU in our preview.
Anandtech also managed to get a look inside the Titan supercomputer at Oak Ridge National Laboratory, where you can see the Tesla K20X cards in action.
Subject: Systems | May 24, 2011 - 09:07 PM | Tim Verry
Tagged: tesla, supercomputer, petaflop, HPC, bulldozer
Cray has been a huge name in the supercomputer market for years, and with the new XK6 they are promising to deliver a supercomputer capable of 50 Thousand Trillion operations per second (50 petaflops). Powered by AMD Opteron CPUs and NVIDIA GPUs, each XK6 blade comprises 2 Gemini interconnects pairing four AMD Opteron CPUs with four NVIDIA Tesla X2090 embedded graphics cards. The graphics cards in each blade have access to 6GB of GDDR5 memory, and are connected via PCI-E 2.0 links to the Opteron processors. The CPUs have access to four DDR3 memory slots “running at 1.6GHz for every G34 socket,” according to The Register. This amounts to 32GB per two-socket node when using 4GB sticks.
Cray plans to wait until AMD releases the 16 core 32nm Opteron CPUs, dubbed the Opteron 6200s, in Q3. The Register quotes AMD’s CEO Thomas Seifert as promising that the processors, which are based on the new Bulldozer cores and will be compatible with the current G34 sockets, “would ship by summer.”
Further, they claim that Cray’s goal with the XK6 was to keep the new blades within the same thermal boundaries as its predecessor, despite the inclusion of GPUs into the mix. Cray has indicated that, due to their success in remaining within the thermal envelope, their customers will be able to use XE6 and XK6 blades interchangeably and will allow them to customize their supercomputer load-out to meet the demands of their specific computing workloads.
Each cabinet is capable of storing up to 24 blades, and can deliver up to 50 kilowatts of power. Each of the Tesla X2090 GPUs is capable of 665 gigaflops during double-precision floating point operations, something that GPUs excel at. As each XK6 blade contains 4 GPUs, and each cabinet can hold 24 blades, customers are looking at 63.8 teraflops of computing power solely from the graphics cards. On the CPU side of things, Cray is not able to release specifications on the processors as AMD has yet to deliver the chips in question. The Register estimates that each XK6 blade will provide 3.5 teraflops of floating point computing power, which amounts to approximately 84 teraflops per cabinet.
With a claimed capability to utilize up to 300 cabinets full of XK6 blades, customers are looking at approximately 44 petaflops of computing horsepower, with GPUs delivering 19.14 petaflops, and the CPUs estimated to provide 25.2 petaflops of floating point computational power.
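The cabinet and full-system figures follow from simple multiplication, sketched below in Python using only numbers quoted in this post (the 3.5 teraflops per blade is The Register's estimate, not a Cray specification). Note that the script carries the unrounded 63.84 teraflops per cabinet forward, so its GPU total comes out marginally higher than the 19.14 petaflops quoted above, which was derived from the rounded 63.8.

```python
# Figures quoted in the post.
GPUS_PER_BLADE = 4
BLADES_PER_CABINET = 24
GPU_DP_GFLOPS = 665
CPU_TFLOPS_PER_BLADE = 3.5      # The Register's estimate
MAX_CABINETS = 300

# Per-cabinet throughput from GPUs and from the estimated CPU figure.
gpu_tflops_per_cabinet = GPUS_PER_BLADE * BLADES_PER_CABINET * GPU_DP_GFLOPS / 1_000
cpu_tflops_per_cabinet = BLADES_PER_CABINET * CPU_TFLOPS_PER_BLADE

# Scale up to the maximum claimed configuration.
gpu_pflops = gpu_tflops_per_cabinet * MAX_CABINETS / 1_000
cpu_pflops = cpu_tflops_per_cabinet * MAX_CABINETS / 1_000
total_pflops = gpu_pflops + cpu_pflops

print(f"Per cabinet: {gpu_tflops_per_cabinet:.2f} TF (GPU), "
      f"{cpu_tflops_per_cabinet:.0f} TF (CPU)")
print(f"{MAX_CABINETS} cabinets: {gpu_pflops:.2f} PF (GPU) + "
      f"{cpu_pflops:.1f} PF (CPU) = {total_pflops:.1f} PF")
```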
The first customer of this system will be the Swiss National Supercomputing Centre. According to the Seattle Times, the center’s director Professor Thomas Schulthess stated that they chose the Cray XK6 based supercomputer not for its raw performance, but because “the Cray XK6 promises to be the first general-purpose supercomputer based on GPU technology, and we are very much looking forward to exploring its performance and productivity on real applications relevant to our scientists.”