Subject: Processors | November 18, 2015 - 12:34 PM | Scott Michaud
Tagged: Xeon Phi, knights landing, Intel
The add-in board version of the Xeon Phi has just launched, which Intel aims at supercomputing audiences. They also announced that this product will be available as a socketed processor that is embedded in, as PC World states, “a limited number of workstations” by the first half of next year. The interesting part about these processors is that they combine a GPU-like architecture with the x86 instruction set.
Image Credit: Intel (Developer Zone)
In the case of next year's socketed Knights Landing CPUs, you can even boot your OS with it (and no other processor installed). It will probably be a little like running a 72-core Atom-based netbook.
To make it a little more clear, Knights Landing is a 72-core, 512-bit processor. You might wonder how that can compete against a modern GPU, which has thousands of cores, but those are not really cores in the CPU sense. GPUs crunch massive amounts of calculations by essentially tying several cores together, and doing other tricks to minimize die area per effective instruction. NVIDIA ties 32 instructions together and pushes them down the silicon. As long as they don't diverge, you can get 32 independent computations for very little die area. AMD packs 64 together.
Knight's Landing does the same. The 512-bit registers can hold 16 single-precision (32-bit) values and operate on them simultaneously.
16 times 72 is 1152. All of a sudden, we're in shader-count territory. This is one of the reasons why they can achieve such high performance with “only” 72 cores, compared to the “thousands” that are present on GPUs. They're actually on a similar scale, just counted differently.
Update: (November 18th @ 1:51 pm EST) I just realized that, while I kept saying "one of the reasons", I never elaborated on the other points. Knights Landing also has four threads per core. So that "72 core" is actually "288 thread", with 512-bit registers that can perform sixteen 32-bit SIMD instructions simultaneously. While hyperthreading is not known to be 100% efficient, you could consider Knights Landing to be a GPU with 4608 shader units. Again, it's not the best way to count it, but it could sort-of work.
So in terms of raw performance, Knights Landing can crunch about 8 TeraFLOPs of single-precision performance or around 3 TeraFLOPs of double-precision, 64-bit performance. This is around 30% faster than the Titan X in single precision, and around twice the performance of Titan Black in double precision. NVIDIA basically removed the FP64 compute units from Maxwell / Titan X, so Knight's Landing is about 16x faster, but that's not really a fair comparison. NVIDIA recommends Kepler for double-precision workloads.
So interestingly, Knights Landing would be a top-tier graphics card (in terms of shading performance) if it was compatible with typical graphics APIs. Of course, it's not, and it will be priced way higher than, for instance, the AMD Radeon Fury X. Knight's Landing isn't available on Intel ARK yet, but previous models are in the $2000 - $4000 range.
Subject: General Tech | March 27, 2015 - 06:23 PM | Jeremy Hellstrom
Tagged: Xeon Phi, silvermont, knights landing, Intel
Today a bit more information about Intel's upcoming Knights Landing platform appeared at The Register. The 60 core and 240 thread figure is quoted once again though now we know there is over 8 billion transistors on the chip, which does not include the 16 GB of near memory also present on the package. The processor will support six memory channel, three each in two memory controllers on the die, with a total of 384 GB of far memory. The terms near and far are new, representing onboard and external memory respectively. There is a lot more information you can dig into by following the link on The Register to this long article posted at The Platform.
"Intel has set some rumours to rest, giving a media and analyst briefing outlining details of its coming 60-plus core Knights Landing Xeon Phi chip."
Here is some more Tech News from around the web:
- Silent server monitoring: A neat little cure that doesn't kill the patient @ The Register
- Qualcomm, MediaTek shifting some 28nm chip orders away from TSMC @ DigiTimes
- Quebec Plans To Require Website Blocking, Studies New Internet Access Tax @ Slashdot
- Intel and Micron team up on 3D NAND flash memory for 10TB SSDs @ The Inquirer
- Toshiba and Sandisk announce BiCS as the first 48-layer 3D flash chip @ The Inquirer
Subject: General Tech, Graphics Cards, Processors | July 19, 2014 - 07:05 AM | Scott Michaud
Tagged: Xeon Phi, xeon, Intel, avx-512, avx
It is difficult to know what is actually new information in this Intel blog post, but it is interesting none-the-less. Its topic is the AVX-512 extension to x86, designed for Xeon and Xeon Phi processors and co-processors. Basically, last year, Intel announced "Foundation", the minimum support level for AVX-512, as well as Conflict Detection, Exponential and Reciprocal, and Prefetch, which are optional. This, earlier blog post was very much focused on Xeon Phi, but it acknowledged that the instructions will make their way to standard, CPU-like Xeons at around the same time.
This year's blog post brings in a bit more information, especially for common Xeons. While all AVX-512-supporting processors (and co-processors) will support "AVX-512 Foundation", the instruction set extensions are a bit more scattered.
|Conflict Detection Instructions||Yes||Yes||Yes|
|Exponential and Reciprocal Instructions||No||Yes||Yes|
|Byte and Word Instructions||Yes||No||No|
|Doubleword and Quadword Instructions||Yes||No||No|
|Vector Length Extensions||Yes||No||No|
Source: Intel AVX-512 Blog Post (and my understanding thereof).
So why do we care? Simply put: speed. Vectorization, the purpose of AVX-512, has similar benefits to multiple cores. It is not as flexible as having multiple, unique, independent cores, but it is easier to implement (and works just fine with having multiple cores, too). For an example: imagine that you have to multiply two colors together. The direct way to do it is multiply red with red, green with green, blue with blue, and alpha with alpha. AMD's 3DNow! and, later, Intel's SSE included instructions to multiply two, four-component vectors together. This reduces four similar instructions into a single operating between wider registers.
Smart compilers (and programmers, although that is becoming less common as compilers are pretty good, especially when they are not fighting developers) are able to pack seemingly unrelated data together, too, if they undergo similar instructions. AVX-512 allows for sixteen 32-bit pieces of data to be worked on at the same time. If your pixel only has four, single-precision RGBA data values, but you are looping through 2 million pixels, do four pixels at a time (16 components).
For the record, I basically just described "SIMD" (single instruction, multiple data) as a whole.
This theory is part of how GPUs became so powerful at certain tasks. They are capable of pushing a lot of data because they can exploit similarities. If your task is full of similar problems, they can just churn through tonnes of data. CPUs have been doing these tricks, too, just without compromising what they do well.
Subject: General Tech | July 3, 2014 - 07:17 PM | Ken Addison
Tagged: podcast, video, evga, TORQ X10, Samsung, 850 PRO, ocz, RevoDrive 350, Silverstone, Nightjar, knights landing, Xeon Phi
PC Perspective Podcast #307 - 07/03/2014
Join us this week as we discuss the EVGA Torq X10 Mouse, Samsung 850 Pro, OCZ RevoDrive 350 and more!
The URL for the podcast is: http://pcper.com/podcast - Share with your friends!
- iTunes - Subscribe to the podcast directly through the Store
- RSS - Subscribe through your regular RSS reader
- MP3 - Direct download link to the MP3 file
Hosts: Ryan Shrout, Josh Walrath, Jeremy Hellstrom, and Morry Tietelman
Subject: General Tech, Graphics Cards, Processors | July 2, 2014 - 07:55 AM | Scott Michaud
Tagged: Intel, Xeon Phi, xeon, silvermont, 14nm
Anandtech has just published a large editorial detailing Intel's Knights Landing. Mostly, it is stuff that we already knew from previous announcements and leaks, such as one by VR-Zone from last November (which we reported on). Officially, few details were given back then, except that it would be available as either a PCIe-based add-in board or as a socketed, bootable, x86-compatible processor based on the Silvermont architecture. Its many cores, threads, and 512 bit registers are each pretty weak, compared to Haswell, for instance, but combine to about 3 TFLOPs of double precision performance.
Not enough graphs. Could use another 256...
The best way to imagine it is running a PC with a modern, Silvermont-based Atom processor -- only with up to 288 processors listed in your Task Manager (72 actual cores with quad HyperThreading).
The main limitation of GPUs (and similar coprocessors), however, is memory bandwidth. GDDR5 is often the main bottleneck of compute performance and just about the first thing to be optimized. To compensate, Intel is packaging up-to 16GB of memory (stacked DRAM) on the chip, itself. This RAM is based on "Hybrid Memory Cube" (HMC), developed by Micron Technology, and supported by the Hybrid Memory Cube Consortium (HMCC). While the actual memory used in Knights Landing is derived from HMC, it uses a proprietary interface that is customized for Knights Landing. Its bandwidth is rated at around 500GB/s. For comparison, the NVIDIA GeForce Titan Black has 336.4GB/s of memory bandwidth.
Intel and Micron have worked together in the past. In 2006, the two companies formed "IM Flash" to produce the NAND flash for Intel and Crucial SSDs. Crucial is Micron's consumer-facing brand.
So the vision for Knights Landing seems to be the bridge between CPU-like architectures and GPU-like ones. For compute tasks, GPUs edge out CPUs by crunching through bundles of similar tasks at the same time, across many (hundreds of, thousands of) computing units. The difference with (at least socketed) Xeon Phi processors is that, unlike most GPUs, Intel does not rely upon APIs, such as OpenCL, and drivers to translate a handful of functions into bundles of GPU-specific machine language. Instead, especially if the Xeon Phi is your system's main processor, it will run standard, x86-based software. The software will just run slowly, unless it is capable of vectorizing itself and splitting across multiple threads. Obviously, OpenCL (and other APIs) would make this parallelization easy, by their host/kernel design, but it is apparently not required.
It is a cool way that Intel arrives at the same goal, based on their background. Especially when you mix-and-match Xeons and Xeon Phis on the same computer, it is a push toward heterogeneous computing -- with a lot of specialized threads backing up a handful of strong ones. I just wonder if providing a more-direct method of programming will really help developers finally adopt massively parallel coding practices.
I mean, without even considering GPU compute, how efficient is most software at splitting into even two threads? Four threads? Eight threads? Can this help drive heterogeneous development? Or will this product simply try to appeal to those who are already considering it?
Subject: General Tech, Graphics Cards, Processors | November 28, 2013 - 08:30 AM | Scott Michaud
Tagged: Intel, Xeon Phi, gpgpu
Intel was testing the waters with their Xeon Phi co-processor. Based on the architecture designed for the original Pentium processors, it was released in six products ranging from 57 to 61 cores and 6 to 16GB of RAM. This lead to double precision performance of between 1 and 1.2 TFLOPs. It was fabricated using their 22nm tri-gate technology. All of this was under the Knights Corner initiative.
In 2015, Intel plans to have Knights Landing ready for consumption. A modified Silvermont architecture will replace the many simple (basically 15 year-old) cores of the previous generation; up to 72 Silvermont-based cores (each with 4 threads) in fact. It will introduce the AVX-512 instruction set. AVX-512 allows applications to vectorize 8 64-bit (double-precision float or long integer) or 16 32-bit (single-precision float or standard integer) values.
In other words, packing a bunch of related problems into a single instruction.
The most interesting part? Two versions will be offered: Add-In Boards (AIBs) and a standalone CPU. It will not require a host CPU, because of its x86 heritage, if your application is entirely suited for an MIC architecture; unlike a Tesla, it is bootable with existing and common OSes. It can also be paired with standard Xeon processors if you would like a few strong threads with the 288 (72 x 4) the Xeon Phi provides.
And, while I doubt Intel would want to cut anyone else in, VR-Zone notes that this opens the door for AIB partners to make non-reference cards and manage some level of customer support. I'll believe a non-Intel branded AIB only when I see it.
Subject: General Tech | November 20, 2013 - 05:53 PM | Jeremy Hellstrom
Tagged: Xeon Phi, knights landing, Intel, 14nm
Intel has been talking up the Xeon Phi, first of the Knight's Landing chips which shall arrive in the not too distant future. This new architecture is touted to bring a return of homogeneous systems architecture which will perform parallel processing on its many cores, currently 61 is the number being tossed around, at a level of performance that will exceed the GPU accelerated heterogeneous architecture being pushed by AMD and NVIDIA. Whether this is true or not remains to be seen but many server builders may prefer the familiar CPU only architecture and as at least some of the Phi's will be available in rack mounted form and not just addin cards they may choose Intel out of habit. You can also read about Micron's Automata Processor which The Register reports can outperform a 48-chip cluster of Intel Xeon 5650s in certain scenarios.
"From Intel's point of view, today's hottest trend in high-performance computing – GPU acceleration – is just a phase, one that will be superseded by the advent of many-core CPUs, beginning with Chipzilla's next-generation Xeon Phi, codenamed "Knights Landing"."
Here is some more Tech News from around the web:
- AMD makes more profit than Sony on the PS4 @ The Inquirer
- Decades ago, computing was saved by CMOS. Today, no hero is in sight @ The Register
- Who wants 10TB of FREE cloud storage? Hands down if China is a deal breaker? @ The Register
- Benchmarking Amazon's New EC2 "C3" Instance Types @ Phoronix
- What took you 2 years? LSI finally rolls out next-gen SandForce kit @ The Register
- Beginners Guide: How To Install / Remove an Intel Socket LGA2011 CPU @ PCSTATS
- Hardware.Info UK Awards 2013: vote and win a Zotac GeForce GTX 780 AMP! Edition
Subject: General Tech, Systems | September 9, 2013 - 01:00 PM | Tim Verry
Tagged: Xeon Phi, workstation, quadro, micron, LSI, k6000, Ivy Bridge-EP, firepro, dell
Along with the release of new mobile workstations, Dell announced three new desktop workstations. Specifically, Dell is launching the T3610, T5610, and T7610 PC workstations under its Precision series. The new systems reside in redesigned cases with improved cable management, removable power supplies (tool-less, removable by sliding out from rear panel), and in the case of the T7610 removable hard drives. All of the new Precision workstations have been outfitted with Intel's latest Ivy Bridge-EP based Xeon processors, ECC memory, workstation-class graphics cards from AMD and NVIDIA, Xeon Phi accelerator card options, LSI hardware RAID controllers, and updated software solutions from Intel and Dell.
The new Precision workstations side-by-side. From left to right: T3610, T5610, and T7610.
Dell's Precision T3610 is a the mid-tower system of the group powered by single socket Xeon E5-2600 v2 hardware that further supports up to 128GB DDR3 ECC memory, two graphics cards, three 3.5” hard drives, and four 2.5” SSDs.
The Precision T3610, a new single socket, mid-range workstation.
The Precision T5610 ups the ante to a dual socket IVB-EP processor system that can be configured with up to 128GB DDR3 ECC memory, two AMD FirePro or NVIDIA Quadro (e.g. Quadro K5000) graphics cards, a Tesla K20C accelerator card, three 3.5” hard drives, and four 2.5” solid state drives.
Finally, the T7610 workstation supports dual Intel Ivy Bridge-EP Xeon E5-2600 v2 series processors (up to 24 cores per system), up to 512GB DDR3 ECC memory, three graphics cards (including two NVIDIA Quadro K6000 cards), four 3.5” hard drives, and eight 2.5” SSDs.
Dell's Precision T5610 dual socket workstation.
The new Precision workstations can also be configured with an Intel Xeon Phi 3120A accelerator card in lieu of a Tesla card. The choice will mainly depend on the applications being used and the development resources and expertise available. Both options are designed to accelerate highly parallel workloads in applications that have been compiled to support them. Further, users can add an LSI hardware RAID card with 1GB of onboard memory to the systems. Dell further offers a Micron P320h PCI-E SSD that, while not bootable, offers up 350GB of high performance storage that excels at high sequential reads and writes.
On the software front, Dell is including the Dell Precision Performance Optimizer and the Intel Cache Acceleration Software. The former automatically configures and optimizes the workstation for specific applications based on profiles that are reportedly regularly updated. The other bit of software works to optimize systems that use both hard drives and SSDs with the SSDs as a cache for the mechanical storage. The Intel Cache Acceleration Software configures the caching algorithms to favor caching very large files on the solid state storage. It is a different approach to consumer caching strategies, but one that works well with businesses that use these workstations to process large data sets.
The Dell Precision T7610 workstation.
The Dell workstations are aimed at businesses doing scientific analysis, professional engineering, and complex 3D modeling. The T7610 in particular is aimed at the oil and gas industry for use in simulations and modeling as companies search for new oil deposits.
All three systems will be available for purchase worldwide beginning September 12th. Some of the options, such as 512GB of ECC and the NVIDIA Quadro K6000 on the T7610 will not be available until next month, however. The T3610 has a starting price of $1,099 while the T5610 and T7610 have starting prices of $2,729 and $3,059 respectively.
What are your thoughts on Dell's new mid-tower workstations?
Subject: General Tech, Processors, Systems | June 27, 2013 - 02:27 AM | Scott Michaud
Tagged: supercomputing, supercomputer, titan, Xeon Phi
The National Supercomputer Center in Guangzho, China, will host the the world's fastest supercomputer by the end of the year. The Tianhe-2, English: "Milky Way-2", is capable of nearly double the floating-point performance of Titan albeit with slightly less performance per watt. The Tianhe-2 was developed by China's National University of Defense Technology.
Photo Credit: Top500.org
Comparing new fastest computer with the former, China's Milky Way-2 is able to achieve 33.8627 PetaFLOPs of calculations from 17.808 MW of electricity. The Titan, on the other hand, is able to crunch 17.590 PetaFLOPs with a draw of just 8.209 MW. As such, the new Milky Way-2 uses 12.7% more power per FLOP than Titan.
Titan is famously based on the Kepler GPU architecture from NVIDIA, coupled with several 16-core AMD Opteron server processors clocked at 2.2 GHz. This concept of using accelerated hardware carried over into the design of Tianhe-2, which is based around Intel's Xeon Phi coprocessor. If you include the simplified co-processor cores of the Xeon Phi, the new champion is the sum of 3.12 million x86 cores and 1024 terabytes of memory.
... but will it run Crysis?
... if someone gets around to emulating DirectX in software, it very well could.
Subject: Systems | June 4, 2013 - 01:27 AM | Tim Verry
Tagged: Xeon Phi, tianhe-2, supercomputer, Ivy Bridge, HPC, China
A powerful new supercomputer constructed by Chinese company Inspur is currently in testing at the National University of Defense Technology. Called the Tianhe-2, the new supercomputer has 16,000 compute nodes and approximately 54 Petaflops of peak theoretical compute performance.
Destined for the National Supercomputer Center in Guangzhou, China, the open HPC platform will be used for education and research projects. The Tianhe-2 is composed of 125 racks with 128 compute nodes in each rack.
The compute nodes are broken down into two types: CPM and APU modules. One of each node type makes up a single compute board. The CPM module hosts four Intel Ivy Bridge processors, 128GB system memory, and a single Intel Xeon Phi accelerator card with 8GB of its own memory. Each APU module adds five Xeon Phi cards to every compute board. The compute boards (a CPM module + a APU module) contain two NICs that connect the various compute boards with Inspur's custom THExpress2 high bandwidth interconnects. Finally, the Tianhe-2 supercomputer will have access to 12.4 Petabytes of storage that is shared across all of the compute boards.
In all, the Tianhe-2 is powered by 32,000 Intel Ivy Bridge processors, 1.024 Petabytes of system memory (not counting Phi dedicated memory--which would make the total 1.404 PB), and 48,000 Intel Xeon Phi MIC (Many Integrated Cores) cards. That is a total of 3,120,000 processor cores (though keep in mind that number is primarily made up of the relatively simple individual Phi cores as there are 57 cores to each Phi card).
Inspur claims up to 3.432 TFlops of peak compute performance per compute node (which, for simplicity they break down as one node is 2 Ivy Bridge chips, 64GB memory, and 3 Xeon Phi cards although the two compute modules that make up a node are not physically laid out that way) for a total theoretical potential compute power of 54,912 TFlops (or 54.912 Petaflops) across the entire supercomputer. In the latest Linpack benchmark run, researchers saw up to 63% efficiency in attaining peak performance -- 30.65 PFlops out of 49.19 PFlops peak/theoretical performance -- when only using 14,336 nodes with 50GB RAM each. Further testing and optimization should improve that number, and when all nodes are brought online the real world performance will naturally be higher than the current benchmarks. With that said, the Tianhe-2 is already besting Cray's TITAN, which is promising (though I hope Cray comes back next year and takes the crown again, heh).
In order to keep all of this hardware cool, Inspur is planning a custom liquid cooling system using chilled water. The Tianhe-2 will draw up to 17.6 MW of power under load. Once the liquid cooling system is implemented the supercomputer will draw 24MW while under load.
This is an impressive system, and an interesting take on a supercomputer architecture considering the rise in popularity of heterogeneous architectures that pair massive numbers of CPUs with graphics processing units (GPUs).
The Tianhe-2 supercomputer will be reconstructed at its permanent home at the National Supercomputer Center in Guangzhou, China once the testing phase is finished. It will be one of the top supercomputers in the world once it is fully online! HPC Wire has a nice article with slides an further details on the upcoming processing powerhouse that is worth a read if you are into this sort of HPC stuff.
Also read: Cray unveils the TITAN supercomputer.