James Reinders Leaving Intel and What It Means

Subject: Processors | June 8, 2016 - 12:17 PM |
Tagged: Xeon Phi, Intel, gpgpu

Intel's recent restructuring had a much broader impact than I originally believed. Beyond the large number of employees who will lose their jobs, we're even seeing it affect other areas of the industry. Typically, ASUS releases its ZenFone line with x86 processors, which I assumed was based on big subsidies from Intel to push its instruction set into new product categories. This year, ASUS chose the ARM-based Qualcomm Snapdragon, which suggested to me that Intel had decided to stop the bleeding.

reinders148x148.jpg

That brings us to today's news. After over 27 years at Intel, James Reinders has accepted the company's early retirement offer, scheduled for his 10,001st day with the company, and will step down from his position as Intel's High Performance Computing Director. He worked on the Larrabee and Xeon Phi initiatives, and published several books on parallelism.

According to his letter, it sounds like his retirement offer was part of a company-wide package, and not targeting his division specifically. That would sort of make sense, because Intel is focusing on cloud and IoT. Xeon Phi is an area where Intel is battling NVIDIA for high-performance servers, and I would expect that it has potential for cloud-based applications. Then again, as I say that, AWS only has a handful of GPU instances, and they are running fairly old hardware at that, so maybe the demand isn't there yet.

Intel Launches Knights Landing-based Xeon Phi AIBs

Subject: Processors | November 18, 2015 - 12:34 PM |
Tagged: Xeon Phi, knights landing, Intel

The add-in board version of the Xeon Phi has just launched, which Intel aims at supercomputing audiences. They also announced that this product will be available as a socketed processor that is embedded in, as PC World states, “a limited number of workstations” by the first half of next year. The interesting part about these processors is that they combine a GPU-like architecture with the x86 instruction set.

intel-2015-KNL die.jpg

Image Credit: Intel (Developer Zone)

In the case of next year's socketed Knights Landing CPUs, you can even boot your OS with it (and no other processor installed). It will probably be a little like running a 72-core Atom-based netbook.

To make it a little clearer, Knights Landing is a 72-core processor with 512-bit vector registers. You might wonder how that can compete against a modern GPU, which has thousands of cores, but those are not really cores in the CPU sense. GPUs crunch massive amounts of calculations by essentially tying several cores together, and doing other tricks to minimize die area per effective instruction. NVIDIA ties 32 threads together (a "warp") and pushes the same instruction down the silicon for all of them. As long as they don't diverge, you can get 32 independent computations for very little die area. AMD packs 64 together.

Knights Landing does the same. The 512-bit registers can hold 16 single-precision (32-bit) values and operate on them simultaneously.

16 times 72 is 1152. All of a sudden, we're in shader-count territory. This is one of the reasons why they can achieve such high performance with “only” 72 cores, compared to the “thousands” that are present on GPUs. They're actually on a similar scale, just counted differently.

Update: (November 18th @ 1:51 pm EST) I just realized that, while I kept saying "one of the reasons", I never elaborated on the other points. Knights Landing also has four threads per core. So that "72 core" is actually "288 thread", with 512-bit registers that can operate on sixteen 32-bit values simultaneously. While hyperthreading is not known to be 100% efficient, you could consider Knights Landing to be a GPU with 4608 shader units. Again, it's not the best way to count it, but it could sort of work.
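To make the counting above concrete, here is a minimal sketch of the arithmetic in Python. The function and constant names are my own, purely for illustration; the inputs are the figures quoted in this post, and the result is a way of counting, not a performance measurement.

```python
# Figures quoted in the post for Knights Landing.
CORES = 72
REGISTER_BITS = 512
VALUE_BITS = 32          # one single-precision float
THREADS_PER_CORE = 4


def simd_lanes(register_bits: int, value_bits: int) -> int:
    """Number of values that fit side by side in one vector register."""
    return register_bits // value_bits


def lane_count(cores: int, lanes: int, threads_per_core: int = 1) -> int:
    """GPU-style 'shader' count: every lane of every core (or thread) counted once."""
    return cores * lanes * threads_per_core


lanes = simd_lanes(REGISTER_BITS, VALUE_BITS)
per_core_total = lane_count(CORES, lanes)
per_thread_total = lane_count(CORES, lanes, THREADS_PER_CORE)
print(lanes, per_core_total, per_thread_total)  # 16 1152 4608
```

Counting per thread is the generous version, since the four threads on a core share the same execution hardware.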

So in terms of raw performance, Knights Landing can crunch about 8 TeraFLOPs of single-precision performance or around 3 TeraFLOPs of double-precision, 64-bit performance. This is around 30% faster than the Titan X in single precision, and around twice the performance of Titan Black in double precision. NVIDIA basically removed the FP64 compute units from Maxwell / Titan X, so Knights Landing is about 16x faster, but that's not really a fair comparison. NVIDIA recommends Kepler for double-precision workloads.

So interestingly, Knights Landing would be a top-tier graphics card (in terms of shading performance) if it were compatible with typical graphics APIs. Of course, it's not, and it will be priced way higher than, for instance, the AMD Radeon Fury X. Knights Landing isn't available on Intel ARK yet, but previous models are in the $2000 - $4000 range.

Source: PC World

Learn a bit more about Knights Landing

Subject: General Tech | March 27, 2015 - 06:23 PM |
Tagged: Xeon Phi, silvermont, knights landing, Intel

Today a bit more information about Intel's upcoming Knights Landing platform appeared at The Register. The 60 core and 240 thread figure is quoted once again, though now we know there are over 8 billion transistors on the chip, which does not include the 16 GB of near memory also present on the package. The processor will support six memory channels, three each in two memory controllers on the die, with a total of 384 GB of far memory. The terms near and far are new, representing on-package and external memory respectively. There is a lot more information you can dig into by following the link on The Register to this long article posted at The Platform.

intel-knights-landing-sled-new.jpg

"Intel has set some rumours to rest, giving a media and analyst briefing outlining details of its coming 60-plus core Knights Landing Xeon Phi chip."

Source: The Register

Intel AVX-512 Expanded

Subject: General Tech, Graphics Cards, Processors | July 19, 2014 - 07:05 AM |
Tagged: Xeon Phi, xeon, Intel, avx-512, avx

It is difficult to know what is actually new information in this Intel blog post, but it is interesting nonetheless. Its topic is the AVX-512 extension to x86, designed for Xeon and Xeon Phi processors and co-processors. Basically, last year, Intel announced "Foundation", the minimum support level for AVX-512, as well as Conflict Detection, Exponential and Reciprocal, and Prefetch, which are optional. That earlier blog post was very much focused on Xeon Phi, but it acknowledged that the instructions will make their way to standard, CPU-like Xeons at around the same time.

Intel_Xeon_Phi_Family.jpg

This year's blog post brings in a bit more information, especially for common Xeons. While all AVX-512-supporting processors (and co-processors) will support "AVX-512 Foundation", the instruction set extensions are a bit more scattered.

 
Instruction Group                        Xeon         Xeon Phi     Xeon Phi
                                         Processors   Processors   Coprocessors (AIBs)
Foundation Instructions                  Yes          Yes          Yes
Conflict Detection Instructions          Yes          Yes          Yes
Exponential and Reciprocal Instructions  No           Yes          Yes
Prefetch Instructions                    No           Yes          Yes
Byte and Word Instructions               Yes          No           No
Doubleword and Quadword Instructions     Yes          No           No
Vector Length Extensions                 Yes          No           No

Source: Intel AVX-512 Blog Post (and my understanding thereof).

So why do we care? Simply put: speed. Vectorization, the purpose of AVX-512, has similar benefits to multiple cores. It is not as flexible as having multiple, unique, independent cores, but it is easier to implement (and works just fine with having multiple cores, too). For example, imagine that you have to multiply two colors together. The direct way to do it is to multiply red with red, green with green, blue with blue, and alpha with alpha. AMD's 3DNow! and, later, Intel's SSE included instructions to multiply two four-component vectors together. This reduces four similar instructions to a single operation on wider registers.

Smart compilers (and programmers, although that is becoming less common as compilers are pretty good, especially when they are not fighting developers) are able to pack seemingly unrelated data together, too, if they undergo the same operations. AVX-512 allows for sixteen 32-bit pieces of data to be worked on at the same time. If your pixel only has four single-precision RGBA data values, but you are looping through 2 million pixels, do four pixels at a time (16 components).
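As a rough illustration of that packing, here is a pure-Python sketch. It only simulates the grouping: each 16-component slice stands in for what would be one vector multiply on real hardware. The function names and the sample pixel values are hypothetical, and nothing here touches actual AVX-512 instructions.

```python
LANES = 16  # sixteen 32-bit values per 512-bit register, per the post


def multiply_scalar(a, b):
    """One multiply per component: the 'direct way'."""
    return [x * y for x, y in zip(a, b)]


def multiply_packed(pixels_a, pixels_b, lanes=LANES):
    """Flatten RGBA pixels and process `lanes` components per simulated vector op."""
    flat_a = [c for p in pixels_a for c in p]
    flat_b = [c for p in pixels_b for c in p]
    out, vector_ops = [], 0
    for i in range(0, len(flat_a), lanes):
        # On real hardware this whole slice would be a single vector multiply.
        out.extend(multiply_scalar(flat_a[i:i + lanes], flat_b[i:i + lanes]))
        vector_ops += 1
    # Regroup the flat result into four-component pixels.
    pixels = [tuple(out[i:i + 4]) for i in range(0, len(out), 4)]
    return pixels, vector_ops


a = [(1.0, 0.5, 0.25, 1.0)] * 8
b = [(0.5, 0.5, 0.5, 1.0)] * 8
result, ops = multiply_packed(a, b)
print(result[0], ops)  # (0.5, 0.25, 0.125, 1.0) 2
```

Eight pixels are 32 components, so the loop runs twice instead of the 32 individual multiplies the direct approach would need.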

For the record, I basically just described "SIMD" (single instruction, multiple data) as a whole.

This theory is part of how GPUs became so powerful at certain tasks. They are capable of pushing a lot of data because they can exploit similarities. If your task is full of similar problems, they can just churn through tonnes of data. CPUs have been doing these tricks, too, just without compromising what they do well.

Source: Intel

Podcast #307 - EVGA Torq X10 Mouse, Samsung 850 Pro, OCZ RevoDrive 350 and more!

Subject: General Tech | July 3, 2014 - 07:17 PM |
Tagged: podcast, video, evga, TORQ X10, Samsung, 850 PRO, ocz, RevoDrive 350, Silverstone, Nightjar, knights landing, Xeon Phi

PC Perspective Podcast #307 - 07/03/2014

Join us this week as we discuss the EVGA Torq X10 Mouse, Samsung 850 Pro, OCZ RevoDrive 350 and more!

You can subscribe to us through iTunes and you can still access it directly through the RSS page HERE.

The URL for the podcast is: http://pcper.com/podcast - Share with your friends!

  • iTunes - Subscribe to the podcast directly through the iTunes Store
  • RSS - Subscribe through your regular RSS reader
  • MP3 - Direct download link to the MP3 file

Hosts: Ryan Shrout, Josh Walrath, Jeremy Hellstrom, and Morry Teitelman

Program length: 1:19:27

Subscribe to the PC Perspective YouTube Channel for more videos, reviews and podcasts!!

 

Intel's Knights Landing (Xeon Phi, 2015) Details

Subject: General Tech, Graphics Cards, Processors | July 2, 2014 - 07:55 AM |
Tagged: Intel, Xeon Phi, xeon, silvermont, 14nm

AnandTech has just published a large editorial detailing Intel's Knights Landing. Mostly, it is stuff that we already knew from previous announcements and leaks, such as one by VR-Zone from last November (which we reported on). Officially, few details were given back then, except that it would be available as either a PCIe-based add-in board or as a socketed, bootable, x86-compatible processor based on the Silvermont architecture. Its many cores, threads, and 512-bit registers are each pretty weak, compared to Haswell, for instance, but combine to about 3 TFLOPs of double-precision performance.

itsbeautiful.png

Not enough graphs. Could use another 256...

The best way to imagine it is running a PC with a modern, Silvermont-based Atom processor -- only with up to 288 processors listed in your Task Manager (72 actual cores with quad HyperThreading).

The main limitation of GPUs (and similar coprocessors), however, is memory bandwidth. GDDR5 is often the main bottleneck of compute performance and just about the first thing to be optimized. To compensate, Intel is packaging up to 16GB of memory (stacked DRAM) on the chip itself. This RAM is based on "Hybrid Memory Cube" (HMC), developed by Micron Technology, and supported by the Hybrid Memory Cube Consortium (HMCC). While the actual memory used in Knights Landing is derived from HMC, it uses a proprietary interface that is customized for Knights Landing. Its bandwidth is rated at around 500GB/s. For comparison, the NVIDIA GeForce Titan Black has 336.4GB/s of memory bandwidth.

Intel and Micron have worked together in the past. In 2006, the two companies formed "IM Flash" to produce the NAND flash for Intel and Crucial SSDs. Crucial is Micron's consumer-facing brand.

intel-knights-landing.jpg

So the vision for Knights Landing seems to be the bridge between CPU-like architectures and GPU-like ones. For compute tasks, GPUs edge out CPUs by crunching through bundles of similar tasks at the same time, across many (hundreds of, thousands of) computing units. The difference with (at least socketed) Xeon Phi processors is that, unlike most GPUs, Intel does not rely upon APIs, such as OpenCL, and drivers to translate a handful of functions into bundles of GPU-specific machine language. Instead, especially if the Xeon Phi is your system's main processor, it will run standard, x86-based software. The software will just run slowly, unless it is capable of vectorizing itself and splitting across multiple threads. Obviously, OpenCL (and other APIs) would make this parallelization easy, by their host/kernel design, but it is apparently not required.
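To show what "splitting across multiple threads" looks like without a host/kernel API, here is a minimal sketch using only Python's standard library. The helper names and the sum-of-squares workload are hypothetical. In CPython, threads will not actually speed up a pure-Python loop like this one; the point is the shape of the split, which is what ordinary x86 software would have to do on its own to keep 288 threads busy.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, parts):
    """Split data into `parts` nearly equal contiguous slices."""
    size, extra = divmod(len(data), parts)
    start = 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        yield data[start:end]
        start = end


def sum_of_squares(chunk):
    # This inner loop is the part a vectorizing compiler would map to SIMD lanes.
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=8):
    """Hand one slice to each worker thread, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunked(data, workers)))


data = list(range(1000))
total = parallel_sum_of_squares(data)
print(total)  # 332833500
```

The outer split covers the threads and the inner loop covers the vector lanes, so a program needs both levels to use a chip like this well.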

It is a cool way that Intel arrives at the same goal, based on their background. Especially when you mix-and-match Xeons and Xeon Phis on the same computer, it is a push toward heterogeneous computing -- with a lot of specialized threads backing up a handful of strong ones. I just wonder if providing a more-direct method of programming will really help developers finally adopt massively parallel coding practices.

I mean, without even considering GPU compute, how efficient is most software at splitting into even two threads? Four threads? Eight threads? Can this help drive heterogeneous development? Or will this product simply try to appeal to those who are already considering it?

Source: Intel

Intel Xeon Phi to get Serious Refresh in 2015?

Subject: General Tech, Graphics Cards, Processors | November 28, 2013 - 08:30 AM |
Tagged: Intel, Xeon Phi, gpgpu

Intel was testing the waters with their Xeon Phi co-processor. Based on the architecture designed for the original Pentium processors, it was released in six products ranging from 57 to 61 cores and 6 to 16GB of RAM. This led to double-precision performance of between 1 and 1.2 TFLOPs. It was fabricated using their 22nm tri-gate technology. All of this was under the Knights Corner initiative.

Intel_Xeon_Phi_Family.jpg

In 2015, Intel plans to have Knights Landing ready for consumption. A modified Silvermont architecture will replace the many simple (basically 15-year-old) cores of the previous generation; up to 72 Silvermont-based cores (each with 4 threads), in fact. It will introduce the AVX-512 instruction set. AVX-512 allows applications to operate on vectors of eight 64-bit (double-precision float or long integer) or sixteen 32-bit (single-precision float or standard integer) values.

In other words, packing a bunch of related problems into a single instruction.
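A quick way to sanity-check those widths is to ask Python's `struct` module how many bytes each layout occupies. This is only a size check under the assumption that the values are packed back to back with standard sizes; it says nothing about how the hardware executes them.

```python
import struct

# Each layout is a count plus a struct type code; "<" selects standard sizes with no padding.
layouts = {
    "8 x 64-bit double": "<8d",
    "8 x 64-bit integer": "<8q",
    "16 x 32-bit float": "<16f",
    "16 x 32-bit integer": "<16i",
}

# struct.calcsize returns bytes, so multiply by 8 to get bits.
sizes = {name: struct.calcsize(fmt) * 8 for name, fmt in layouts.items()}
for name, bits in sizes.items():
    print(f"{name}: {bits} bits")
```

Every layout comes out to 512 bits, which is why the same register can be treated as either eight wide values or sixteen narrow ones.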

The most interesting part? Two versions will be offered: Add-In Boards (AIBs) and a standalone CPU. It will not require a host CPU, because of its x86 heritage, if your application is entirely suited for an MIC architecture; unlike a Tesla, it is bootable with existing and common OSes. It can also be paired with standard Xeon processors if you would like a few strong threads alongside the 288 (72 x 4) threads the Xeon Phi provides.

And, while I doubt Intel would want to cut anyone else in, VR-Zone notes that this opens the door for AIB partners to make non-reference cards and manage some level of customer support. I'll believe a non-Intel branded AIB only when I see it.

Source: VR-Zone

Intel claims Knights Landing will slay HUMA and bare all CUDA's flaws

Subject: General Tech | November 20, 2013 - 05:53 PM |
Tagged: Xeon Phi, knights landing, Intel, 14nm

Intel has been talking up the next Xeon Phi, codenamed Knights Landing, which shall arrive in the not too distant future. This new architecture is touted to bring a return of homogeneous system architecture, performing parallel processing on its many cores (61 is the number currently being tossed around) at a level of performance that will exceed the GPU-accelerated heterogeneous architecture being pushed by AMD and NVIDIA. Whether this is true or not remains to be seen, but many server builders may prefer the familiar CPU-only architecture. As at least some of the Phis will be available in rack-mounted form and not just as add-in cards, they may choose Intel out of habit. You can also read about Micron's Automata Processor, which The Register reports can outperform a 48-chip cluster of Intel Xeon 5650s in certain scenarios.

KNOTS01.jpg

"From Intel's point of view, today's hottest trend in high-performance computing – GPU acceleration – is just a phase, one that will be superseded by the advent of many-core CPUs, beginning with Chipzilla's next-generation Xeon Phi, codenamed "Knights Landing"."

Source: The Register

Dell Unveils New T3610, T5610, and T7610 Workstations

Subject: General Tech, Systems | September 9, 2013 - 01:00 PM |
Tagged: Xeon Phi, workstation, quadro, micron, LSI, k6000, Ivy Bridge-EP, firepro, dell

Along with the release of new mobile workstations, Dell announced three new desktop workstations. Specifically, Dell is launching the T3610, T5610, and T7610 PC workstations under its Precision series. The new systems reside in redesigned cases with improved cable management, removable power supplies (tool-less, removable by sliding out from rear panel), and in the case of the T7610 removable hard drives. All of the new Precision workstations have been outfitted with Intel's latest Ivy Bridge-EP based Xeon processors, ECC memory, workstation-class graphics cards from AMD and NVIDIA, Xeon Phi accelerator card options, LSI hardware RAID controllers, and updated software solutions from Intel and Dell.

Dell Precision T3610 T5610 T7610.jpg

The new Precision workstations side-by-side. From left to right: T3610, T5610, and T7610.

Dell's Precision T3610 is the mid-tower system of the group, powered by single-socket Xeon E5-2600 v2 hardware that further supports up to 128GB DDR3 ECC memory, two graphics cards, three 3.5” hard drives, and four 2.5” SSDs.

Dell Precision T3610 Single Xeon Ivy Bridge-EP Workstation.jpg

The Precision T3610, a new single socket, mid-range workstation.

The Precision T5610 ups the ante to a dual socket IVB-EP processor system that can be configured with up to 128GB DDR3 ECC memory, two AMD FirePro or NVIDIA Quadro (e.g. Quadro K5000) graphics cards, a Tesla K20C accelerator card, three 3.5” hard drives, and four 2.5” solid state drives.

Finally, the T7610 workstation supports dual Intel Ivy Bridge-EP Xeon E5-2600 v2 series processors (up to 24 cores per system), up to 512GB DDR3 ECC memory, three graphics cards (including two NVIDIA Quadro K6000 cards), four 3.5” hard drives, and eight 2.5” SSDs.

Dell Precision T5610 Dual Xeon Ivy Bridge-EP Workstation.jpg

Dell's Precision T5610 dual socket workstation.

The new Precision workstations can also be configured with an Intel Xeon Phi 3120A accelerator card in lieu of a Tesla card. The choice will mainly depend on the applications being used and the development resources and expertise available. Both options are designed to accelerate highly parallel workloads in applications that have been compiled to support them. Further, users can add an LSI hardware RAID card with 1GB of onboard memory to the systems. Dell further offers a Micron P320h PCI-E SSD that, while not bootable, offers up to 350GB of high-performance storage that excels at sequential reads and writes.

On the software front, Dell is including the Dell Precision Performance Optimizer and the Intel Cache Acceleration Software. The former automatically configures and optimizes the workstation for specific applications based on profiles that are reportedly regularly updated. The latter works to optimize systems that use both hard drives and SSDs, with the SSDs acting as a cache for the mechanical storage. The Intel Cache Acceleration Software configures the caching algorithms to favor caching very large files on the solid state storage. It is a different approach from consumer caching strategies, but one that works well for businesses that use these workstations to process large data sets.

Dell Precision T7610 Dual Xeon Ivy Bridge-EP Workstation.jpg

The Dell Precision T7610 workstation.

The Dell workstations are aimed at businesses doing scientific analysis, professional engineering, and complex 3D modeling. The T7610 in particular is aimed at the oil and gas industry for use in simulations and modeling as companies search for new oil deposits.

All three systems will be available for purchase worldwide beginning September 12th. Some of the options, such as 512GB of ECC memory and the NVIDIA Quadro K6000 on the T7610, will not be available until next month, however. The T3610 has a starting price of $1,099, while the T5610 and T7610 have starting prices of $2,729 and $3,059 respectively.

What are your thoughts on Dell's new mid-tower workstations?

Source: Dell

The Titan's Overthrown. Tianhe-2 Supercomputer New #1

Subject: General Tech, Processors, Systems | June 27, 2013 - 02:27 AM |
Tagged: supercomputing, supercomputer, titan, Xeon Phi

The National Supercomputer Center in Guangzhou, China, will host the world's fastest supercomputer by the end of the year. The Tianhe-2, English: "Milky Way-2", is capable of nearly double the floating-point performance of Titan, albeit with slightly less performance per watt. The Tianhe-2 was developed by China's National University of Defense Technology.

tianhe-2-jack-dongarra-pdf-600x0.jpg

Photo Credit: Top500.org

Comparing the new fastest computer with the former, China's Milky Way-2 is able to achieve 33.8627 PetaFLOPs of calculations from 17.808 MW of electricity. The Titan, on the other hand, is able to crunch 17.590 PetaFLOPs with a draw of just 8.209 MW. As such, the new Milky Way-2 uses 12.7% more power per FLOP than Titan.
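For anyone who wants to check that percentage, here is the arithmetic as a short Python sketch, using only the four figures quoted above. The function name is my own.

```python
def megawatts_per_petaflop(petaflops: float, megawatts: float) -> float:
    """Power cost of each PetaFLOP of sustained performance."""
    return megawatts / petaflops


tianhe2 = megawatts_per_petaflop(33.8627, 17.808)
titan = megawatts_per_petaflop(17.590, 8.209)

# How much more power Tianhe-2 spends per FLOP, relative to Titan.
extra_percent = (tianhe2 / titan - 1) * 100
# How much faster Tianhe-2 is overall.
speedup = 33.8627 / 17.590

print(round(extra_percent, 1), round(speedup, 2))  # 12.7 1.93
```

The same inputs also give the "nearly double" performance figure, at roughly 1.93 times Titan's throughput.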

Titan is famously based on the Kepler GPU architecture from NVIDIA, coupled with several 16-core AMD Opteron server processors clocked at 2.2 GHz. This concept of using accelerated hardware carried over into the design of Tianhe-2, which is based around Intel's Xeon Phi coprocessor. If you include the simplified co-processor cores of the Xeon Phi, the new champion is the sum of 3.12 million x86 cores and 1024 terabytes of memory.

... but will it run Crysis?

... if someone gets around to emulating DirectX in software, it very well could.

Source: Top500