Intel Launches Stratix 10 FPGA With ARM CPU and HBM2

Subject: Processors | October 10, 2016 - 02:25 AM |
Tagged: SoC, Intel, FPGA, Cortex A53, arm, Altera

Intel and the recently acquired Altera have launched a new FPGA product built on Intel’s 14nm Tri-Gate process that combines an ARM CPU, an FPGA with 5.5 million logic elements, and HBM2 memory in a single package. The Stratix 10 is aimed at data center, networking, and radar/imaging customers.

The Stratix 10 is an Altera-designed FPGA (field programmable gate array) with 5.5 million logic elements and a new HyperFlex architecture that optimizes registers, pipelining, and critical paths (feed-forward designs) to raise core performance and offer up to five times the logic density of previous products. Further, the upcoming FPGA SoC can reportedly run at twice the core performance of Stratix V or use up to 70% less power than its predecessor at the same performance level.

The increases in logic density, clock speed, and power efficiency come from a combination of the improved architecture and Intel’s 14nm FinFET (Tri-Gate) manufacturing process.

Intel rates the FPGA at 10 TFLOPS of single precision floating point DSP performance and 80 GFLOPS/watt.
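
As a rough sanity check, the two ratings together imply a power figure. The sketch below simply divides peak throughput by the quoted efficiency. The function name is illustrative, and it assumes both ratings hold at the same operating point, which Intel has not stated, so treat the result as a ballpark rather than a TDP.

```python
def implied_power_watts(peak_flops: float, flops_per_watt: float) -> float:
    """Power implied by a throughput rating and an efficiency rating."""
    if flops_per_watt <= 0:
        raise ValueError("flops_per_watt must be positive")
    return peak_flops / flops_per_watt


TERA = 1e12
GIGA = 1e9

# Quoted figures: 10 TFLOPS single precision, 80 GFLOPS/watt.
# Assumption: peak throughput and peak efficiency occur together.
stratix10_watts = implied_power_watts(10 * TERA, 80 * GIGA)
print(f"Implied power: {stratix10_watts:.0f} W")
```

If the efficiency figure was measured at a lower clock than the 10 TFLOPS peak, real power draw at full throughput would be higher than this estimate.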

Interestingly, Intel is using an ARM processor to feed data to the FPGA chip rather than its own Quark or Atom processors. Specifically, the Stratix 10 uses an ARM CPU with four Cortex A53 cores as well as four stacks of on-package HBM2 memory with 1TB/s of bandwidth to feed data to the FPGA. There is also a “secure device manager” to ensure data integrity and security.
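
For a sense of where the 1TB/s figure could come from, here is a small sketch of the per-stack arithmetic. It assumes each stack runs at the HBM2 maximum of 2 Gb/s per pin over a 1024-bit interface; the article does not give the actual pin speed Intel is using, and the function name is made up for illustration.

```python
def hbm_stack_bandwidth_gbs(bus_width_bits: int, gbit_per_pin: float) -> float:
    """Peak bandwidth of one stack in GB/s: pins * Gb/s per pin / 8 bits per byte."""
    return bus_width_bits * gbit_per_pin / 8


STACKS = 4  # from the article: four on-package HBM2 stacks

# Assumption: 1024-bit interface at 2 Gb/s per pin (HBM2 top speed).
per_stack = hbm_stack_bandwidth_gbs(1024, 2.0)
total = STACKS * per_stack
print(f"{per_stack:.0f} GB/s per stack, {total:.0f} GB/s total")
```

Under those assumptions four stacks land at roughly 1TB/s, which is consistent with the quoted aggregate bandwidth.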

The Stratix 10 is aimed at data centers and will be used in specialized tasks that demand high throughput and low latency. According to Intel, the chip is a good candidate for use as a co-processor to offload and accelerate encryption/decryption, compression/decompression, or Hadoop tasks. It can also be used to power specialized storage controllers and networking equipment.

Intel has started sampling the new chip to potential customers.

In general, FPGAs are great at highly parallelized workloads and are able to efficiently take huge amounts of inputs and process the data in parallel through custom programmed logic gates. An FPGA is essentially a program in hardware that can be rewired in the field (though depending on the chip it is not necessarily a “fast” process and it can take hours or longer to switch things up, heh). These processors are used in medical and imaging devices, high frequency trading hardware, networking equipment, signal intelligence (cell towers, radar, guidance, etc.), bitcoin mining (though ASICs stole the show a few years ago), and even password cracking. They can be almost anything you want, which gives them an advantage over traditional CPUs and graphics cards, though cost and increased coding complexity can be prohibitive.
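
To make the “program in hardware” idea a bit more concrete, here is a toy model of a lookup table (LUT), the kind of small configurable truth table that FPGA logic is typically built from. This is only an illustrative sketch: the class and method names are made up, and it is not a model of how the Stratix 10's logic elements are actually organized.

```python
from itertools import product


class LookupTable:
    """Toy k-input logic element modeled as a truth table (hypothetical)."""

    def __init__(self, num_inputs, truth_table):
        self.num_inputs = num_inputs
        self.reprogram(truth_table)

    def reprogram(self, truth_table):
        # "Rewiring" the element is just loading a different truth table.
        if len(truth_table) != 2 ** self.num_inputs:
            raise ValueError("truth table must have 2**num_inputs entries")
        self.truth_table = list(truth_table)

    def evaluate(self, *bits):
        if len(bits) != self.num_inputs:
            raise ValueError("wrong number of input bits")
        # Treat the input bits as a binary address (first bit is the MSB).
        index = 0
        for bit in bits:
            index = (index << 1) | (bit & 1)
        return self.truth_table[index]


inputs = list(product((0, 1), repeat=2))

lut = LookupTable(2, [0, 0, 0, 1])  # configured as a 2-input AND gate
and_outputs = [lut.evaluate(a, b) for a, b in inputs]

lut.reprogram([0, 1, 1, 0])  # same element, now a 2-input XOR gate
xor_outputs = [lut.evaluate(a, b) for a, b in inputs]

print(and_outputs, xor_outputs)
```

The same element behaves as a different gate after its table is reloaded, which is the basic reason one chip can stand in for many kinds of fixed-function hardware.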

The Stratix 10 stood out as interesting to me because of its claimed 10 TFLOPS of single precision performance, which is reportedly the important metric when it comes to training neural networks. In fact, Microsoft recently began deploying FPGAs across its Azure cloud computing platform and plans to build the “world’s fastest AI supercomputer.” The Redmond-based company’s Project Catapult saw the company deploy Stratix V FPGAs to nearly all of its Azure datacenters, and it is using the programmable silicon as part of an “acceleration fabric” in its “configurable cloud” architecture that will be used initially to accelerate the company’s Bing search and AI research efforts and later by independent customers for their own applications.

It is interesting to see Microsoft going with FPGAs, especially as efforts to use GPUs for GPGPU and neural network training and inferencing duties have increased so dramatically over the years (with NVIDIA being the one pushing the latter). It may well be a good call on Microsoft’s part, as it could enable better performance, and researchers would be able to code their AI accelerator platforms down to the gate level to really optimize things. Using higher level languages and cheaper hardware with GPUs does have a lower barrier to entry though. I suppose it will depend on just how much Microsoft is going to charge customers to use the FPGA-powered instances.

FPGAs are in kind of a weird middle ground and while they are definitely not a new technology, they do continue to get more complex and powerful!

What are your thoughts on Intel's new FPGA SoC?

Source: Intel

October 10, 2016 | 03:03 AM - Posted by Anonymous (not verified)

I'm less surprised by the ARM cores, and more by the use of HBM: Intel and Micron are developing HMC, and HMC-derived memory is in use inside the Knights Landing package (the on-package MCDRAM). The primary difference between HMC and HBM is that HBM requires the use of an expensive silicon interposer, which also limits total die size. HMC uses a regular PCB to carry interconnects, so the packaging is much cheaper and more scalable, at the expense of the HMC modules themselves having a more complex IO plane.

October 10, 2016 | 08:21 AM - Posted by BlackDove (not verified)

Same here. Didn't expect to see HBM on anything Intel.

October 10, 2016 | 12:08 PM - Posted by Dvon-E (not verified)

Perhaps this product needed the word-width and asynchronous I/O of HBM over the serialized streams of HMC, despite the cost of NIH (not invented here.) Differences between Wide IO, HBM, and HMC

October 10, 2016 | 11:15 AM - Posted by Anonymous (not verified)

It's for the server market, and that market can afford silicon interposers. Also, the more silicon interposers (and related IP) are used, the cheaper they will become. And it's not the silicon interposer itself that is so expensive; it's the packaging process for attaching the various dies to the interposer that costs. But HBM2 can be clocked much lower, with the wide parallel traces allowing for higher effective bandwidth at the lower, power-saving clocks that the server market likes. So the server/HPC markets will help pay down the R&D and engineering costs of HBM2 and newer HBM technology for all the markets that make use of it. They are also working on organic/polymer interposer technology to allow for more savings. The server market has the volume to help bring silicon interposer costs down more quickly through economies of scale.

Intel went with HBM2 because HBM2 is what more of the market is using, so the costs will be less all around for Intel/others to use the JEDEC standard HBM2 technology. The server market is becoming very competitive and Intel has to have the lowest costs possible. Intel has to compete with Power8/Power9 and Nvidia’s GPU accelerators and AMD’s Zen HPC/server CPUs and new Zen HPC/server APUs. AMD’s Server/HPC APUs will also come with FPGAs on the HBM2 stacks, to augment the on interposer based Zen and Vega GPU die. So Intel will have to compete and even try to get some FPGA sales business from both IBM and maybe AMD/Others.

October 13, 2016 | 07:24 PM - Posted by Anonymous (not verified)

Nvidia already has the Tesla P100 using HBM2, but AMD products using HBM2 are still nonexistent. AMD has not announced the use of FPGAs in any of its products, as it is concentrating on GPUs instead (like Nvidia). Also, a Zen-based APU is still very far away, mainly in a vaporware state, and unlikely to use HBM2.

October 10, 2016 | 03:06 AM - Posted by Anonymous (not verified)

"Intel rates the FPGA at 10 TFLOPS of single precision floating point DSP performance and 80 FLOPS/watt."
So it draws 125 Gigawatts?

October 10, 2016 | 05:34 AM - Posted by Katalmach (not verified)

Yeap. An LMS100 (116MW output) isn't enough to power this consuming monster...

October 10, 2016 | 05:36 AM - Posted by Katalmach (not verified)

edit: I just noticed the "giga"watts...
So you need 1078 of these, lol...

October 12, 2016 | 12:42 AM - Posted by Tim Verry

Heh, 80 GFLOPS per watt of single precision floating point performance. 80 FLOPS/watt was a typo. Sorry about that!

EDIT: You got me thinking about the TDP, which it doesn't look like they outright stated. But if it's 10 TFLOPS and 80 GFLOPS/watt, then the TDP is 125W? Does that math sound correct?

October 13, 2016 | 03:36 PM - Posted by Stefem (not verified)

Probably. Especially given the nature of that processor, the FLOPS/watt value has likely been obtained in a best-case scenario, in other words while not delivering the full 10 TFLOPS.
For example, the NVIDIA GP104-based Tesla P4 reaches a similar level of power efficiency by sacrificing FLOPS for lower power consumption.

October 10, 2016 | 08:56 AM - Posted by patrickjp93 (not verified)

Would you please edit this article to the correct 80GFlops/Watt metric?

October 10, 2016 | 10:45 AM - Posted by Anonymous (not verified)

So these have ASIC like FP units that are available to the FPGA logic.

Here is an EE Times article (1) from 2014 by someone from Altera. It quotes the very same 10,000 GFLOPS, and I guess that is where Intel has the license for the ARM cores. Intel now owns Altera. I guess that Intel has made this using a smaller process node so the power savings are there. It's great that Intel is using HBM2. The more products that make use of HBM2, the lower the cost of HBM2 will become, with the economy of scale allowing the engineering and R&D costs to be amortized more quickly. The more products using HBM2, the better for lower cost HBM2 for GPU usage, so maybe there can be more than just flagship GPU usage of HBM2 in the future for the GPU and APU/SoC markets.

"Understanding Peak Floating-Point Performance Calculations"
Michael Parker, Altera
10/20/2014 02:03 PM EDT

October 10, 2016 | 10:52 AM - Posted by Anonymous (not verified)

It looks like the link generates an error, so you will have to Google the article's title, "Understanding Peak Floating-Point Performance Calculations", to find the article. I cannot figure out why the link will not work.

October 12, 2016 | 12:48 AM - Posted by Tim Verry

Hmm the link works for me...

October 10, 2016 | 03:18 PM - Posted by Anonymous (not verified)

"interestingly, Intel is using an ARM processor to feed data to the FPGA chip rather than its own Quark or Atom processors."

The question is whether Intel uses quad-core Cortex A53 because:
a) it's more energy/performance efficient
b) it's more cost/performance efficient
c) Both "a" and "b" . It's more cost/energy/performance efficient than Intel Atom/Quark

October 10, 2016 | 04:58 PM - Posted by Anonymous (not verified)

D) Altera in 2014 was using the ARM processor in their version of the same FPGA product and Intel used that version with a process node shrink to get something on the market sooner rather than later. See the EE times article linked to above! And it's using an ARM A53 reference design core also. So Intel did not have the time to get its x86 version certified to work with the Altera IP, and Altera’s ARM license is now Intel’s ARM license. I’ll bet that Intel sticks with the ARM version for a while longer because Intel does not have enough time to get something to market to compete with Nvidia’s GPU accelerated competition. It would take too much time to develop the software and configure the Altera FPGA products to work with x86 based cores and certify a new from the ground up product, time that Intel does not have at the moment.

P.S. Altera’s clients are now Intel's clients so Intel may be contractually bound to support the ARM version for some time depending on what contracts existed before Intel purchased Altera.
