Altera Does FPGAs with OpenCL

Subject: General Tech, Graphics Cards | October 16, 2013 - 10:00 PM |
Tagged: FPGA, Altera

(Update 10/17/2013, 6:13 PM) Apparently I messed up inputting this into the website last night. To compare FPGAs with current hardware, the Altera Stratix 10 is rated at more than 10 TeraFLOPs compared to the Tesla K20X at ~4 TeraFLOPs or the GeForce Titan at ~4.5 TeraFLOPs. All figures are single precision. (end of update)

Field Programmable Gate Arrays (FPGAs) are not general purpose processors; they are not designed to perform any random instruction at any random time. If you have a specific set of instructions that you want performed efficiently, you can spend a couple of hours compiling your function(s) to an FPGA which will then be the hardware embodiment of your code.

This is similar to an Application-Specific Integrated Circuit (ASIC) except that, for an ASIC, it is the factory that bakes your application into the hardware. Many (actually, to my knowledge, almost all) FPGAs can even be reprogrammed if you can spare those few hours to configure them again.

Altera is a manufacturer of FPGAs. They are one of the few companies who were allowed access to Intel's 14nm fabrication facilities. Rahul Garg of Anandtech recently published a story which discussed compiling OpenCL kernels to FPGAs using Altera's compiler.

Now this is pretty interesting.

The design of OpenCL splits work between "host" and "kernel". The host application is written in some arbitrary language and follows typical programming techniques. Occasionally, the application will run across a large batch of instructions. A particle simulation, for instance, will require position information to be computed. Rather than having the host code loop through every particle and perform some complex calculation, what happens to each particle could be "a kernel" which the host adds to the queue of some accelerator hardware. Normally, this is a GPU with its thousands of cores chunked into groups of usually 32 or 64 (vendor-specific).
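As a rough illustration of that host/kernel split, here is a minimal pure-Python sketch. It does not use OpenCL at all: `step_particle`, `enqueue`, and the group size of 32 are hypothetical names and values picked for illustration, and the "complex calculation" is assumed to be a simple position update.

```python
from typing import Callable, List


def step_particle(i: int, pos: List[float], vel: List[float], dt: float) -> None:
    """Hypothetical "kernel": the work done for exactly one particle (work-item i)."""
    pos[i] += vel[i] * dt


def enqueue(kernel: Callable[..., None], global_size: int, group_size: int, *args) -> int:
    """Hypothetical stand-in for the host handing a kernel to an accelerator.

    Real hardware would run the items of a group in parallel; this sketch
    walks them sequentially and returns how many groups were processed.
    """
    groups = 0
    for start in range(0, global_size, group_size):
        for i in range(start, min(start + group_size, global_size)):
            kernel(i, *args)
        groups += 1
    return groups


# Host side: ordinary code that sets up the data and queues one kernel call.
positions = [0.0, 1.0, 2.0, 3.0]
velocities = [1.0, 1.0, 2.0, 2.0]
groups_run = enqueue(step_particle, len(positions), 32, positions, velocities, 0.5)
```

The point of the sketch is only the shape: the host never loops over particles itself, it describes the per-particle work once and lets something else decide how to chunk it.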

An FPGA, on the other hand, can lock itself to the specific set of instructions. Within a few hours, it can be configured with some arbitrary number of compute paths and then just churn through each kernel call until it is finished. The compiler knows exactly what work the FPGA will need to perform, while the host code runs on the CPU.

This is obviously designed for enterprise applications, at least as far into the future as we can see. Current models are apparently priced in the thousands of dollars but, as the article points out, have the potential to outperform a 200W GPU at just a tenth of the power. This could be very interesting for companies, perhaps a film production house, that want to install accelerator cards for sub-d surfaces or ray tracing but would like to develop the software in-house and occasionally update their code after business hours.

Regardless of the potential market, an FPGA-based add-in card simply makes sense for OpenCL and its architecture.

Source: Anandtech

October 17, 2013 | 06:35 AM - Posted by AMDbumlover (not verified)

hey scott, will you try to get a press copy to implement your 3d stuff?

October 17, 2013 | 09:42 AM - Posted by Scott Michaud

Lol, that probably would not work. This generation apparently supports OpenCL 1.0 while WebCL requires OpenCL 1.1. They are talking about OpenCL 2.0 support soon so they might miss WebCL (until/unless we get WebCL2 at some point) entirely.

Also, WebCL does not support binary programs (for security reasons) so you would probably need to wait a couple of hours just to start the program (IIRC every time you hit F5).

Although that does not mean both technologies won't evolve in the same direction. This sort of thing is the main reason why I talk up the benefits of, "Going back to the fundamental math". Math does not change.

October 18, 2013 | 11:20 PM - Posted by Anonymous (not verified)

Hey, if you love fundamental math so much, limit yourself to functional programming. All the power and flexibility of abstract mathematics, with all the intuitiveness and ease of use of abstract mathematics! LOLOLOLOL

Seriously tho Scott, hurry up on your project, I can't wait to see when you finally figure out *WHY* DX and OpenGL evolved the way they did. DX and OGL are APIs for ASICs. Application Specific Integrated Circuits. Also known generically as "hardware accelerators".

You're not discovering a new paradigm Scott, you're reinventing the wheel, and not the modern wheel, the old wooden cart wheel. You're rediscovering software rendering, except you're doing it in an OpenCL environment instead of good old fashioned 1-core x86 environment.

General purpose computing hardware will never ever be as fast as application specific hardware. Never ever. That's kind of how the whole "graphics accelerator" thing came to be. You know, the modern GPU.

October 19, 2013 | 03:25 AM - Posted by Scott Michaud

You are absolutely correct: Application-Specific Integrated Circuits (ASICs) are far more efficient than general purpose units. The FPGAs mentioned in this article have about 1/10th the power draw for about double the performance of a modern GPU (and they aren't even ASICs). Also, by converting these algorithms to general purpose code, I am taking away some silicon (which would otherwise be free) to perform calculations which should be baked in some hardware.

You'll also notice I have never claimed speed as an advantage.

The reason for this project is control. The reason is to go back to the fundamental math and see how the fixed-functionality helps us and how it gets in the way.

But, by far, the biggest problem is: "What is the performance hit?" Pretend that the fixed function hardware is 100x more efficient than general purpose code. Great. What if your specific application is *massively* shader-bound, with 95% of overall computation burdened to the compute cores? Okay, so the ASICs saved 99% of the 5% hit to real performance (numbers completely pulled out of butt to illustrate budgeting). What if it is 10% of real performance? 1%? 50%? You kind-of need to quantify that before you can start making broad claims that it'll be a horrible experience. Then, what did you compromise for that gain?
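That budgeting can be sketched in a few lines of Python. This is one possible reading of the numbers above, not a measurement: it assumes `software_share` is the fraction of total time the formerly fixed-function work would take when run as general purpose code, and that the ASIC does the same work `speedup` times faster. The function name is made up for illustration.

```python
def slowdown_without_fixed_function(software_share: float, speedup: float = 100.0) -> float:
    """How much slower the all-software version is than the ASIC-assisted one.

    Assumption: total software time is normalized to 1.0, and the ASIC only
    shrinks the `software_share` portion by a factor of `speedup`.
    """
    if not 0.0 <= software_share <= 1.0:
        raise ValueError("software_share must be between 0 and 1")
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    time_with_asic = (1.0 - software_share) + software_share / speedup
    return 1.0 / time_with_asic


# The made-up shares from the paragraph above: 1%, 5%, 10%, and 50%.
for share in (0.01, 0.05, 0.10, 0.50):
    factor = slowdown_without_fixed_function(share)
    print(f"{share:.0%} share: {factor:.3f}x slower without the ASIC")
```

Under those assumptions, a 5% share works out to roughly a 1.05x slowdown, while a 50% share is close to 2x, which is why the share needs to be quantified first.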

And again, I'm not even talking about using higher-performance data structures and algorithms for your specific application, although that is possible. What about higher-quality algorithms? Are you limiting the platforms on which your application will run? Are you limiting what hardware you can access? How you can access it? Complexity? Bugs and glitches you may or may not have control over?

It's a project.