JavaOne 2013: GPU Is Coming Whether You Know It or Not

Subject: General Tech, Shows and Expos | September 24, 2013 - 01:38 AM |
Tagged: JavaOne, JavaOne 2013, gpgpu

Are the enterprise users still here? Oh, hey!

GPU acceleration throws a group of many similar calculations at thousands of simple cores. That architecture makes GPUs very cheap and power efficient for the amount of work they accomplish. Gamers obviously enjoy the efficiency at tasks such as shading pixels on a screen or transforming thousands of vertex positions, but the technology has evolved well beyond graphics, and enterprise and research applications have been taking notice over the years.

The GPU discussion, specifically, starts around the 16-minute mark.

Java, a friend of scientific and "big-data" developers, is also evolving in a few directions including "offload".

IBM's CTO of Java, John Duimovich, discussed a few experiments in optimizing the platform to use new hardware. Sorting arrays, a common task, saw between a 2-fold and a 48-fold increase in performance. Including the latency of moving data and initializing the GPU code, a 32,000-entry array took less than 1.5 ms to sort, compared to about 3 ms on the CPU. The sample code was programmed in CUDA.
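
IBM did not publish the sample itself, but to give a sense of what such an offloaded sort looks like, here is a minimal CUDA sketch using the Thrust library (my own illustration, not IBM's code); note that the host-to-device copy is part of the latency those figures include:

    // sort_demo.cu -- minimal Thrust sketch of a GPU array sort (not IBM's sample code).
    // Build with: nvcc -O2 sort_demo.cu -o sort_demo
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>
    #include <iostream>

    int main() {
        const int n = 32000;                          // same size as the benchmark above
        thrust::host_vector<int> h(n);
        for (int i = 0; i < n; ++i) h[i] = std::rand();

        thrust::device_vector<int> d = h;             // copy the array to the GPU
        thrust::sort(d.begin(), d.end());             // parallel sort on the device
        thrust::copy(d.begin(), d.end(), h.begin());  // copy the sorted result back

        std::cout << "first: " << h[0] << ", last: " << h[n - 1] << std::endl;
        return 0;
    }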

The goal of these tests is, as far as I can tell, to (eventually) automatically use specialized hardware for Java's many built-in libraries. The pitch is free performance. Of course, there is only so much you can get for free. Still, optimizing the few usual suspects is an obvious advantage, especially if it just translates everyday calls to existing, better-suited libraries.

Hopefully they choose to support more than just CUDA whenever they take it beyond experimentation. The OpenPOWER Consortium, responsible for many of these changes, currently consists of IBM, Mellanox, TYAN, Google, and NVIDIA.

Source: JavaOne
Manufacturer: Adobe

OpenCL Support in a Meaningful Way

Adobe has had OpenCL support since last year. You would never have benefited from its inclusion unless you ran one of two AMD mobility chips under OS X Lion, but it was there. Creative Cloud, predictably, furthers this trend with additional GPGPU support for applications like Photoshop and Premiere Pro.

This leads to some interesting points:

  • How OpenCL is changing the landscape between Intel and AMD
  • What GPU support is curiously absent from Adobe CC for one reason or another
  • Which GPUs are supported despite not... existing, officially.


This should be very big news for our readers who do production work, whether professionally or as a hobby. If not, how about a little information about certain GPUs that are designed to compete with the GeForce 700-series?

Read on for our thoughts, after the break.

GTC 2013: Fuzzy Logix Launches Tanay Rx for GPU Accelerating Analytic Models Programmed In R

Subject: General Tech | March 27, 2013 - 03:40 AM |
Tagged: GTC 2013, gpu analytics, gpgpu, fuzzy logix

Fuzzy Logix, a company that specializes in HPC data analytics, recently unveiled a new extension to its Tanay Zx library, called Tanay Rx, that will GPU accelerate analytic models written in R. R is a programming language commonly used by statisticians. It is reportedly relatively easy to program in, but it suffers from an inherent lack of multi-threading and from memory limitations. With Tanay Rx, Fuzzy Logix is hoping to combine the performance benefits of its Tanay Zx libraries with the simplicity of R programming. According to Fuzzy Logix, Tanay Rx is "the perfect prescription to cure performance issues with R."


Tanay Zx allowed many programming languages to run models on the GPU through .net, .dll, or shared object calls, and the new Tanay Rx extension extends that functionality to statistical and analytic models run using R. Supported models include data-intensive tasks such as matrix operations, Monte Carlo simulations, data mining, and financial mathematics (equities, fixed income, and time series analysis). Fuzzy Logix claims to enable R users to run over 500 analytic models 10 to 100 times faster by harnessing the parallel processing power of graphics and accelerator cards such as NVIDIA's Quadro/Tesla cards, Intel's MIC, and AMD's FirePro cards.

As an example, Fuzzy Logix states that calculations for the intra-day risk of equity, interest rate, and FX options, amounting to approximately 1 billion future scenarios, can be performed in milliseconds on the GPU. While some conversions may be more involved, certain aspects of R code can be sped up simply by replacing R functions with Fuzzy Logix's own Tanay Rx equivalents.
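
Fuzzy Logix has not published what its library does under the hood, but the option-risk workload described above is a classic fit for one-scenario-per-thread GPU code. Below is a hedged, hypothetical CUDA sketch of that pattern; the kernel and parameter names are mine, not Fuzzy Logix's:

    // mc_option.cu -- hypothetical sketch of a Monte Carlo option-payoff kernel.
    // Illustrative only, not Fuzzy Logix code. Build with: nvcc -O2 mc_option.cu
    #include <curand_kernel.h>
    #include <cmath>

    __global__ void mc_call_payoff(float *payoffs, int n, float s0, float strike,
                                   float rate, float sigma, float t,
                                   unsigned long long seed) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        curandState state;
        curand_init(seed, i, 0, &state);          // independent RNG stream per scenario

        // One-step geometric Brownian motion to expiry, then the discounted call payoff.
        float z  = curand_normal(&state);
        float st = s0 * expf((rate - 0.5f * sigma * sigma) * t + sigma * sqrtf(t) * z);
        payoffs[i] = expf(-rate * t) * fmaxf(st - strike, 0.0f);
    }

    int main() {
        const int n = 1 << 20;                    // about a million scenarios per launch
        float *d_payoffs;
        cudaMalloc(&d_payoffs, n * sizeof(float));

        mc_call_payoff<<<(n + 255) / 256, 256>>>(d_payoffs, n, 100.0f, 105.0f,
                                                 0.02f, 0.25f, 1.0f, 1234ULL);
        cudaDeviceSynchronize();

        // Averaging the payoffs (reduction omitted) yields the Monte Carlo price estimate.
        cudaFree(d_payoffs);
        return 0;
    }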

(Image: Fuzzy Logix CUDA function, as per Fuzzy Logix's website.)

Industry solutions implementing Tanay Rx for the financial, healthcare, internet marketing, pharmaceutical, oil, gas, insurance, and other sectors are available now. More information on the company's approach to GPGPU analytics is available here.

Source: Fuzzy Logix

GTC 2013: Cortexica Vision Systems Talks About the Future of Image Recognition During the Emerging Companies Summit

Subject: General Tech, Graphics Cards | March 21, 2013 - 01:44 AM |
Tagged: video fingerprinting, image recognition, GTC 2013, gpgpu, cortexica, cloud computing

The Emerging Companies Summit is a series of sessions at NVIDIA's GPU Technology Conference (GTC) that gives the floor to CEOs from several up-and-coming technology startups. Earlier today, the CEO of Cortexica Vision Systems took the stage to talk briefly about the company's products and future direction, and to answer questions from a panel of industry experts.

If you tuned into NVIDIA's keynote presentation yesterday, you may have noticed the company showing off a new image recognition technology. That technology is being developed by a company called Cortexica Vision Systems. While it cannot perform facial recognition, it is capable of identifying just about everything else, according to the company's CEO, Ian McCready. Currently, Cortexica is employing a cluster of approximately 70 NVIDIA graphics cards, but the system is capable of scaling beyond that. McCready estimates that about 100 GPUs and a CPU would be required by a company like eBay, should it want to implement Cortexica's image recognition technology in-house.


The Cortexica technology uses images captured by a camera (such as the one in your smartphone), which are then sent to Cortexica's servers for processing. The GPUs in the Cortexica cluster handle the fingerprint creation while the CPU does the actual lookup in the database of known fingerprints, either finding an exact match or returning similar image results. According to Cortexica, the fingerprint creation takes only 100 ms, and as more powerful GPUs make it into mobile devices, it may become possible to create the fingerprint on the device itself, reducing the time between taking a photo and getting relevant results back.


The image recognition technology is currently being used by eBay Motors in the US, UK, and Germany. Cortexica hopes to find a home with many of the fashion companies that would use the technology to allow people to identify, and ultimately purchase, clothing they take photos of on television or in public. The technology can also perform 360-degree object recognition, identify logos that occupy as little as 0.4% of the screen, and identify videos. In the future, Cortexica hopes to reduce latency, improve recognition accuracy, and add more search categories. Cortexica is also working on enabling an "always on" mobile device that will constantly be identifying everything around it, which is both cool and a bit creepy. With mobile chips like Logan and Parker coming in the future, Cortexica hopes to be able to do on-device image recognition, which would greatly reduce latency and allow the recognition technology to be used while not connected to the internet.


The number of photos taken is growing rapidly, with as many as 10% of all photos stored "in the cloud" having been taken last year alone. Even Facebook, with its massive data centers, is moving to a cold-storage approach to save money on the electricity costs of storing and serving up those photos. And while some of these photos have relevant metadata, the majority do not; Cortexica claims that its technology can get around that issue by identifying photos, as well as finding similar photos, using its algorithms.


Stay tuned to PC Perspective for more GTC coverage!

Additional slides are available after the break:

Too good to be true; bad coding versus GPGPU compute power

Subject: General Tech | November 23, 2012 - 06:03 PM |
Tagged: gpgpu, amd, nvidia, Intel, phi, tesla, firepro, HPC

The skeptics were right to question the huge improvements seen when using GPGPUs in a system for heavily parallel computing tasks. The cards do help a lot, but the 100x improvements that have been reported by some companies and universities had more to do with poorly optimized CPU code than with the processing power of GPGPUs. This news comes from someone you might not expect to burst this particular bubble: Sumit Gupta, the GM of NVIDIA's Tesla team, who might be trying to mitigate any possible disappointment from future customers who already have optimized CPU code and won't see the huge improvements reported by academics and other current customers. The Inquirer does point out a balancing benefit: it is obviously much easier to optimize code in CUDA, OpenCL, and other GPGPU languages than it is to code for multi-core CPUs.


"Both AMD and Nvidia have been using real-world code examples and projects to promote the performance of their respective GPGPU accelerators for years, but now it seems some of the eye popping figures including speed ups of 100x or 200x were not down to just the computing power of GPGPUs. Sumit Gupta, GM of Nvidia's Tesla business told The INQUIRER that such figures were generally down to starting with unoptimised CPU."

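To illustrate the point with my own sketch (not code from the article or from NVIDIA), the gap between a naive CPU loop and the same loop properly threaded and vectorized can be an order of magnitude on its own, which is where a good chunk of those headline figures comes from:

    // baseline.cpp -- sketch of why the CPU baseline matters; not from the article.
    // Compile the fair baseline with something like: g++ -O3 -march=native -fopenmp baseline.cpp
    #include <vector>
    #include <cstddef>

    // Naive baseline: one scalar lane on one core. Comparing a GPU against this
    // is how many of the eye-popping speed-up figures are produced.
    void saxpy_naive(float a, const std::vector<float> &x, std::vector<float> &y) {
        for (std::size_t i = 0; i < x.size(); ++i)
            y[i] = a * x[i] + y[i];
    }

    // Fairer baseline: the same loop spread across all cores and SIMD lanes.
    void saxpy_optimized(float a, const std::vector<float> &x, std::vector<float> &y) {
        #pragma omp parallel for simd
        for (std::size_t i = 0; i < x.size(); ++i)
            y[i] = a * x[i] + y[i];
    }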

Source: The Inquirer

AMD Launches Dual Tahiti FirePro S10000 Graphics Card

Subject: Graphics Cards | November 13, 2012 - 09:15 PM |
Tagged: tahiti, HPC, gpgpu, firepro s10000, firepro

On Monday, AMD launched its latest graphics card aimed at the server and workstation market. Called the AMD FirePro S10000 (for clarity, that’s FirePro S10,000), it is a dual GPU Tahiti graphics card that offers up some impressive performance numbers.

No, unfortunately, this is not the (at this point) mythical dual-7970 AMD HD 7990 graphics card. Rather, the FirePro S10,000 is essentially two Radeon 7950 GPUs on a single PCB along with 6 GB of GDDR5 memory. Specifications on the card include 3,584 stream processors, a GPU clock speed of 825 MHz, and 6 GB of GDDR5 with a total of 480 GB/s of memory bandwidth. That is 1,792 stream processors and 3 GB of memory per GPU. Interestingly, this is a dual-slot card with an active cooler. At 375 W, a passive cooler is just not possible in a form factor that fits into a server rack. Therefore, AMD has equipped the FirePro S10,000 GPGPU card with a triple-fan cooler reminiscent of the setup PowerColor uses on its custom (2x7970) Devil 13, but not as large. The FirePro card has three red fans (shrouded by a black cover) over a heatpipe-and-aluminum-fin heatsink. The card does include display outputs for workstation use, including one DVI port and four mini DisplayPort outputs.


AMD is claiming 1.48 TFLOPS in double precision work and 5.91 TFLOPS in single precision workloads. Those are impressive numbers, and the card even manages to beat NVIDIA’s new Tesla K20X, with its big Kepler GK110, and the company’s dual GPU GK104 Tesla K10 by notable margins. Additionally, the new FirePro S10000 manages to beat its FirePro S9000 predecessor handily. The S9000, in comparison, is rated at 0.806 TFLOPS for double precision calculations and 3.23 TFLOPS for single precision work. The S9000 is a single GPU card equivalent to the Radeon 7950 on the consumer side of things, with 1,792 shader cores. AMD has essentially taken two S9000 cards, put them on a single PCB, and managed to get almost twice the potential performance without needing twice the power.

Efficiency and calculations per watt were not numbers that AMD dove too deeply into, but the company did share that the new FirePro S10000 achieves 3.94 GFLOPS/W (which works out to the 1.48 TFLOPS double precision figure divided by the 375 W TDP). AMD compares this to NVIDIA’s Fermi-based Tesla M2090 at 2.96 GFLOPS/W. Unfortunately, NVIDIA has not shared a GFLOPS/W rating for its new K20X cards.

                     FirePro S10000   FirePro S9000   Tesla K20X   Tesla K10
Double Precision     1.48 TF          0.806 TF        1.31 TF      0.19 TF
Single Precision     5.91 TF          3.23 TF         3.95 TF      4.58 TF
Architecture         Tahiti (x2)      Tahiti (x1)     GK110        GK104 (x2)
TDP                  375 W            225 W           235 W        225 W
Memory Bandwidth     480 GB/s         264 GB/s        250 GB/s     320 GB/s
Memory Capacity      6 GB             6 GB            6 GB         8 GB
Stream Processors    3,584            1,792           2,688        3,070
Core Clock Speed     825 MHz          900 MHz         732 MHz      745 MHz
MSRP                 $3,599           $2,499          $3,199       ~$2,500

Other features of the AMD FirePro S10000 include support for OpenCL, Microsoft RemoteFX, Direct GPU pass-through, and (shared) virtualized graphics. AMD envisions businesses using these FirePro cards to provide GPU hardware acceleration for virtualized desktops and thin clients. With XenServer, multiple users are able to tap into the hardware acceleration offered by the FirePro S10000 to speed up the desktop and any programs that support it.

Operating systems in particular have begun tapping into GPU acceleration to speed up the user interface and run things like the Aero desktop in Windows 7. High-end workstation software also has a high GPU acceleration adoption rate, so there are benefits to be had, and AMD is continuing to offer them with its latest FirePro card.

(Image: AMD FirePro market positioning, aimed at server graphics.)

AMD is offering up a card that can be used for a mix of compute and graphics output, making it an interesting choice for workstations. The FirePro S10000’s major fault lies with its 375 W TDP; while the peak performance is respectable, the card is going to draw a lot of power while providing that compute muscle.

The cards are available now with an MSRP of $3,599. It is neat to finally see AMD come out with a dual GPU card with Tahiti chips, and it will be interesting to see what kind of design wins the company is able to get for its beastly FirePro S10000.

GPU: Finally getting recognition from Windows.

Subject: General Tech, Graphics Cards, Systems, Mobile | July 27, 2012 - 06:12 PM |
Tagged: windows 8, winRT, gpgpu

Paul Thurrott of Windows Supersite reports that Windows 8 is finally taking hardware acceleration seriously and will utilize the GPU across all applications. This hardware acceleration should make Windows 8 perform better and consume less power than the same setup running Windows 7. With Microsoft finally willing to adopt modern hardware for performance and battery life, I wonder when they will start using the GPU to accelerate tasks like file encryption.

It is painful when you have the right tool for the job but must use the wrong one.

Windows has, in fact, used graphics acceleration for quite some time albeit in fairly mundane and obvious ways. Windows Vista and Windows 7 brought forth the Windows Aero Glass look and feel. Aero was heavily reliant on Shader Model 2.0 GPU computing to the point that much of it would not run on anything less.


Washington State is not that far away from Oregon.

Microsoft is focusing their hardware acceleration efforts for Windows 8 on what they call mainstream graphics. 2D graphics and animation were traditionally CPU-based, with a handful of applications such as Internet Explorer 9, Firefox, and eventually Chrome allowing the otherwise idle GPU to lend a helping hand. As such, Microsoft is talking up Direct2D and DirectWrite usage throughout Windows 8 on a wide variety of hardware.

The driving force that neither Microsoft nor Paul Thurrott seems to directly acknowledge is battery life. Until just recently, graphics processors were considered power-hogs by almost anyone who assembles a higher-end gaming computer. Despite this, the GPU is actually more efficient than a CPU at certain tasks -- especially when you consider the GPUs that will go into WinRT devices. The GPU will help the experience be more responsive and smooth while also consuming less battery power. I guess Microsoft finally believes that the time is right to bother using what you already have.

There are many more tasks that can be GPU accelerated than just graphics -- be it 3D or the new emphasis on 2D acceleration. Hopefully, after Microsoft dips in a toe, they will take the GPU more seriously as an all-around parallel task processor. Maybe now that they are using the GPU for all applications, they will consider using it for everything -- in all applications.

Intel Introduces Xeon Phi: Larrabee Unleashed

Subject: Processors | June 19, 2012 - 03:46 PM |
Tagged: Xeon Phi, xeon e5, nvidia, larrabee, knights corner, Intel, HPC, gpgpu, amd

Intel does not respond well when asked about Larrabee. Though Intel has received a lot of bad press from the gaming community about what they were trying to do, that does not necessarily mean that Intel was wrong about how they set up the architecture. The problem with Larrabee was that it was being positioned as a consumer-level product with an eye toward breaking into the HPC/GPGPU market. For the consumer level, Larrabee would have been a disaster. Intel simply would not have been able to compete with AMD and NVIDIA for gamers’ hearts.

Larrabee’s problem in the consumer space was a matter of focus, process decisions, and die size. Larrabee is unique in that it is almost fully programmable and features only one real fixed-function unit, which handles texturing. Everything else relies on the large array of x86 processors and their attached vector units. That turns out to be very inefficient for rendering games, which is the majority of the work a consumer graphics card does. While no outlet was able to get hold of a Larrabee sample and run benchmarks on it, the general feeling was that Intel would easily be a generation behind in performance. Considering how large the die would have to be to even get to that point, it was simply not economical for Intel to produce these cards.

Xeon Phi is essentially an advanced part based on the original Larrabee architecture.

This is not to say that Larrabee does not have a place in the industry. The actual design lends itself very nicely to HPC applications. With each chip hosting many x86 processors with powerful vector units attached, these products can provide tremendous performance in HPC applications that can leverage those units. Because Intel utilized x86 processors instead of the more homogeneous designs that AMD and NVIDIA use (lots of stream units doing vector and scalar work, but no x86 units or traditional networking fabric to connect them), Intel has a leg up on the competition when it comes to programming. While GPGPU applications must work through products like OpenCL, C++ AMP, and NVIDIA’s CUDA, Intel is able to rely on the many current programming languages that already target x86. With the addition of wide vector units on each x86 core, it is relatively simple to make adjustments to utilize the new hardware, as compared to porting something over to OpenCL.

So this leads us to the Intel Xeon Phi. This is the first commercially available product based on an updated version of the Larrabee technology; the exact code name is Knights Corner. It is a new MIC (many integrated cores) product based on Intel’s latest 22 nm Tri-Gate process technology. Details are scarce on how many cores the product actually contains, but it looks to be more than 50 very basic “Pentium”-style cores; essentially low die space and in-order, all connected by a robust networking fabric that allows fast data transfer between the memory interface, the PCI-E interface, and the cores.

Each Xeon Phi promises more than 1 TFLOP of performance (as measured by Linpack). When combined with the new Xeon E5 series of processors, these products can provide a huge amount of computing power. Furthermore, with the addition of the Cray interconnect technology that Intel acquired this year, clusters of these systems could provide some of the fastest supercomputers on the market. While it will take until at least the end of this year to integrate these products into a massive cluster, it will happen, and Intel expects these products to be at the forefront of driving performance from the petascale to the exascale.

These are the building blocks that Intel hopes to use to corner the HPC market. By providing powerful CPUs and dozens, if not hundreds, of MIC units per cluster, the potential compute power should bring us to the exascale that much sooner.

Time will, of course, tell whether Intel will be successful with Xeon Phi and Knights Corner. The idea behind this product seems sound, and the addition of powerful vector units attached to simple x86 cores should make the software migration to massively parallel computing just a wee bit easier than what we are seeing now with the GPU-based products from AMD and NVIDIA. The areas where those other manufacturers have an advantage over Intel are their many years of work with educational institutions (research), software developers (gaming, GPGPU, and HPC), and industry standards groups (Khronos). Xeon Phi has a ways to go before being fully embraced by these other organizations, and its future is certainly not set in stone. We have yet to see third-party groups get hold of these products and put them to the test. While Intel CPUs are certainly class leading, we still do not know the full potential of these MIC products compared to what is currently available in the market.
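
As a rough illustration of that difference in effort (my own sketch, not Intel or NVIDIA sample code): on an x86 many-core part, an existing loop often needs little more than a hint so the compiler can spread it across the wide vector units, whereas moving the same loop to a GPU means rewriting it as a kernel and managing device memory and launches explicitly.

    // scale.cu -- illustrative comparison of the two porting paths; not vendor sample code.

    // Existing CPU loop: on a MIC-style x86 part, a pragma plus a recompile with the right
    // flags is often the bulk of the work -- the compiler targets the vector units.
    void scale_cpu(float *y, const float *x, float a, int n) {
        #pragma omp parallel for simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i];
    }

    // The same operation on a GPU: the loop body becomes a kernel, and the caller must
    // allocate device memory, copy data across PCI Express, and launch it explicitly.
    __global__ void scale_gpu(float *y, const float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i];
    }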

The one positive thing for Intel’s competitors is that it seems their enthusiasm for massively parallel computing is justified.  Intel just entered that ring with a unique architecture that will certainly help push high performance computing more towards true heterogeneous computing. 

Source: Intel

CS6 OpenCL support -- not quite hardware acceleration for all

Subject: General Tech, Graphics Cards | May 19, 2012 - 07:27 AM |
Tagged: Adobe, CS6, gpgpu

Last month, SemiAccurate reported that Adobe Creative Suite 6 would be programmed around OpenCL, which would allow any GPU to accelerate your work. Adobe now claims that, at least for the time being, OpenCL only accelerates the HD 6750M and the HD 6770M with 1 GB of vRAM, running under OS X Lion on a MacBook Pro, and only in Adobe Premiere Pro.

Does it aggravate you when something takes a while or stutters when you know a part of your PC is just idle?

Adobe has been increasingly moving to take advantage of the graphics processor available in your computer to benefit the professional behind the keyboard, mouse, or tablet. CS 5.5 pushed several of their applications onto the CUDA platform. End-users claimed that Adobe sold them out to NVIDIA, but that just seems unlikely and unlike either company. My prediction is, and always was, that NVIDIA parachuted some engineers in to Adobe and that their help was limited to CUDA.

Creative Suite 6 further suggests that I was correct as Adobe has gone back and re-authored much of those features in OpenCL.


Isn't it somewhat ironic that insanity is a symptom of mercury poisoning?

AMD as a hatter!

CS6 still will not accelerate on just any old GPU, despite the wider availability of OpenCL relative to NVIDIA's proprietary CUDA. While the CUDA whitelist currently extends to 22 NVIDIA GPUs on Windows and 3 NVIDIA GPUs on Mac OS X, current OpenCL support is limited to a pair of AMD-based OS X Lion mobile GPUs: the 6750M and the 6770M.

It would not surprise me if other GPUs could accelerate CS6 if they were manually added to the whitelist. Adobe is probably very conservative about which components it adds to the whitelist, in an effort to reduce support costs. That does not mean you will see benefits even if you trick Adobe into enabling hardware acceleration, though.

It appears as if Adobe is working towards using the most open and broad standards -- they just are doing it at their own pace this time. This release was obviously paced for Apple support.

Source: Adobe

NVIDIA urges you to program better now, not CPU -- later.

Subject: Editorial, General Tech, Graphics Cards, Processors | April 4, 2012 - 08:13 AM |
Tagged: nvidia, Intel, Knight's Corner, gpgpu

NVIDIA steals Intel’s lunch… analogy. In the process they claim that optimizing your application for Intel’s upcoming many-core hardware is not free of effort, and that effort is similar to what is required to develop on what NVIDIA already has available.

A few months ago, Intel published an article on their software blog to urge developers to look to the future without relying on the future when they design their applications. The crux of Intel’s argument states that regardless of how efficient Intel makes their processors, there is still responsibility on your part to create efficient code.


There’s always that one, in the back of the class…

NVIDIA, never a company to be afraid to make a statement, used Intel’s analogy to alert developers to optimize for many-core architectures.

The hope that unmodified HPC applications will work well on MIC with just a recompile is not really credible, nor is talking about ease of programming without consideration of performance.

There is no free lunch. Programmers will need to put in some effort to structure their applications for hybrid architectures. But that work will pay off handsomely for today’s, and especially tomorrow’s, HPC systems.

It remains to be seen how Intel MIC will perform when it eventually arrives. But why wait? Better to get ahead of the game by starting down the hybrid multicore path now.

NVIDIA thinks that Intel was correct: there will be no free lunch for developers, so why not purchase a plate at NVIDIA’s table? Who knows, after the appetizer you might want to stay around.

You cannot simply allow your program to execute on Many Integrated Core (MIC) hardware and expect it to do so well. The goal is not simply to run on the new hardware -- it is to perform efficiently by utilizing the advantages of everything that is available. It will always be up to the developer to structure their application in the appropriate way.
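
For what it is worth, "structuring for hybrid architectures" in CUDA terms usually means keeping both processors busy at once. Here is a minimal sketch of that pattern, assuming the usual double-buffering approach; it is a common idiom, not code from NVIDIA's post:

    // hybrid.cu -- sketch of overlapping CPU and GPU work with a CUDA stream.
    // A common hybrid pattern; not taken from NVIDIA's blog post.
    #include <cuda_runtime.h>

    __global__ void process_batch(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * data[i];   // stand-in for real per-element work
    }

    // Hypothetical CPU-side work standing in for preparing the next batch of input.
    static void prepare_next_batch_on_cpu(float *host_buf, int n) {
        for (int i = 0; i < n; ++i) host_buf[i] = static_cast<float>(i);
    }

    void run(float *host_cur, float *host_next, float *dev_buf, int n, cudaStream_t stream) {
        // Queue the copy of the current batch and its kernel asynchronously
        // (host buffers should come from cudaMallocHost so the copy can truly overlap).
        cudaMemcpyAsync(dev_buf, host_cur, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        process_batch<<<(n + 255) / 256, 256, 0, stream>>>(dev_buf, n);

        // While the GPU chews on the current batch, the CPU prepares the next one...
        prepare_next_batch_on_cpu(host_next, n);

        // ...and only then waits on the stream before touching the GPU's results.
        cudaStreamSynchronize(stream);
    }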

Your advantage will be to understand the pros and cons of massive parallelism. NVIDIA, AMD, and now Intel have labored to create a variety of architectures to suit this aspiration; software developers must labor in a similar way on their end.

Source: NVIDIA Blogs