AMD Introduces Radeon Pro SSG: A Professional GPU Paired With Low Latency Flash Storage (Updated)

Subject: Graphics Cards | July 27, 2016 - 01:56 AM |
Tagged: solid state, radeon pro, Polaris, gpgpu, amd

UPDATE (July 27th, 1am ET):  More information on the Radeon Pro SSG has surfaced since the original article. According to AnandTech, the prototype graphics card actually uses an AMD Fiji GPU. The Fiji GPU is paired with onboard PCI-E based storage using the same PEX8747 bridge chip used in the Radeon Pro Duo. Storage is handled by two PCI-E 3.0 x4 M.2 slots that can accommodate up to 1TB of NAND flash storage. As I mentioned below, having the storage on board the graphics card vastly reduces latency by cutting the number of hops and not having to send requests out to the rest of the system. AMD had more numbers to share following their demo, however.

From the 8K video editing demo, the dual Samsung 950 Pro PCI-E SSDs (in RAID 0) on board the Radeon Pro SSG hit 4GB/s while scrubbing through the video. The same video source stored on a Samsung 950 Pro attached to the motherboard managed only 900MB/s. In theory, reaching out to system RAM still has raw throughput advantages (DDR4 @ 3200 MHz on a Haswell-E platform is theoretically capable of 62 GB/s reads and 47 GB/s writes), though that would be bottlenecked by the graphics card having to go over the PCI-E 3.0 x16 link and its maximum of 15.754 GB/s. Of course, if you can hold the data in (much smaller) GDDR5 (300+ GB/s depending on clocks and memory bus width) or HBM (1TB/s) and not have to go out to any other storage tier, that is ideal, but it is not always feasible, especially in the HPC world.
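As a quick sanity check on those link-speed ceilings, here is a back-of-envelope sketch in Python (the per-lane figure assumes PCI-E 3.0's 8 GT/s signaling rate with 128b/130b encoding; these are theoretical peaks, not measured numbers):

```python
# Back-of-envelope PCI-E ceilings referenced above (theoretical, not measured).
PCIE3_GT_PER_S = 8e9            # 8 GT/s per lane
ENCODING = 128 / 130            # 128b/130b encoding overhead

lane_gbs = PCIE3_GT_PER_S * ENCODING / 8 / 1e9   # ~0.985 GB/s usable per lane
x16_gbs = 16 * lane_gbs                          # ~15.75 GB/s, the GPU's link to system RAM
x4_gbs = 4 * lane_gbs                            # ~3.94 GB/s per onboard M.2 slot
raid0_gbs = 2 * x4_gbs                           # ~7.9 GB/s link ceiling for the two SSG drives

print(f"x16: {x16_gbs:.3f} GB/s, x4: {x4_gbs:.2f} GB/s, 2x M.2 RAID 0: {raid0_gbs:.2f} GB/s")
```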

However, having onboard storage on the same board as the GPU, only a single "hop" away, vastly reduces latency and offers much more total storage space than most systems have in DDR3 or DDR4. In essence, the solid state storage on the graphics card (which developers will need to specifically code for) acts as a massive cache for streaming in assets for data sets and workloads that are highly impacted by latency. This storage is not the fastest, but it is the next best thing for holding active data outside of GDDR5/X or HBM. For throughput-intensive workloads, reaching out to system RAM will still be better, and reaching out to system-attached storage should be the last resort, as it will be the slowest and highest latency option. Several commenters mentioned using a PCI-E based SSD in a second slot on the motherboard, accessed much like GPUs in CrossFire communicate now (DMA over the PCI-E bus), which is an interesting idea that I had not considered.
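To make the tiering concrete, here is a minimal sketch of how an application might decide where a working set lives. Every name and number in it is illustrative (capacities and bandwidths are rough placeholders based on the figures discussed above), not an actual SSG API:

```python
# Hypothetical storage tiers, ordered roughly by distance from the GPU.
# Capacities and bandwidths are placeholder figures for illustration only.
from collections import OrderedDict

TIERS = OrderedDict([
    ("HBM / GDDR5 (on GPU)",        {"capacity_gb": 4,    "approx_gbs": 512}),
    ("Onboard SSG flash",           {"capacity_gb": 1024, "approx_gbs": 4}),
    ("System RAM (over PCI-E x16)", {"capacity_gb": 64,   "approx_gbs": 15.75}),
    ("System SSD",                  {"capacity_gb": 2048, "approx_gbs": 0.9}),
])

def place_working_set(size_gb):
    """Pick the closest tier to the GPU that can hold the entire working set."""
    for name, spec in TIERS.items():
        if size_gb <= spec["capacity_gb"]:
            return name, spec
    return "System SSD", TIERS["System SSD"]

for size in (2, 200, 1500):
    tier, spec = place_working_set(size)
    print(f"{size:>5} GB working set -> {tier} (~{spec['approx_gbs']} GB/s)")
```

A real application would also weigh throughput against latency per workload, as discussed above; this only shows the capacity side of the decision.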

Per my understanding of the situation, I think that the onboard SSG storage would still be slightly more beneficial than this setup, but it would get you close. (I am assuming the GPU would be able to directly interact with and request data from the SSD controller without having to rely on the system CPU to do this work, but I may well be mistaken; I will have to look into this further and ask the experts, heh.) On the prototype Radeon Pro SSG, the M.2 slots can actually be seen as drives by the system and OS, so it is essentially acting as if there were a PCI-E adapter card in a motherboard slot holding those drives, but that may not be the case should this product actually hit the market. I do question their choice to go with Fiji rather than Polaris, but it sounds like they built the prototype off of the Radeon Pro Duo platform, so I suppose it makes sense there.

Hopefully the final versions in 2017 or beyond use at least Vega though :).

Alongside the launch of the new Radeon Pro WX (workstation) series graphics cards, AMD teased an interesting new Radeon Pro product: the Radeon Pro SSG. This new professional graphics card pairs a Polaris GPU with up to a terabyte of onboard solid state storage and seeks to solve one of the biggest hurdles in GPGPU performance when dealing with extremely large datasets: latency.


One of the core focuses of AMD's HSA (Heterogeneous System Architecture) is unified memory: the ability of various processors (CPU, GPU, specialized co-processors, et al) to work together efficiently by accessing and manipulating data from the same memory pool without having to copy data back and forth between CPU-accessible memory and GPU-accessible memory. With the Radeon Pro SSG, this idea is not fully realized (it is more of a sidestep), but it does move performance forward. It does not eliminate the need to copy data to the GPU before the GPU can work on it, but once copied, the GPU will be able to work on data stored in what AMD describes as a one terabyte frame buffer. This memory will be solid state and very fast, but more importantly, the GPU will be able to get at the data with much lower latency than previous methods. AMD claims the solid state storage (likely NAND, though they have not said) will link with the GPU over a dedicated PCI-E bus. I suppose that if you can't bring the GPU to the data, you bring the data to the GPU!

Considering AMD's previous memory champ, the FirePro W9100, maxed out at 32GB of GDDR5, the teased Radeon Pro SSG with its 1TB of purportedly low latency onboard flash storage opens up a slew of new possibilities for researchers and professionals in media, medical, and scientific roles working with massive datasets for imaging, content creation, and simulations. I expect there are many professionals out there eager to get their hands on one of these cards, and they will be able to thanks to a beta program launching shortly, so long as they have $10,000 for the hardware!

AMD gave a couple of examples in its PR on the potential benefits of its "solid state graphics," including imaging a patient's beating heart in real time so medical professionals can examine and spot issues as early as possible, and using the Radeon Pro SSG to edit and scrub through 8K video in real time at 90 FPS versus 17 FPS with current offerings. On the scientific side of things, being able to load entire models into the new graphics memory (not as low latency as GDDR5 or HBM, certainly) will be a boon, as will getting data sets as close to the GPU as possible in servers running GPU-accelerated databases that power websites accessed by millions of users.

It is not exactly the HSA future I have been waiting for ever so impatiently, but it is a nice advancement and an intriguing idea, and I am very curious to see how well it pans out and whether developers and researchers will truly take advantage of it to further their projects. I suspect something like this could be great for deep learning tasks as well (such as powering the "clouds" behind self driving cars, perhaps).

Stay tuned to PC Perspective for more information as it develops.

This is definitely a product that I will be watching and I hope that it does well. I am curious what Nvidia's and Intel's plans are here as well! What are your thoughts on AMD's "Solid State Graphics" card? All hype or something promising?

Source: AMD

James Reinders Leaving Intel and What It Means

Subject: Processors | June 8, 2016 - 08:17 AM |
Tagged: Xeon Phi, Intel, gpgpu

Intel's recent restructuring had a much broader impact than I originally believed. Beyond the large number of employees who will lose their jobs, we're even seeing it affect other areas of the industry. Typically, ASUS releases its ZenFone line with x86 processors, which I assumed was based on big subsidies from Intel to push their instruction set into new product categories. This year, ASUS chose the ARM-based Qualcomm Snapdragon, which looks to me like Intel decided to stop the bleeding.


That brings us to today's news. After over 27 years at Intel, James Reinders has accepted the company's early retirement offer, scheduled for his 10,001st day with the company, and will step down from his position as Intel's High Performance Computing Director. He worked on the Larrabee and Xeon Phi initiatives, and published several books on parallelism.

According to his letter, it sounds like the retirement offer was part of a company-wide package, not something targeting his division specifically. That would sort of make sense, because Intel is focusing on cloud and IoT. Xeon Phi is an area where Intel is battling NVIDIA for high-performance servers, and I would expect that it has potential for cloud-based applications. Then again, as I say that, AWS only has a handful of GPU instances, and they are running fairly old hardware at that, so maybe the demand isn't there yet.

AMD Brings Dual Fiji and HBM Memory To Server Room With FirePro S9300 x2

Subject: Graphics Cards | April 5, 2016 - 02:13 AM |
Tagged: HPC, hbm, gpgpu, firepro s9300x2, firepro, dual fiji, deep learning, big data, amd

Earlier this month, AMD launched a dual Fiji powerhouse for VR gamers that it is calling the Radeon Pro Duo. Now, AMD is bringing its latest GCN architecture and HBM memory to servers with the dual GPU FirePro S9300 x2.


The new server-bound professional graphics card packs an impressive amount of compute hardware into a dual-slot card with passive cooling. The FirePro S9300 x2 combines two full Fiji GPUs clocked at 850 MHz, for a total of 8,192 cores, 512 texture units, and 128 ROPs. Each GPU is paired with 4GB of non-ECC HBM memory on package with 512GB/s of memory bandwidth, which AMD adds together to advertise this as the first professional graphics card with 1TB/s of memory bandwidth.

Due to lower clockspeeds, the S9300 x2 has less peak single precision compute performance than the consumer Radeon Pro Duo: 13.9 TFLOPS versus 16 TFLOPS on the desktop card. Businesses will be able to cram more cards into their rack mounted servers, though, since they do not need to worry about mounting locations for the sealed loop water cooling of the Radeon card.
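For reference, the peak figures quoted here and in the table below fall straight out of the usual cores x 2 FLOPs per clock (one fused multiply-add) x clock formula; a quick sketch in Python:

```python
# Peak FP32 throughput = shader cores x 2 FLOPs per clock (one FMA) x clock.
def peak_fp32_tflops(cores, clock_mhz):
    return cores * 2 * clock_mhz * 1e6 / 1e12

print(peak_fp32_tflops(8192, 850))    # FirePro S9300 x2: ~13.9 TFLOPS
print(peak_fp32_tflops(4096, 1050))   # R9 Fury X: ~8.6 TFLOPS
# The Radeon Pro Duo's advertised 16 TFLOPS works back to roughly a 1 GHz
# sustained clock rather than its full 1050 MHz boost rating.
print(peak_fp32_tflops(8192, 1000))   # ~16.4 TFLOPS
```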

|                  | FirePro S9300 x2       | Radeon Pro Duo         | R9 Fury X      | FirePro S9170 |
|------------------|------------------------|------------------------|----------------|---------------|
| GPU              | Dual Fiji              | Dual Fiji              | Fiji           | Hawaii        |
| GPU Cores        | 8192 (2 x 4096)        | 8192 (2 x 4096)        | 4096           | 2816          |
| Rated Clock      | 850 MHz                | 1050 MHz               | 1050 MHz       | 930 MHz       |
| Texture Units    | 2 x 256                | 2 x 256                | 256            | 176           |
| ROP Units        | 2 x 64                 | 2 x 64                 | 64             | 64            |
| Memory           | 8GB (2 x 4GB)          | 8GB (2 x 4GB)          | 4GB            | 32GB ECC      |
| Memory Clock     | 500 MHz                | 500 MHz                | 500 MHz        | 5000 MHz      |
| Memory Interface | 4096-bit (HBM) per GPU | 4096-bit (HBM) per GPU | 4096-bit (HBM) | 512-bit       |
| Memory Bandwidth | 1TB/s (2 x 512GB/s)    | 1TB/s (2 x 512GB/s)    | 512 GB/s       | 320 GB/s      |
| TDP              | 300 watts              | ?                      | 275 watts      | 275 watts     |
| Peak Compute     | 13.9 TFLOPS            | 16 TFLOPS              | 8.60 TFLOPS    | 5.24 TFLOPS   |
| Transistor Count | 17.8B                  | 17.8B                  | 8.9B           | 8.0B          |
| Process Tech     | 28nm                   | 28nm                   | 28nm           | 28nm          |
| Cooling          | Passive                | Liquid                 | Liquid         | Passive       |
| MSRP             | $6000                  | $1499                  | $649           | $4000         |

AMD is aiming this card at datacenter and HPC users working on "big data" tasks that do not require the accuracy of double precision floating point calculations. Deep learning, seismic processing, and data analytics are all examples of tasks AMD says the dual GPU card will excel at. These are all workloads that can be greatly accelerated by the massively parallel nature of a GPU but do not need to be as precise as the mathematics, modeling, and simulation work that depends on FP64 performance. In that respect, the FirePro S9300 x2 has only 870 GFLOPS of double precision compute performance.

Further, this card supports a GPGPU-optimized Linux driver stack under AMD's GPUOpen initiative, and developers can program for it using either OpenCL (it supports OpenCL 1.2) or C++. AMD PowerTune and the return of FP16 support are also features. AMD claims that its new dual GPU card is twice as fast as the NVIDIA Tesla M40 (and 1.6x the K80) and 12 times as fast as the latest Intel Xeon E5 in peak single precision floating point performance.
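For a sense of what targeting the card looks like from the host side, here is a minimal device-enumeration sketch using the third-party pyopencl bindings (any OpenCL 1.2 runtime and GPU will do; this is not AMD-specific sample code):

```python
# Sketch: enumerate OpenCL GPUs the way a host application targeting this
# card would. Requires the third-party pyopencl package and an installed
# OpenCL runtime; the exact device strings will vary by driver.
import pyopencl as cl

for platform in cl.get_platforms():
    for device in platform.get_devices(device_type=cl.device_type.GPU):
        print(device.name,
              device.global_mem_size // (1024 ** 3), "GiB global memory,",
              device.version)
```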

The double slot card is powered by two PCI-E power connectors and is rated at 300 watts. This is a bit more palatable than the triple 8-pin needed for the Radeon Pro Duo!

The FirePro S9300 x2 comes with a three year warranty and will be available in the second half of this year for $6,000 USD. You are definitely paying a premium for the professional certifications and support. Here's hoping developers come up with some cool uses for the dual 8.9 billion transistor GPUs and their HBM memory!

Source: AMD

NVIDIA Grants $200,000 to UoT for Cancer Research

Subject: Graphics Cards | November 29, 2015 - 05:52 PM |
Tagged: nvidia, cancer research, gpgpu

The University of Toronto has just received a $200,000 grant from the NVIDIA Foundation for research into identifying genetic links to cancer. The institution uses GPUs to learn and identify the mutations that cause the disease, which is hoped to eventually help characterize a specific patient's cancer and guide targeted treatments. Their "next step" is comparing their technology with data from patients.


I am not too informed on cancer research, so I will point to the article and its sources for specifics. The team states that the libraries they create will be freely available to other biomedical researchers. They don't mention specific licenses or anything, but the article is not really an appropriate venue for that sort of discussion.

Source: nvidia

Microsoft Allows Developer Use of Kinect-Reserved Shaders

Subject: General Tech | June 7, 2014 - 04:32 AM |
Tagged: microsoft, xbox one, xbone, gpgpu, GCN

Shortly after the Kinect deprecation, Microsoft announced that a 10% boost in GPU performance will be coming to the Xbox One. This, of course, comes from allowing developers to avoid the overhead that Kinect typically requires for its various tasks. Updated software will allow game developers to reclaim some or all of that compute time.


Still looks like Wall-E grew a Freddie Mercury 'stache.

While it "might" (who am I kidding?) be used to berate Microsoft for ever forcing the Kinect upon users in the first place, this functionality was planned from before launch. Pre-launch interviews stated that Microsoft was looking into scheduling their compute tasks while the game was busy, for example, hammering the ROPs and leaving the shader cores idle. This could be that, and only that, or it could be a bit more if developers are allowed to opt out of most or all Kinect computations altogether.

The theoretical maximum compute and shader performance of the Xbox One's GPU is still about 29% less than that of its competitor, the PS4. Still, 29% less is better than about 36% less. Not only that, but the final result will always come down to the amount of care and attention that developers spend on any given title. This will give them more breathing room, though.
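For those keeping score, the percentages work out like this, assuming the commonly cited peak shader figures of roughly 1.31 TFLOPS for the Xbox One and 1.84 TFLOPS for the PS4 (my assumed numbers, not from the source article):

```python
# Rough check of the "29% less" and "36% less" figures, using assumed
# peak shader throughput numbers for each console.
xbox_one_tflops, ps4_tflops = 1.31, 1.84

deficit_full = 1 - xbox_one_tflops / ps4_tflops                   # ~29% behind with the full GPU
deficit_with_reserve = 1 - (0.9 * xbox_one_tflops) / ps4_tflops   # ~36% behind with 10% held for Kinect

print(f"{deficit_full:.0%} less vs. {deficit_with_reserve:.0%} less")
```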

Then, of course, the PC has about 3x the shader performance of either of those systems in a few single-GPU products. Everything should be seen in perspective.

Source: Eurogamer

Asus Launches GTX TITAN Z Dual GK110 Graphics Card

Subject: Graphics Cards | May 2, 2014 - 01:29 AM |
Tagged: titan z, nvidia, gpgpu, gk110, dual gpu, asus

NVIDIA unveiled the GeForce GTX TITAN Z at the GPU Technology Conference last month, and the cards will soon be for sale from various partners. ASUS will be one of the first AIB partners to offer a reference TITAN Z.

The ASUS GTX TITAN Z pairs two full GK110-based GPUs with 12GB of GDDR5 memory. The graphics card houses a total of 5,760 CUDA cores, 480 texture mapping units (TMUs), and 96 ROPs. Each GK110 GPU interfaces with 6GB of GDDR5 memory via a 384-bit bus. ASUS is using reference clockspeeds with this card, which means a 705 MHz base and up to 876 MHz GPU Boost for the GPUs and 7.0 GHz (effective) for the memory.
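Those totals follow directly from GK110's per-GPU resources; a quick sketch (the per-GPU TMU and ROP counts are the standard full-GK110 figures, not from the ASUS spec sheet):

```python
# Working the TITAN Z's headline numbers back from two full GK110 GPUs.
cores_per_gpu, tmus_per_gpu, rops_per_gpu = 2880, 240, 48
mem_clock_eff_ghz, bus_width_bits = 7.0, 384

print("CUDA cores:", 2 * cores_per_gpu)    # 5,760
print("TMUs:      ", 2 * tmus_per_gpu)     # 480
print("ROPs:      ", 2 * rops_per_gpu)     # 96

# Memory bandwidth per GPU = effective clock x bus width / 8 bits per byte
print("Bandwidth: ", mem_clock_eff_ghz * bus_width_bits / 8, "GB/s per GPU")  # 336.0
```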


For comparison, the dual-GPU TITAN Z is effectively two GTX TITAN Black cards on a single PCB. However, the TITAN Black runs at 889 MHz base and up to 980 MHz GPU Boost. A hybrid water cooling solution may have allowed NVIDIA to maintain the clockspeed advantage, but doing so would compromise the only advantage the TITAN Z has over using two (much cheaper) TITAN Blacks in a workstation or server: card density. A small hit in clockspeed will be a manageable sacrifice for the target market, I believe.

The ASUS GTX TITAN Z has a 375W TDP and is powered by two 8-pin PCI-E power connectors. The new flagship dual GPU NVIDIA card has an MSRP of $3,000 and should be available in early May.

Source: Asus

NVIDIA Will Present Global Impact Award And $150,000 Grant To Researchers At GTC 2015

Subject: General Tech | April 8, 2014 - 05:03 PM |
Tagged: research, nvidia, GTC, gpgpu, global impact award

During the GPU Technology Conference last month, NVIDIA introduced a new annual grant called the Global Impact Award. The grant awards $150,000 to researchers using NVIDIA GPUs to research issues with worldwide impact such as disease research, drug design, medical imaging, genome mapping, urban planning, and other "complex social and scientific problems."


NVIDIA will present the Global Impact Award to the winning researcher or non-profit institution at next year's GPU Technology Conference (GTC 2015). Individual researchers, universities, and non-profit research institutions that are using GPUs as a significant enabling technology in their research are eligible for the grant. Both third party and self-nominations (.doc form) are accepted, with the nominated candidates evaluated on several factors including the level of innovation, social impact, and the current state of the research and its effectiveness in approaching the problem. Submissions for nominations are due by December 12, 2014, with the finalists being announced by NVIDIA on March 13, 2015. NVIDIA will then reveal the winner of the $150,000 grant at GTC 2015 (April 28, 2015).

The researcher, university, or non-profit firm can be located anywhere in the world, and the grant money can be assigned to a department, initiative, or a single project. The massively parallel nature of modern GPUs makes them ideal for many types of research with scalable projects, and I think the Global Impact Award is a welcome incentive to encourage the use of GPGPU in applicable research projects. I am interested to see what the winner will do with the money and where the research leads.

More information on the Global Impact Award can be found on the NVIDIA website.

Source: NVIDIA

GTC 2014: NVIDIA Awards Startup Map-D $100,000 In Early Stage Challenge

Subject: General Tech | March 26, 2014 - 08:49 PM |
Tagged: remote graphics, nvidia, GTC 2014, gpgpu, emerging companies summit, ecs 2014, cloud computing

NVIDIA started the Emerging Companies Summit six years ago, and since then the event has grown in size and scope to identify and support technology companies that leverage (or plan to leverage) GPGPU computing to deliver innovative products. The ECS continues to be a platform for new startups to showcase their work at the annual GPU Technology Conference. NVIDIA provides legal, development, and co-marketing support to the companies featured at ECS.


There was an interesting twist this year, though, in the form of the Early Stage Challenge, a new addition to ECS alongside the ‘One to Watch’ award. I attended the Emerging Companies Summit again this year and managed to snag some photos and participate in the Early Stage Challenge (disclosure: I voted for AudioStream TV).


The 12 Early Stage Challenge contestants take the stage at once to await the vote tally.

During the challenge, the 12 selected startup companies were each given eight minutes on stage to pitch their company and explain why their innovations were deserving of the $100,000 grand prize. The on-stage time was divided into a four minute presentation and a four minute Q&A session with the panel of judges (unlike last year, the audience was not part of the Q&A session at ECS this year due to time constraints).

After all 12 companies had their chance on stage, the panel of judges and the audience submitted their votes for the most innovative startup. The panel of judges included:

  • Scott Budman, Business & Technology Reporter, NBC
  • Jeff Herbst, Vice President of Business Development, NVIDIA
  • Jens Horstmann, Executive Producer & Managing Partner, Crestlight Venture Productions
  • Pat Moorhead, President & Principal Analyst, Moor Insights & Strategy
  • Bill Reichert, Managing Director, Garage Technology Ventures

The companies participating in the challenge include Okam Studio, MyCloud3D, Global Valuation, Brytlyt, Clarifai, Aerys, oMobio, ShiVa Technologies, IGI Technologies, Map-D, Scalable Graphics, and AudioStream TV. The companies are involved in machine learning, deep neural networks, computer vision, remote graphics, real time visualization, gaming, and big data analytics.

After all the votes were tallied, Map-D was revealed to be the winner and received a check for $100,000 from NVIDIA Vice President of Business Development Jeff Herbst.


Jeff Herbst presenting Map-D's CEO with the Early Stage Challenge grand prize check. From left to right: Scott Budman, Jeff Herbst, and Thomas Graham.

Map-D is a company that specializes in a scalable in-memory GPU database that promises millisecond queries served directly from GPU memory (with GPU memory bandwidth being the bottleneck) and very fast database inserts. The company is working with Facebook and PayPal to analyze data. In the case of Facebook, Map-D is being used to analyze status updates in real time to identify malicious behavior. The software can be scaled across eight NVIDIA Tesla cards to analyze a billion Twitter tweets in real time.
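A quick back-of-envelope shows why millisecond-scale scans over a billion records are plausible when the data already lives in GPU memory. The per-card bandwidth and per-record size here are my own rough assumptions, not Map-D's published figures:

```python
# Back-of-envelope: scanning a billion records held in GPU memory.
# All figures are illustrative assumptions (Tesla-class memory bandwidth
# and ~100 bytes of hot columns per tweet), not Map-D's numbers.
gpus = 8
bw_per_gpu_gbs = 288            # assumed memory bandwidth per card, GB/s
bytes_per_record = 100          # assumed hot-column footprint per tweet
records = 1_000_000_000

scan_bytes = records * bytes_per_record            # ~100 GB of hot data
aggregate_bw = gpus * bw_per_gpu_gbs * 1e9         # ~2.3 TB/s across the cards
scan_time_ms = scan_bytes / aggregate_bw * 1e3

print(f"Full scan of {records:,} records: ~{scan_time_ms:.0f} ms")
```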

It is specialized software, but extremely useful within its niche. Hopefully the company puts the prize money to good use in furthering its GPGPU endeavors. Although there was only a single grand prize winner, I found all the presentations interesting and look forward to seeing where they go from here.

Read more about the Emerging Companies Summit (from last year) and keep track of new GTC 2014 articles by following the GTC 2014 tag @ PC Perspective.

Source: PC Perspective

Intel Xeon Phi to get Serious Refresh in 2015?

Subject: General Tech, Graphics Cards, Processors | November 28, 2013 - 03:30 AM |
Tagged: Intel, Xeon Phi, gpgpu

Intel was testing the waters with its Xeon Phi co-processor. Based on the architecture designed for the original Pentium processors, it was released in six products ranging from 57 to 61 cores and 6 to 16GB of RAM. This led to double precision performance of between 1 and 1.2 TFLOPS. It was fabricated using Intel's 22nm tri-gate technology. All of this was under the Knights Corner initiative.
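Those peak numbers follow from Knights Corner's 512-bit vector units, which let each core retire 8 double-precision fused multiply-adds per cycle. A rough sketch (the clock speeds are assumptions roughly matching the bottom and top of the product stack):

```python
# Knights Corner peak FP64 = cores x 8 doubles per 512-bit vector x 2 (FMA) x clock.
def knc_peak_fp64_tflops(cores, clock_ghz):
    return cores * 8 * 2 * clock_ghz / 1e3

print(knc_peak_fp64_tflops(57, 1.10))   # ~1.0 TFLOPS at the low end of the range
print(knc_peak_fp64_tflops(61, 1.24))   # ~1.2 TFLOPS for the 61-core parts
```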


In 2015, Intel plans to have Knights Landing ready for consumption. A modified Silvermont architecture will replace the many simple (basically 15-year-old) cores of the previous generation; up to 72 Silvermont-based cores (each with 4 threads), in fact. It will also introduce the AVX-512 instruction set, which allows applications to vectorize operations on 8 64-bit (double-precision float or long integer) or 16 32-bit (single-precision float or standard integer) values at once.

In other words, packing a bunch of related problems into a single instruction.
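If you want to picture that, NumPy's whole-array operations are a reasonable software analogy (NumPy is just a stand-in here; the point is expressing many identical operations as one, which AVX-512 does in hardware 16 single-precision lanes at a time):

```python
# Illustration of the "pack related problems into one instruction" idea.
import numpy as np

a = np.random.rand(1_000_000).astype(np.float32)
b = np.random.rand(1_000_000).astype(np.float32)

# Scalar view: one multiply-add per loop iteration.
# result = [a[i] * b[i] + 1.0 for i in range(len(a))]

# Vectorized view: the same work expressed as whole-array operations,
# which the backend can map onto wide SIMD units many floats at a time.
result = a * b + 1.0
```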

The most interesting part? Two versions will be offered: add-in boards (AIBs) and a standalone CPU. Thanks to its x86 heritage, it will not require a host CPU if your application is entirely suited to the MIC architecture; unlike a Tesla, it is bootable with existing and common OSes. It can also be paired with standard Xeon processors if you would like a few strong threads alongside the 288 (72 x 4) that the Xeon Phi provides.

And, while I doubt Intel would want to cut anyone else in, VR-Zone notes that this opens the door for AIB partners to make non-reference cards and manage some level of customer support. I'll believe a non-Intel branded AIB only when I see it.

Source: VR-Zone
Manufacturer: Scott Michaud

A new generation of Software Rendering Engines.

We have been busy with side projects here at PC Perspective over the last year. Ryan has nearly broken his back rating the frames. Ken, along with running the video equipment and "getting an education", developed a hardware switching device for Wirecast and XSplit.

My project, "Perpetual Motion Engine", has been researching and developing a GPU-accelerated software rendering engine. Now, to be clear, this is just in very early development for the moment. The point is not to draw beautiful scenes. Not yet. The point is to show what OpenGL and DirectX does and what limits are removed when you do the math directly.

Errata: BioShock uses a modified Unreal Engine 2.5, not 3.

In the above video:

  • I show the problems with graphics APIs such as DirectX and OpenGL.
  • I talk about what those APIs attempt to solve, finding color values for your monitor.
  • I discuss the advantages of boiling graphics problems down to general mathematics.
  • Finally, I prove the advantages of boiling graphics problems down to general mathematics.

I would recommend watching the video, first, before moving forward with the rest of the editorial. A few parts need to be seen for better understanding.

Click here, after you watch the video, to read more about GPU-accelerated Software Rendering.