AMD Details hUMA: HSA in Action!

Subject: Processors
Manufacturer: AMD

heterogeneous Uniform Memory Access


Several years back we first heard of AMD’s plans to create a uniform memory architecture that would allow the CPU to share address spaces with the GPU.  The promise here is a very efficient architecture that provides excellent performance in a mixed environment of serial and parallel programming loads.  When GPU computing came on the scene it was full of great promise.  The idea of a heavily parallel processing unit that could accelerate both integer and floating point workloads looked like a potential gold mine in a wide variety of applications.  Alas, judging by the results so far, the technology has not met expectations.  There are many problems with combining serial and parallel workloads between CPUs and GPUs, and a lot of this has to do with very basic programming and the communication of data between two separate memory pools.

CPUs and GPUs do not share common memory pools.  Instead of using pointers to tell each unit where data is stored in memory, the current implementation of GPU computing requires the CPU to copy the contents of that address into the standalone memory pool of the GPU.  This is time consuming and wastes cycles.  It also increases programming complexity, since code has to be written around those copies.  Typically only very advanced programmers with a lot of expertise in this subject could write effective code that takes these limitations into consideration.  The lack of unified memory between CPU and GPU has hindered the adoption of the technology for a lot of applications which could potentially use the massively parallel processing capabilities of a GPU.
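To make the difference concrete, here is a toy Python sketch of the two models.  It is not real GPU code: plain arrays stand in for the CPU and GPU memory pools, and the function names `offload_with_copies` and `offload_shared` are made up for illustration.  The first function stages data into a separate buffer and copies the result back, while the second simply hands over a reference to the same buffer.

```python
import array


def offload_with_copies(host_data):
    """Copy model: stage data into a separate 'device' pool, compute, copy back."""
    device_buffer = array.array("d", host_data)    # copy 1: host pool -> device pool
    for i in range(len(device_buffer)):            # stand-in for a GPU kernel
        device_buffer[i] *= 2.0
    result = array.array("d", device_buffer)       # copy 2: device pool -> host pool
    bytes_moved = 2 * len(host_data) * host_data.itemsize
    return result, bytes_moved


def offload_shared(host_data):
    """Shared model: the 'device' works through a reference to the same buffer."""
    view = memoryview(host_data)                   # no data is copied here
    for i in range(len(view)):
        view[i] *= 2.0                             # writes land in the original buffer
    return host_data, 0


data = array.array("d", [1.0, 2.0, 3.0])
copied_result, moved = offload_with_copies(data)   # data itself is left untouched
shared_result, moved_shared = offload_shared(data) # data is doubled in place
```

In the copy model the caller pays for two full transfers and ends up with a second buffer to manage.  In the shared model nothing moves, which is the behavior hUMA is aiming for in hardware.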

The idea for GPU compute has been around for a long time (comparatively).  I still remember getting very excited about the idea of using a high end video card along with a card like the old GeForce 6600 GT to be a coprocessor which would handle heavy math operations and PhysX.  That particular plan never quite came to fruition, but the idea was planted years before the actual introduction of modern DX9/10/11 hardware.  It seems as if this step with hUMA could actually provide a great amount of impetus to implement a wide range of applications which can actively utilize the GPU portion of an APU.

hUMA, Slightly More Detailed

The idea behind hUMA is quite simple: the CPU and GPU share memory resources, they can use pointers to access data that has been processed by either unit, and the GPU can take page faults rather than relying only on page locked memory.  Memory in this case is bi-directionally coherent, so neither the CPU nor the GPU has to wait excessively to use data that the other has changed in cache but not yet written to main memory.

Current APUs work by partitioning off a chunk of main memory and holding onto it for dear life.  Some memory can be dynamically allocated, depending on the architecture we are dealing with.  Typically upon boot the integrated graphics will partition off a section of memory and keep it for its own.  The CPU cannot address that memory, and in fact it appears gone for all intents and purposes.  hUMA will change this.  The entire memory space will be available to both the CPU and GPU, and they end up sharing this resource just as another CPU with full coherency would with the primary CPU.  This not only applies to the physical memory, but also to the virtual memory space.

Standalone GPUs can benefit from HSA, but not to the extent of an APU.  A dedicated GPU will have its own attached memory as well as shared memory with the main CPU.  Due to the communication latency issues of writing from main memory to the video card’s memory, it is not nearly as seamless as what an APU can accomplish.  It makes sense that such a setup would benefit a solution with a shared memory pool as well as a shared memory controller.  Everything else involves more latency and differing amounts and types of memory.
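A rough way to reason about this is a break-even check: offloading only pays off when the GPU compute time plus the time spent moving data across the link is less than the time the CPU would need on its own.  The sketch below is a simplified model under stated assumptions, not a benchmark.  The function names are hypothetical, the input and the result are assumed to be the same size, and the timing and bandwidth figures in the example are invented purely for illustration.

```python
def transfer_seconds(num_bytes, bandwidth_bytes_per_s, latency_s=0.0):
    """Time for one trip across the link: fixed latency plus size over bandwidth."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def offload_is_faster(cpu_s, gpu_s, num_bytes, bandwidth_bytes_per_s, latency_s=0.0):
    """True when GPU time plus copy time beats doing the work on the CPU.

    Assumes the data crosses the link twice (inputs out, results back)
    and that both trips move the same number of bytes.
    """
    copy_s = 2 * transfer_seconds(num_bytes, bandwidth_bytes_per_s, latency_s)
    return gpu_s + copy_s < cpu_s


# Hypothetical small job: 1 ms on the CPU, 0.1 ms on the GPU, 8 MB of data,
# and a link that moves 8 GB/s.  The copies dominate, so offloading loses.
discrete_wins = offload_is_faster(1e-3, 1e-4, 8_000_000, 8e9)

# Same job with nothing to copy, as in a shared memory pool.
shared_wins = offload_is_faster(1e-3, 1e-4, 0, 8e9)
```

With these made-up numbers the discrete case loses and the shared case wins, even though the GPU itself is ten times faster in both.  Real links also have setup costs that this model folds into a single `latency_s` term.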

As seen in the slides, AMD has covered the very high level design features of hUMA.  The first product that will feature this architecture will be the Kaveri based APUs, which will be introduced in 2H 2013.  These are Steamroller based parts with a GCN based graphics portion.  AMD is not giving more specific guidance about when this product will be released, but from all indications it will be more of a Q4 product in terms of availability.  Something of note is that the recently released Kabini processor does not fully utilize hUMA.  Though the APU features the latest Jaguar low power CPU core and the latest generation GCN based graphics portion, it is not a fully hUMA enabled part.  It appears to have the same basic limitations as the previous Llano and current Trinity APUs when it comes to memory allocation and sharing.

The greatest advantage of hUMA is likely the ease of programming as compared to current OpenCL and CUDA based solutions.  Often functions have to be programmed twice, once for the GPU and once for the CPU, and then results have to be copied between the individual memory pools so that each unit can read what the other has produced.  This is not only a lot of extra work, but the knowledge needed to do it adequately was typically reserved for elite level programmers with a keen understanding of the two different programming models.

Here we can see exactly how CPU and GPU performance have compared in terms of pure GFLOPS.  Parallel computing, while not perfect for every workload, has a lot of potential if programming is implemented effectively.

With hUMA, serial and parallel workloads can be assigned far more appropriately to the hardware units that handle them best.  In heavily parallel loads, a GPU can see a 75% reduction in power usage when compared to a traditional CPU doing the same amount of work.  On the other hand, a CPU is much better suited to heavily serial work, and therefore takes less time and power to achieve the same result than a GPU trying to compute the same load.  By implementing this very key piece of technology, AMD and its HSA partners are hoping to further heterogeneous computing.  This technology is being shared with members of the HSA group with the hope that it will become the standard for heterogeneous systems, much like AMD64 became the standard for x86 computing.
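As a back-of-the-envelope illustration of that 75% figure, the sketch below estimates the energy of a mixed job when its parallel portion is moved to the GPU.  This is only arithmetic under assumptions: the 75% power reduction is treated as a 75% energy reduction for the same work, the serial portion is assumed to stay on the CPU at unchanged cost, and the function name and the 80/20 split in the example are hypothetical.

```python
GPU_PARALLEL_REDUCTION = 0.75  # figure quoted above for heavily parallel loads


def mixed_job_energy(cpu_only_joules, parallel_fraction):
    """Estimated energy when the parallel part runs on the GPU.

    Assumes the serial part stays on the CPU at unchanged cost and the
    parallel part costs (1 - 0.75) of what it would cost on the CPU.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_part = cpu_only_joules * (1.0 - parallel_fraction)
    parallel_part = cpu_only_joules * parallel_fraction * (1.0 - GPU_PARALLEL_REDUCTION)
    return serial_part + parallel_part


# Hypothetical job that costs 100 J on the CPU alone and is 80% parallel.
estimate = mixed_job_energy(100.0, 0.8)
```

Under these assumptions the example job drops from 100 J to roughly 40 J.  The serial 20% caps the saving, which is why the split between the two kinds of work matters as much as the GPU's efficiency.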

April 30, 2013 | 05:24 AM - Posted by razor512

Why can't they allow 2 way sharing between the dedicated video card and the main system? E.g., if you have a 2 GB video card and you are not gaming, why not allow Windows to use that memory for something?

Or better yet, why not seamlessly combine the CPU and GPU to work on the same task, without requiring any special programming like CUDA or OpenCL? E.g., have the CPU auto detect the processing needed and offload the work to the GPU if needed.

April 30, 2013 | 09:01 AM - Posted by nabokovfan87

Discrete GPU memory usage is going to fall into how much faster is that memory than the onboard stuff. If you have 2 GB on either and one is GDDR3 and the other is GDDR5, then it makes sense to use the discrete GPU memory to overcome the distance. They didn't say it couldn't be done, they simply said there was an inherent disadvantage to doing it that way.

Here is the text from the article:

Standalone GPUs can benefit from HSA, but not to the extent of an APU. A dedicated GPU will have its own attached memory as well as shared memory with the main CPU. Due to the communication latency issues of writing from main memory to the video card’s memory, it is not nearly as seamless as what an APU can accomplish. It makes sense that such a setup would benefit a solution with a shared memory pool as well as a shared memory controller. Everything else involves more latency and differing amounts and types of memory.

April 30, 2013 | 05:37 PM - Posted by Kerrash (not verified)

You are not a developer are you? :p

April 30, 2013 | 09:50 PM - Posted by nabokovfan87

Nope! I work on airplane seats at the moment. Went to school for Computer Engineering.

May 1, 2013 | 03:10 PM - Posted by Lord Binky (not verified)

Converting algorithms from sequential to parallel is a non-trivial problem that can quickly induce headaches if you are trying to do it well. There isn't a simple conversion, much less one that provides the gains you desire from parallel processing.

As for letting Windows use the RAM on the GPU as extra, for discrete GPUs the data has to go through the PCI-E bus, which relative to the CPU's memory bus is really slow, so you gain nothing from it.

May 1, 2013 | 06:00 PM - Posted by razor512

While there are bandwidth limits, it is still better than moving to virtual memory.

Kinda like having a game use up the video card's memory before dumping textures to system memory, it would be cool if applications could use spare memory in the video card as kind of an L2 cache/memory before resorting to virtual memory.

April 30, 2013 | 07:19 AM - Posted by Sublym3 (not verified)

A few questions:
1. Will this require an OS to support the feature? Or require a driver to be installed?

2. Will applications/games need to be coded with this in mind? Will there be any advantages or disadvantages for today's software?

3. Is this limited to the integrated GPU, or can a discrete GPU also be used?

April 30, 2013 | 09:02 AM - Posted by nabokovfan87

1. It can be implemented in any variety of ways. For AMD's system, I'm thinking it is on the memory controller itself. Meaning, the CPU will have some logic to recognize what type of command is being asked (integer vs. floating point math), and it will assign the job based on that priority. If floating point, then use GPU memory space; if not, use CPU memory space.

That is a very straightforward example, but it gives you an idea of what is going on.

Another way of doing it (for instance) would be to make it OS based. You have the OS do the matching as far as what type of command it wants fulfilled, and it determines the best place to complete that request and assigns it accordingly.

2. Depends on how it is implemented in 1. With AMD's solution, it is happening on a hardware refresh, meaning that all of that information should be mostly hardware based and will not require anything in particular. As far as how things are affected, think of it in terms of x86 applications on an x64 Windows installation. There is going to be a slight overhead based on the assignment of the calculations, but it is a very small amount compared to the foreseeable gains in performance.

This means that applications not particularly tuned with specific commands used to help the CPU determine where to assign tasks will simply have it happen on the fly. Sort of how x86 still works on an x64 machine, because it's still the same type of system and it still has the majority of things equal. Apps will still work/run because it is, after all, still a computer. Now, if the software is tuned, and the APU doesn't have to do a lot of extra math, simply read input commands and go, then it will be able to see gains similar to what we see when GPUs are used to do heavy parallel processing loads for high end applications like video transcoding, folding, etc. That is assuming the software in question is doing those types of calculations, but yes, there will be a baseline gain in performance overall due to these changes.

3. No, it is not limited. It is going to be more efficient to not do so, however. Think of it this way....

A to B (CPU memory controller to onboard GPU memory) is a few hundred nanometers, but A to C is 100x that distance. So, automatically, with all things being equal, you have a much further distance to travel just to electronically send the signal from one point to another.

That's why a 20 nm chip is inherently much faster than a 45 nm chip with all things being equal.

April 30, 2013 | 09:49 AM - Posted by PaulDriver

Finally, the PC will be feature complete with the Amiga. :) :P

( for the humor impaired this is me being funny.
OMG!! Am I the Humor Impaired One? )

(No not that one, the OOD one)

April 30, 2013 | 10:40 AM - Posted by Pathan Rora Kana (not verified)

Hey, what is this HUMA you are talking about? What is Huma?
fm Makawa Mou

April 30, 2013 | 12:35 PM - Posted by Anonymous (not verified)

Yes, HSA can take advantage of the GPU for general purpose compute, but will there be driver improvements that can take an OpenGL driver and utilize the CPU and the GPU for editing/rendering 3D models, or is this just going to be something that can only be used through OpenCL for GPGPU compute? I hope that in the future the drivers for graphics could be written for HSA, such that the underlying hardware could be abstracted through HSA aware graphics drivers to utilize all of the CPU and GPU resources for graphics when the user needs more graphics power, as well as for GPGPU when the user needs the GPU for compute tasks! Currently in Blender, 3D editing uses mostly the GPU and very little of the CPU for editing 3D models, while rendering for Blender, on the Intel CPU of my laptop, uses mostly the CPU! It would be great if the OpenGL drivers could utilize all of the computational resources on my laptop for editing high polygon count models, and an APU that could do this would be great!

April 30, 2013 | 09:51 PM - Posted by nabokovfan87

More than likely that would be BIOS side from the motherboard and/or Windows side patches.

May 1, 2013 | 11:07 AM - Posted by Josh Walrath

Apparently adding OS support is pretty trivial for HSA.  The big changes are to compilers and libraries.  That part is still hard.  The OS just won't natively say, "Hey, this would run so much better on the GPU portion!"

May 1, 2013 | 10:02 PM - Posted by nabokovfan87

Hey Josh,

Have you heard anything as far as performance of APU vs. discrete GPU effects? Let's say both have GDDR5, what's the % hit? I know it depends on what is being done, but has any of this come out yet?

May 1, 2013 | 10:04 PM - Posted by nabokovfan87

I was thinking it would be more along the lines of "Windows patches" like we had with Bulldozer fixing some of the timing issues, and BIOS updates to actually affect how things perform on the chipset or the chip itself.

We don't traditionally have BIOS updates for a particular processor, but it seems like these should/would benefit from that type of update.

April 30, 2013 | 08:53 PM - Posted by Anonymous (not verified)


GL! waiting on the code zombies, AMD.

May 3, 2013 | 02:38 PM - Posted by Anonymous (not verified)

I was playing with OpenCL some months ago, and one of the problems I found is that the whole program is loaded into CPU memory, then you choose CPU or GPU. If you choose GPU, the whole program is copied to GPU RAM, adding latency. If your program is small, that latency is bigger than the time the CPU requires to do the work itself, so there is no point in doing it. That rules out any interactive program you might wish to make, like games. So this solves that, and maybe some other things I don't know about.

May 4, 2013 | 06:31 AM - Posted by jedibeeftrix (not verified)

The Inquirer is reporting that IOMMU v2.5 is essential for HSA support, and logically this will be included in Kaveri (or its chipset?), but where does this leave what is technically AMD's 'premium' platform, AM3+?

My 890FX is IOMMU compliant, and the 990FX series carry this feature too, but will that extend to v2.5 and thus enable HSA support with some future HSA enabled AMD GPU?

May 6, 2013 | 12:30 PM - Posted by Josh Walrath

Not unless they update that chipset.  Theoretically they could use the A85X as the southbridge and up the support to 2.5.  Not entirely sure they will though.  Plus, AM3+ doesn't have an APU, so hUMA is somewhat limited anyway.

June 25, 2013 | 08:11 PM - Posted by ezjohny

I say if you have an all AMD setup, whether it is an APU or a CPU with a dedicated video card, it should all work together. This is good for the APU architecture design but has little effect for a CPU with a dedicated graphics card. AMD should put an architecture together for FX CPUs and dedicated graphics cards so that both could work together. They already do, but they should improve in this area. It would be good for the competition!!!

PC gamer here!!!
