Yes, We’re Writing About a Forum Post

What is asynchronous compute, and how is it interpreted?

Update – July 19th @ 7:15pm EDT: Well, that was fast. Futuremark published their statement today. I haven't read it through yet, but there's no reason to hold off on linking it until I do.

Update 2 – July 20th @ 6:50pm EDT: We interviewed Jani Joki, Futuremark's Director of Engineering, on our YouTube page. The interview is embedded just below this update.

Original post below

The comments of a previous post alerted us to an Overclock.net thread, whose author claims that 3DMark's implementation of asynchronous compute is designed to show NVIDIA in the best possible light. At the end of the linked post, they note that asynchronous compute is a general blanket term, and that we should better understand what is actually going on.

So, before we address the controversy, let's explain what asynchronous compute actually is. The main problem is that it is a broad term: asynchronous compute could describe any optimization that allows tasks to execute when it is most convenient, rather than just blindly running them one after another.

I will use JavaScript as a metaphor. In this language, you can assign tasks to be executed asynchronously by passing functions as parameters. This allows events to execute code when it is convenient. JavaScript, however, is still single-threaded (without Web Workers and newer technologies). It cannot run callbacks from multiple events simultaneously, even if you have an available core on your CPU. What it does, however, is allow the browser to manage its time better. Many events can be deferred until the browser has rendered the page, has finished other high-priority tasks, or until the asynchronous code has everything it needs, like assets that are loaded from the internet.

This is asynchronous computing.

However, if JavaScript were designed differently, it would be possible to run callbacks on any available thread, not just on the main thread when it is free. Again, JavaScript is not designed in this way, but this is where I pull the analogy back to AMD's Asynchronous Compute Engines. In an ideal situation, a graphics driver will be able to see all the functionality that a task requires and shove that task down an already-working GPU, provided the specific resources the task needs are not fully utilized by the existing work.
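To make the "callbacks on any available thread" idea a bit more concrete on the CPU side, here is a small C++ sketch of my own (not anything from AMD, NVIDIA, or Futuremark) using std::async: the task runs on another thread if one is available, and the caller only synchronizes at the point where it actually needs the result.

```cpp
// Loose CPU-side analogy: independent work runs wherever there is idle
// capacity, and we only synchronize when the result is actually needed.
#include <future>
#include <iostream>

int expensiveComputation()
{
    // Stand-in for real work (e.g. decompressing an asset).
    int sum = 0;
    for (int i = 0; i < 1000000; ++i) sum += i % 7;
    return sum;
}

int main()
{
    // std::launch::async asks for the task to run on its own thread.
    std::future<int> result = std::async(std::launch::async, expensiveComputation);

    // The caller is free to do other useful work in the meantime.
    std::cout << "Doing other work while the task runs...\n";

    // Only now do we block, because we actually need the value.
    std::cout << "Result: " << result.get() << "\n";
    return 0;
}
```

The analogy is loose, but the scheduling idea is the same one the paragraph above describes at the GPU level: run independent work on whatever resources are idle, and synchronize only when something downstream needs the output.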

Read on to see how this is being implemented, and what the controversy is.

A simple example of this is performing memory transfers from the Direct Memory Access (DMA) queues while a shader or compute kernel is running. This is a trivial example, because I believe every Vulkan- or DirectX 12-supporting GPU can do it, even the mobile ones. NVIDIA, for instance, added this feature with CUDA 1.1 and the Tesla-based GeForce 9000 cards. It's discussed alongside other forms of asynchronous compute in DX12 and Vulkan programming talks, though.
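As a rough illustration of that DMA-queue idea, here is a minimal, hypothetical C++/Direct3D 12 sketch (my own code, assuming the Windows 10 SDK and linking against d3d12.lib, with error handling omitted) that creates a dedicated copy queue next to the usual direct queue. Copy work submitted to the copy queue can overlap with shader work on hardware that supports it.

```cpp
// Illustrative D3D12 sketch: a dedicated copy queue mapping onto the GPU's
// DMA engines, separate from the direct (graphics/compute) queue.
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<ID3D12Device> device;
    // Use the default adapter; a real application would enumerate adapters
    // and check the returned HRESULTs.
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    // The "direct" queue accepts graphics, compute, and copy commands.
    D3D12_COMMAND_QUEUE_DESC directDesc = {};
    directDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    ComPtr<ID3D12CommandQueue> directQueue;
    device->CreateCommandQueue(&directDesc, IID_PPV_ARGS(&directQueue));

    // The "copy" queue maps onto the DMA engines. Copy command lists
    // submitted here can proceed while shader work runs on the direct queue.
    D3D12_COMMAND_QUEUE_DESC copyDesc = {};
    copyDesc.Type = D3D12_COMMAND_LIST_TYPE_COPY;
    ComPtr<ID3D12CommandQueue> copyQueue;
    device->CreateCommandQueue(&copyDesc, IID_PPV_ARGS(&copyQueue));

    // From here, an application would record copy command lists (e.g.
    // CopyBufferRegion / CopyTextureRegion) for the copy queue while the
    // direct queue churns through draw or dispatch work.
    return 0;
}
```

Whether the transfer truly overlaps with shader work is ultimately up to the hardware and driver; the API only exposes the separate queue.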

What AMD has been pushing, however, is the ability to cram compute and graphics workloads together. When a task uses the graphics ASICs of a GPU, along with maybe a little bit of the shader capacity, the graphics driver could increase overall performance by cramming a compute task into the rest of the shader cores. This has the potential to be very useful. When I talked with a console engineer at Epic Games last year, he gave me a rough, before-bed-at-midnight-on-a-weekday estimate that ~10-25% of the Xbox One's GPU is idling. This doesn't mean that asynchronous compute will give a 10-25% increase in performance on that console, just that there's, again, ballpark, that much performance left on the table.
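For the graphics-plus-compute case, the general Direct3D 12 pattern looks something like the sketch below: the application creates a second, compute-type queue alongside its graphics queue, and uses a fence so the graphics queue only waits on the async compute results at the point where it actually consumes them. The queue and command-list names here are hypothetical stand-ins of mine; this shows the broad D3D12 mechanism, not Futuremark's actual implementation.

```cpp
// Hypothetical sketch of submitting one frame with an async compute pass.
// Assumes the device, queues, and recorded command lists already exist;
// error handling omitted.
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

void SubmitFrame(ID3D12Device* device,
                 ID3D12CommandQueue* graphicsQueue,          // D3D12_COMMAND_LIST_TYPE_DIRECT
                 ID3D12CommandQueue* computeQueue,           // D3D12_COMMAND_LIST_TYPE_COMPUTE
                 ID3D12CommandList* independentGraphicsWork, // does not need the compute results
                 ID3D12CommandList* asyncComputeWork,        // e.g. particles or light culling
                 ID3D12CommandList* dependentGraphicsWork)   // consumes the compute results
{
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // Graphics work with no dependency on the compute pass starts immediately.
    graphicsQueue->ExecuteCommandLists(1, &independentGraphicsWork);

    // The compute pass runs on its own queue; on hardware that can co-schedule
    // it, this fills shader units the graphics work leaves idle.
    computeQueue->ExecuteCommandLists(1, &asyncComputeWork);
    computeQueue->Signal(fence.Get(), 1);

    // The wait happens on the GPU timeline: the graphics queue only stalls
    // here, right before the work that actually reads the compute output.
    graphicsQueue->Wait(fence.Get(), 1);
    graphicsQueue->ExecuteCommandLists(1, &dependentGraphicsWork);
}
```

The important part is the pair of queues and the fence: the API expresses that the two streams of work are independent, and the hardware and driver decide how much of it actually runs concurrently, which is exactly where the vendor differences discussed below come in.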

I've been asking around to see how this figure will scale, be it with clock rate, shader count, or whatever else. No one I've asked seems to know. It might be an increasing benefit going forward… or it might not. Today? All we have to go on are a few benchmarks and test cases.

The 3DMark Time Spy Issue

The accusation made in the forum post is that 3DMark's usage of asynchronous compute more closely fits NVIDIA's architecture than it does AMD's. Under DOOM and Ashes of the Singularity, the AMD Fury X performs better than the GTX 1070. Under 3DMark Time Spy, however, it performs worse than the GTX 1070. They also claim that Maxwell does not take the performance hit it should if it were running code designed for AMD's use cases.

First, it is interesting that AMD's Fury X doesn't perform as well as the GTX 1070 in Time Spy. There could be many reasons for it. Futuremark might not have optimized for AMD as well as they could have, AMD could still be updating their drivers, or NVIDIA could be in the process of updating their drivers for the other two games. We don't know. That said, if 3DMark could be better optimized for AMD, then Futuremark should obviously do it. I would be interested to know whether AMD brought up the issue with Futuremark pre-launch, and what their take is on the performance gap.

As for Maxwell not receiving a performance hit? I find that completely reasonable. A game developer will tend to avoid a performance-reducing feature on certain GPUs. It is not 3DMark's responsibility to intentionally enable a code path that would produce identical results, just with a performance penalty. To be clear, the post didn't suggest that they should; I just want to underscore how benchmarks are made. All vendors submit their requests during the designated period, and then the benchmark is worked on until it is finalized.

At the moment, 3DMark seems to run counter to the other two examples of asynchronous compute that we have, leaving AMD with lower performance than expected relative to NVIDIA. I would be curious to see what both graphics vendors, especially AMD as mentioned above, have to say about this issue.

As for which interpretation is better? Who knows. It seems like AMD's ability to increase the load on a GPU will be useful going forward, especially as GPUs get more complex, since the logic required for asynchronous compute doesn't seem like it would scale much in complexity alongside them.

For today's GPUs? We'll need to keep watching and see how software evolves. Bulldozer was a clever architecture, too, but software didn't evolve in the way that AMD expected, making the redundancies they eliminated not as redundant as they had hoped. Unlike Bulldozer, though, asynchronous compute is being adopted, both on the PC and on consoles. Again, we'll need to see statements from AMD, NVIDIA, and Futuremark before we can predict how current hardware will perform in future software.

Update @ 7:15pm: As stated at the top of the post, Futuremark released a statement right around the time I was publishing.