While Mantle and DirectX 12 are designed to reduce overhead and keep GPUs loaded, the conversation shifts when you are limited by shader throughput. Modern graphics processors are dominated by compute cores, sometimes thousands of them. Video drivers are complex packages of software, and one of their many tasks is converting your scripts, known as shaders, into machine code for the GPU's hardware. If this machine code is efficient, it could mean drastically higher frame rates, especially at extreme resolutions and intense quality settings.
Emil Persson of Avalanche Studios, probably known best for the Just Cause franchise, published his slides and speech on optimizing shaders. His talk focuses on AMD's GCN architecture, because it is found in both consoles and PCs, while bringing up older GPUs for examples. Yes, he has many snippets of GPU assembly code.
AMD's GCN architecture is actually quite interesting, especially dissected as it was in the presentation. It is simpler than its ancestors and much more CPU-like: resources are mapped to memory (and caches of said memory) rather than "slots" (although drivers and APIs often pretend those relics still exist), vectors are mostly treated as collections of scalars, and so forth. Tricks which attempt to combine instructions together into vectors, such as using dot products, can just put irrelevant restrictions on the compiler and optimizer… as it breaks down those vector operations into the very same component-by-component ops that you thought you were avoiding.
Basically, and it makes sense coming from GDC, this talk rarely glosses over points. It goes over execution speed of one individual op compared to another, at various precisions, and which to avoid (protip: integer divide). Also, fused multiply-add is awesome.
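To make the multiply-add point concrete, here is a minimal sketch in Python of the kind of algebraic restructuring involved. It is not taken from the slides; the function names are hypothetical, and Python itself does not emit a fused multiply-add. The sketch only shows how an add-then-multiply expression can be reshaped into the `x * b + c` form that a shader compiler can map to one multiply-add, assuming the constant `a * b` can be computed ahead of time.

```python
def scale_bias_naive(x, a, b):
    # Add, then multiply: two dependent operations per value.
    return (x + a) * b


def scale_bias_mad(x, b, ab):
    # Multiply-add shape: x * b + ab, where ab = a * b is assumed to be
    # precomputed once (for example on the CPU or as a compile-time constant).
    return x * b + ab


if __name__ == "__main__":
    a, b = 0.5, 2.0
    ab = a * b  # hoisted constant
    for x in (0.0, 0.25, 1.0):
        print(x, scale_bias_naive(x, a, b), scale_bias_mad(x, b, ab))
```

The two forms are algebraically equal, but in floating point they can differ in the last bits, which is one reason a compiler will not usually make this change on its own.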
I know I learned.
As a final note, this returns to the discussions we had prior to the launch of the next generation consoles. Developers are learning how to make their shader code much more efficient on GCN and that could easily translate to leading PC titles. Especially with DirectX 12 and Mantle, which lighten the CPU-based bottlenecks, learning how to do more work per FLOP addresses the other side. Everyone was looking at Mantle as AMD's play for success through harnessing console mindshare (and in terms of Intel vs AMD, it might help). But honestly, I believe that it will be trends like this presentation which prove more significant… even if behind the scenes. Of course developers were always having these discussions, but now console developers will probably be talking about only one architecture – that is a lot of people talking about very few things.
This is not really reducing overhead; this is teaching people how to do more work with less, especially in situations (high resolutions with complex shaders) where the GPU is most relevant.
I liked the part about misuse of inverse trigonometric functions. I really think this needs to be covered better in high school level trig. We had matrix math, but aside from multiplication and *maybe* inversion, not much was covered in what you can *use* it for. You can scale a coordinate system, you can rotate it, you can do both at once for the same cost!
I ran into all of that by accident when I was reading a book on complex variables, where it was just mentioned in passing. Considering that I was ‘introduced’ to matrix math independently at least five times in the course of my education and never did they do much with it, matrix math is clearly more important than it is given credit for.
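The two ideas in the comment above, scaling and rotating with a single matrix and skipping inverse trigonometric functions, can be sketched in a few lines of Python. This is an illustrative sketch only; the helper names are hypothetical and it covers the 2D case.

```python
import math


def scale_rotate_matrix(scale, angle):
    # One 2x2 matrix that both rotates by `angle` (radians) and scales
    # uniformly by `scale`; applying it costs the same as a pure rotation.
    c, s = math.cos(angle), math.sin(angle)
    return ((scale * c, -scale * s),
            (scale * s, scale * c))


def apply(m, v):
    # Plain 2x2 matrix times 2D vector.
    return (m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1])


def cos_of_angle_via_trig(v):
    # Round trip through an inverse trig function: atan2, then cos.
    return math.cos(math.atan2(v[1], v[0]))


def cos_of_angle_direct(v):
    # Same quantity without any trig: x divided by the vector's length.
    return v[0] / math.hypot(v[0], v[1])


if __name__ == "__main__":
    m = scale_rotate_matrix(2.0, math.pi / 2)
    print(apply(m, (1.0, 0.0)))  # rotated a quarter turn and doubled in length
    print(cos_of_angle_via_trig((3.0, 4.0)), cos_of_angle_direct((3.0, 4.0)))
```

The direct version replaces two transcendental calls with a square root and a divide, which is the general shape of the fix whenever an angle is computed only to be fed straight back into `sin` or `cos`.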
Things will get more interesting when branch/control logic specialized for gaming begins to be completely incorporated into these compute/vector units, so that they can run their own functions without having to rely on the motherboard CPU for control. Expect a few general-purpose CPU cores to also be placed on die with the GPU, and gaming engines and gaming OSs to reside in the discrete card's/module's GDDR5 (or newer) memory. These systems will also come with large on-die RAMs to cache the most latency-sensitive OS/gaming engine code, and the GPU/CPU will share the wide GPU-style data bus and memory controller. This functionality will filter down from the HPC world, where GPUs are increasingly being used, and both AMD and Nvidia are producing products for the supercomputing market. The government exascale computing initiative is providing the development funds, and Nvidia is using JEDEC's High Bandwidth Memory (HBM) standard, and other IBM-initiated open IP standards. AMD will also be incorporating this IBM open IP, and its own version of HSA. Gradually these GPU vector processors will acquire all of the logic that the CPU has and these SoCs will become entire computing clusters on a single module/die. The open standards will be adopted by the entire industry, including the specialized gaming or compute OSs that will be hosted on these devices. It will become much more economical to etch out the ultra-wide data buses on a silicon substrate and place the entire system on an on-die/package module with stacked RAM and other high-speed components. This will take up less space and save on resources, as well as reducing trace length and latency. Strangely, some GPU innovations will also filter up from the mobile market, and not necessarily come from AMD or Nvidia, as the PowerVR mobile series SKU is getting hardware ray tracing.