Realworldtech with Compelling Evidence

RWT tests suggest a tiled rasterizer is used in current NVIDIA GPUs

Yesterday David Kanter of Realworldtech posted a fascinating article and video exploring the two latest NVIDIA architectures and how they have branched away from traditional immediate mode rasterization.  His testing indicates that with Maxwell and Pascal, NVIDIA has moved to a tiled approach to rasterization.  This is a significant departure for the company, considering it has used the same basic immediate mode rasterization model since the 90s.

The Videologic Apocalypse 3Dx based on the PowerVR PCX2.

(photo courtesy of Wikipedia)

Tiling is an interesting subject, and we can hark back to the PowerVR days to see one of its earliest consumer implementations.  Tiling and deferred rendering offer many advantages in overall power and memory bandwidth efficiency.  These first Tile Based Deferred Renderers (TBDR) offered great performance per clock and could use slower memory than other offerings of the day (namely Voodoo Graphics).  There were some significant drawbacks to the technology, though.  Essentially, a lot of work had to be done by the CPU and driver in scene setup and geometry sorting.  On systems with fast CPUs the PowerVR boards could provide very good performance, but they suffered on lower end systems compared to the competition.  This is a very simple explanation of what is going on, but the long and short of it is that TBDR did not take over the world due to limitations in its initial implementations.  Traditional immediate mode rasterizers went on to improve in efficiency and performance with aggressive Z checks and other optimizations borrowed from the TBDR playbook.
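To make the efficiency argument concrete, here is a minimal Python sketch that counts how many fragments get shaded at a single pixel under the two approaches.  It assumes opaque geometry and a simple "nearer wins" depth test; the function names are made up for illustration and do not describe any vendor's actual hardware.

```python
def shaded_immediate(depths):
    """Fragments shaded at one pixel when each is depth-tested on arrival (early Z)."""
    nearest = float("inf")
    shaded = 0
    for z in depths:
        if z < nearest:      # fragment passes the depth test, so it gets shaded
            nearest = z
            shaded += 1
    return shaded


def shaded_deferred(depths):
    """Fragments shaded when visibility for the tile is resolved before shading."""
    return 1 if depths else 0  # only the nearest opaque fragment is ever shaded


back_to_front = [0.9, 0.6, 0.3]   # worst case draw order for an early Z test
front_to_back = [0.3, 0.6, 0.9]   # best case draw order

print(shaded_immediate(back_to_front), shaded_deferred(back_to_front))
print(shaded_immediate(front_to_back), shaded_deferred(front_to_back))
```

The immediate mode count depends on draw order, while the deferred count does not.  That is the gap that aggressive Z checks on immediate mode parts were trying to close.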

Tiling is also present in a lot of mobile parts.  Imagination’s PowerVR graphics technologies have been used by companies such as Intel, Apple, and Mediatek.  Qualcomm (Adreno) and ARM (Mali) both implement tiler technologies to improve power consumption and performance while increasing bandwidth efficiency.  Perhaps most interestingly, we can remember back to the Gigapixel days and the GP-1 chip, which implemented a tiling method that seemed to work very well without the CPU hit and driver overhead that had plagued the PowerVR chips up to that point.  3dfx bought Gigapixel for some $150 million at the time.  3dfx then went on to file for bankruptcy a year later, and its IP was acquired by NVIDIA.

Screenshot of the program used to uncover the tiling behavior of the rasterizer.

It now appears as though NVIDIA has evolved its raster units to embrace tiling.  This is not a full TBDR implementation, but rather an immediate mode tiler that still breaks the scene up into tiles but does not implement deferred rendering.  This change should improve bandwidth efficiency when it comes to rasterization, but it does not force the rest of the graphics pipeline to be deferred (tessellation, geometry setup, shaders, etc. are not impacted).  NVIDIA has not done a deep dive on this change for editors, so we do not know the exact implementation or what advantages we can expect.  We can look at the evidence we have and speculate about where those advantages exist.
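As a rough illustration of what "breaking the scene up into tiles" means, the Python sketch below assigns triangles to screen tiles using their bounding boxes.  The 16x16 tile size and the `bin_triangles` helper are assumptions chosen for the example; NVIDIA has not disclosed its tile dimensions or binning logic, so this should not be read as a description of the real hardware.

```python
import math
from collections import defaultdict

# Hypothetical tile size in pixels; not a figure disclosed by NVIDIA.
TILE_W, TILE_H = 16, 16


def bin_triangles(triangles, width, height, tile_w=TILE_W, tile_h=TILE_H):
    """Map each tile (tx, ty) to the triangles whose bounding box overlaps it.

    Triangle indices are appended in submission order, so each tile still
    sees its geometry in the order the application drew it.
    """
    bins = defaultdict(list)
    max_tx = (width - 1) // tile_w
    max_ty = (height - 1) // tile_h
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Skip triangles whose bounding box is entirely off screen.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        tx0 = max(math.floor(min(xs)) // tile_w, 0)
        tx1 = min(math.floor(max(xs)) // tile_w, max_tx)
        ty0 = max(math.floor(min(ys)) // tile_h, 0)
        ty1 = min(math.floor(max(ys)) // tile_h, max_ty)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


triangles = [
    [(2, 2), (10, 2), (2, 10)],      # fits inside the top-left tile
    [(10, 10), (40, 10), (10, 40)],  # bounding box spans a 3x3 block of tiles
]
bins = bin_triangles(triangles, width=64, height=64)
print(sorted(bins.items()))
```

Bounding box binning is conservative: a tile may list a triangle that does not actually cover any of its pixels.  It is used here only because it keeps the sketch short.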

The video where David Kanter explains his findings

Bandwidth and Power

Tilers have typically taken the tiled regions and buffered them on the chip.  This is a big improvement in both performance and power efficiency, as the raster data does not have to be repeatedly written out to the frame buffer in external memory and then read back.  This makes quite a bit of sense considering the overall lack of big jumps in memory technologies over the past five years.  We have had GDDR-5 since 2007/2008.  The speeds have increased over time, but the basic technology is still much the same.  We have seen HBM introduced with AMD’s Fury series, but large scale production of HBM 2 is still to come.  Samsung has released small amounts of HBM 2 to the market, but not nearly enough to handle the needs of a mass produced card.  GDDR-5X is an extension of GDDR-5 that does offer more bandwidth, but it is still not a next generation memory technology like HBM 2.
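A back-of-the-envelope estimate shows why keeping tiles on chip matters.  The Python sketch below compares color writes to external memory for one frame in an idealized case.  The overdraw factor of 3, the 4 bytes per pixel, and the `color_write_bytes` helper are all assumptions for illustration.  It ignores depth traffic, blending reads, caches, and framebuffer compression, and it assumes each tile is flushed exactly once, which an immediate mode tiler may not manage in practice.

```python
def color_write_bytes(width, height, overdraw, bytes_per_pixel=4, on_chip_tiles=False):
    """Idealized color-write traffic to external memory for a single frame."""
    pixels = width * height
    if on_chip_tiles:
        # Overdraw is absorbed in the on-chip tile buffer; each pixel's final
        # color is written to memory once when the tile is flushed.
        return pixels * bytes_per_pixel
    # Without tile buffering, every fragment that lands on a pixel is written out.
    return pixels * overdraw * bytes_per_pixel


immediate = color_write_bytes(1920, 1080, overdraw=3)
tiled = color_write_bytes(1920, 1080, overdraw=3, on_chip_tiles=True)
print(immediate, tiled, immediate / tiled)
```

Under these assumptions the saving scales directly with overdraw, so scenes with many overlapping layers would benefit the most.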

By utilizing a tiler, NVIDIA is able to lower memory bandwidth needs for the rasterization stage.  Considering that both the Maxwell and Pascal architectures are based on GDDR-5 and GDDR-5X technologies, it makes sense to save as much bandwidth as possible wherever they can.  This is probably one of the many reasons that we saw a much larger L2 cache in Maxwell vs. Kepler (2048 KB vs. 256 KB respectively).  Every little bit helps when we are looking at hard, real world bandwidth limits for a modern GPU.

Power efficiency has also come up in discussions about moving to a tiler.  Tilers have traditionally been more power efficient due to how the raster data is tiled and cached, requiring fewer reads and writes to main memory.  The first impulse is to say, “Hey, this is the reason why NVIDIA’s Maxwell was so much more power efficient than Kepler and AMD’s latest parts!”  Sadly, this is not exactly true.  The tiler is more power efficient, but it is a small part of the power savings on a GPU.

The second fastest Pascal based card…

A modern GPU is very complex.  There are some 7.2 billion transistors on the latest Pascal GP104 that powers the GTX 1080.  The vast majority of those transistors are implemented in the shader units of the chip.  While the raster units are very important, they are but a fraction of that transistor budget.  Much of the rest is taken up by power regulation, PCI-E controllers, and memory controllers.  In the big scheme of things the raster portion is going to be dwarfed in power consumption by the shader units.  This does not mean that the raster units are unimportant, though.  Going back to the hated car analogy, one does not achieve weight savings by focusing on one aspect alone.  It takes going over every single part of the car and shaving ounces here and there, and in the end achieving significant savings by addressing every single piece of a complex product.

This does appear to be the long and short of it.  This is one piece of a very complex ASIC that improves memory bandwidth utilization and power efficiency.  It is not the whole story, but it is an important part.  I find it interesting that NVIDIA did not disclose this change to editors with the introduction of Maxwell and Pascal, but if it is transparent to users and developers alike then there is no need.  There is a lot of “secret sauce” that goes into each architecture, and this is merely one aspect.  The one question that I do have is how much of the technology is based upon the Gigapixel IP that 3dfx bought at such a premium.  I believe that particular tiler was an immediate mode renderer as well, since it did not have as many of the driver and overhead issues that PowerVR exhibited back in the day.  Obviously it would not be a copy/paste of technology developed back in the 90s, but it would be interesting to see if it was the basis for this current implementation.