Tiler Exposed in Maxwell/Pascal Architectures

Manufacturer: Realworldtech

Realworldtech with Compelling Evidence

Yesterday David Kanter of Realworldtech posted a fascinating article and video exploring the two latest NVIDIA architectures and how they have branched away from traditional immediate mode rasterization.  His testing reveals that with Maxwell and Pascal, NVIDIA has moved to a tiled approach to rasterization.  This is a significant departure for the company, considering it has used the same basic immediate mode rasterization model since the 90s.

The VideoLogic Apocalypse 3Dx based on the PowerVR PCX2.

(photo courtesy of Wikipedia)

Tiling is an interesting subject, and we can hark back to the PowerVR days to see where it was first implemented.  There are many advantages to tiling and deferred rendering when it comes to overall efficiency in power and memory bandwidth.  These first TBDRs (Tile Based Deferred Renderers) offered great performance per clock and could utilize slower memory than other offerings of the day (namely Voodoo Graphics).  There were some significant drawbacks to the technology, however.  Essentially, a lot of work had to be done by the CPU and driver in scene setup and geometry sorting.  On systems with fast CPUs the PowerVR boards could provide very good performance, but they suffered on lower end CPUs as compared to the competition.  This is a very simple explanation of what is going on, but the long and short of it is that TBDR did not take over the world due to limitations in its initial implementations.  Traditional immediate mode rasterizers would go on to improve in efficiency and performance with aggressive Z checks and other optimizations that borrow from the TBDR playbook.

Tiling is also present in a lot of mobile parts.  Imagination’s PowerVR graphics technologies have been implemented by Intel, Apple, Mediatek, and others.  Qualcomm (Adreno) and ARM (Mali) both implement tiler technologies to improve power consumption and performance while increasing bandwidth efficiency.  Perhaps most interestingly, we can remember back to the Gigapixel days with the GP-1 chip, which implemented a tiling method that seemed to work very well without the CPU hit and driver overhead that had plagued the PowerVR chips up to that point.  3dfx bought Gigapixel for some $150 million at the time.  3dfx then went on to file for bankruptcy a year later and their IP was acquired by NVIDIA.

Screenshot of the program used to uncover the tiling behavior of the rasterizer.

It now appears as though NVIDIA has evolved their raster units to embrace tiling.  This is not a full TBDR implementation, but rather an immediate mode tiler that still breaks the scene up into tiles but does not implement deferred rendering.  This change should improve bandwidth efficiency when it comes to rasterization, but it does not force the rest of the graphics pipeline to be deferred (tessellation, geometry setup, shaders, etc. are not impacted).  NVIDIA has not done a deep dive on this change for editors, so we do not know the exact implementation and what advantages we can expect.  We can look at the evidence we have and speculate where those advantages exist.
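
To make the difference in ordering concrete, here is a minimal Python sketch of the two approaches.  It is a toy model under stated assumptions and not NVIDIA's implementation, which has not been disclosed: primitives are axis-aligned rectangles rather than triangles, the 4x4 tile size is arbitrary, and the function names immediate_order and tiled_immediate_order are made up for illustration.  Both functions touch exactly the same pixels; only the order in which they are visited differs.

```python
from typing import Iterator, List, Tuple

Rect = Tuple[int, int, int, int]  # x0, y0, x1, y1 (upper bounds exclusive)
Pixel = Tuple[int, int, int]      # (primitive index, x, y)


def immediate_order(prims: List[Rect]) -> Iterator[Pixel]:
    """Classic immediate mode: finish each primitive before starting the next."""
    for i, (x0, y0, x1, y1) in enumerate(prims):
        for y in range(y0, y1):
            for x in range(x0, x1):
                yield (i, x, y)


def tiled_immediate_order(prims: List[Rect], width: int, height: int,
                          tile: int = 4) -> Iterator[Pixel]:
    """Toy tiled immediate mode: walk the screen tile by tile and, inside
    each tile, rasterize the covered part of every primitive in submission
    order.  The tile size here is an arbitrary assumption."""
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for i, (x0, y0, x1, y1) in enumerate(prims):
                # Clip the primitive to the current tile.
                for y in range(max(y0, ty), min(y1, ty + tile)):
                    for x in range(max(x0, tx), min(x1, tx + tile)):
                        yield (i, x, y)
```

With an 8x8 rectangle and a smaller rectangle drawn on top of it, the immediate order finishes every pixel of the first rectangle before touching the second, while the tiled order finishes the first tile, overlap included, before moving on.  That grouping of work by screen region is what allows a tile to stay in on-chip storage while it is being worked on.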

The video where David Kanter explains his findings


Bandwidth and Power

Tilers have typically taken the tiled regions and buffered them on the chip.  This is a big improvement in both performance and power efficiency as the raster data does not have to be cached and written out to the frame buffer and then swapped back.  This makes quite a bit of sense considering the overall lack of big jumps in memory technologies over the past five years.  We have had GDDR-5 since 2007/2008.  The speeds have increased over time, but the basic technology is still much the same.  We have seen HBM introduced with AMD’s Fury series, but large scale production of HBM 2 is still to come.  Samsung has released small amounts of HBM 2 to the market, but not nearly enough to handle the needs of a mass produced card.  GDDR-5X is an extension of GDDR-5 that does offer more bandwidth, but it is still not a next generation memory technology like HBM 2.

By utilizing a tiler, NVIDIA is able to lower memory bandwidth needs for the rasterization stage.  Considering that both the Maxwell and Pascal architectures are based on GDDR-5 and GDDR-5X technologies, it makes sense to save as much bandwidth as possible where they can.  This is probably one of the many reasons that we saw a much larger L2 cache in Maxwell vs. Kepler (2048 KB vs. 256 KB respectively).  Every little bit helps when we are looking at hard, real world bandwidth limits for a modern GPU.
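
As a rough back-of-the-envelope illustration of why on-chip tile buffering matters, the sketch below estimates color write traffic to memory.  Every number in it is an assumption chosen for illustration (1920x1080, 4 bytes per pixel, 60 frames per second, an average overdraw of 3), and the model deliberately ignores caches, color compression, early Z rejection, blending reads, and depth traffic, all of which matter on real hardware.  The function name color_write_bytes_per_second is hypothetical.

```python
def color_write_bytes_per_second(width, height, bytes_per_pixel, fps,
                                 overdraw, buffered_on_chip):
    """Estimate color data written to external memory per second.

    Toy model: without on-chip buffering, every shaded layer of every pixel
    is written out to memory.  With on-chip buffering, overdraw is resolved
    in the tile buffer and each pixel is written out once per frame.
    """
    writes_per_pixel = 1.0 if buffered_on_chip else float(overdraw)
    return width * height * bytes_per_pixel * fps * writes_per_pixel


# Assumed example workload, not a measurement.
WIDTH, HEIGHT, BYTES_PER_PIXEL, FPS, OVERDRAW = 1920, 1080, 4, 60, 3.0

unbuffered = color_write_bytes_per_second(
    WIDTH, HEIGHT, BYTES_PER_PIXEL, FPS, OVERDRAW, buffered_on_chip=False)
buffered = color_write_bytes_per_second(
    WIDTH, HEIGHT, BYTES_PER_PIXEL, FPS, OVERDRAW, buffered_on_chip=True)

print(f"written straight to memory: {unbuffered / 1e9:.2f} GB/s")
print(f"buffered in on-chip tiles:  {buffered / 1e9:.2f} GB/s")
```

Under these assumptions the savings scale directly with overdraw.  An immediate mode tiler that only buffers a limited batch of geometry at a time would presumably land somewhere between the two figures.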

The area of power efficiency has also come up in discussion when going to a tiler.  Tilers have traditionally been more power efficient because of how the raster data is tiled and cached, requiring fewer reads and writes to main memory.  The first impulse is to say, “Hey, this is the reason why NVIDIA’s Maxwell was so much more power efficient than Kepler and AMD’s latest parts!”  Sadly, this is not exactly true.  The tiler is more power efficient, but it is a small part of the power savings on a GPU.

The second fastest Pascal based card...

A modern GPU is very complex.  There are some 7.2 billion transistors on the latest Pascal GP-104 that powers the GTX 1080.  The vast majority of those transistors are implemented in the shader units of the chip.  While the raster units are very important, they are but a fraction of that transistor budget.  The rest is taken up by power regulation, PCI-E controllers, and memory controllers.  In the big scheme of things the raster portion is going to be dwarfed in power consumption by the shader units.  This does not mean that they are not important though.  Going back to the hated car analogy, one does not achieve weight savings by focusing on one aspect alone.  It is going over every single part of the car and shaving ounces here and there, and in the end achieving significant savings by addressing every single piece of a complex product.

This does appear to be the long and short of it.  This is one piece of a very complex ASIC that improves upon memory bandwidth utilization and power efficiency.  It is not the whole story, but it is an important part.  I find it interesting that NVIDIA did not disclose this change to editors with the introduction of Maxwell and Pascal, but if it is transparent to users and developers alike then there is no need.  There is a lot of “secret sauce” that goes into each architecture, and this is merely one aspect.  The one question that I do have is how much of the technology is based upon the Gigapixel IP that 3dfx bought at such a premium.  I believe that particular tiler was an immediate mode renderer as well, since it did not have as many of the driver and overhead issues that PowerVR exhibited back in the day.  Obviously it would not be a copy/paste of the technology that was developed back in the 90s, but it would be interesting to see if it was the basis for this current implementation.

August 2, 2016 | 06:40 PM - Posted by JohnGR

Funny. One of GTX 970's wrong specs was the cache on the chip being 1.75MB instead of 2MB. I believe no one was giving any attention to that, but maybe it is also affecting this card's performance compared to GTX 980.

Anyway, having seen many talking about Nvidia cheating again, I guess this time there is no cheating here. Just a good job done by Nvidia, because I haven't seen any articles about lower visual quality on Maxwell/Pascal cards. Only with GTX 970 in Doom, but that was probably a different story.

August 3, 2016 | 10:05 PM - Posted by flippityfloppit...

C'mon, tell us how you really feel.

August 2, 2016 | 07:54 PM - Posted by Toysrme (not verified)

I've been saying for now modern desktop GPU's needed to license Tile Based Rendering patents from Imagination Technologies (ST Micro and the Kyro I, II, III cards started it and IT's PowerVR mobile GPU's use it!)

TBR is what let my $150usd Kyro-II card with 2 year old TNT 2 Ultra hardware specs perform and routinely beat the $350usd Geforce 2 GTS; Even sometimes the $500usd Geforce 2 Ultra!!!

The obvious downside was no (then modern) T&L engine to offload from the CPU. But at the time reasonably fast CPU's would make up for it.

You can still see the effects it brought in Anandtech's Kyro-II review from March, 2001!

August 3, 2016 | 01:32 AM - Posted by renz (not verified)

probably pretty much everyone have the IP for TBR (intel, ARM, Qualcomm, Nvidia). so there is no need to license them from Imagination. not sure about AMD though.

August 2, 2016 | 10:06 PM - Posted by Anonymous (not verified)

What was the exact make and model of the AMD card that Kanter used in his testing? And what about AMD's primitive discard accelerator on its Polaris GPU micro-architecture. Also how AMD’s GPU draws the final image on the screen and how things work inside AMD’s GPUs may be totally different.

Too many online press outlets are carrying this video without some fact checking and background on just what GPU SKUs were tested. There is great argument even in Kanter's blog (see the posts) about Kanter's conclusions. Kanter needs to provide more background for the test/GPUs used in a readable format.

Any of the websites that have picked up on this story need to get the code and run some tests of more and newer GCN GPUs, and at least do some more fact checking because too many questions remain.

August 2, 2016 | 11:38 PM - Posted by Anonymous (not verified)

Fact check on something that makes Nvidia look good BLASPHEMY!!!

This is PCPERSPECTIVE be gone with your common sense approach. BIOS or BUST!!!

August 2, 2016 | 11:41 PM - Posted by Anonymous (not verified)

Ryan and company haven't fact checked Asynchronous Compute on Maxwell let alone followed up on it in Pascal when Tom Petersen refused to talk about it.

Why ruin their long running streak.

August 3, 2016 | 12:20 AM - Posted by Josh Walrath

Have not had a chance yet to reproduce Kanter's results.  I don't have a RX 480 to do any comparisons.  Ryan is finishing up the new Titan X and packing for QuakeCon.

But I gotta ask this question... how is covering this tech showing NV in a positive light?  This is merely an interesting development that NV put into their latest GPUs, but in the end until we get full disclosure from NV about the exact architectural changes this is merely a good theory on observed behavior.  This does not make NV any better or worse than AMD with this design, just that it is different from what we have seen previously.  There could very well be drawbacks to the design that they do not want known (like the GTX 970) and may not actually improve performance over their previous design.  As mentioned, Polaris also has some improvements on their end, but they do not look to use a tiler like NV.  This is neither good nor bad, it is just different.

In the end, more information is better.

Kanter used an older 6000 series GPU, so obviously there have been improvements... but the test is not to measure performance.  It is used simply as a point of information about NVIDIA changing around their rasterization techniques to utilize a more tiler based solution that does more on-chip caching.  I don't think anywhere I commented along the lines of, "AMD is behind here, they are being outclassed by NV."  This is about breaking down the architecture of two GPU generations that have had an interesting change from the past 20 years of design for NV.

August 3, 2016 | 02:12 AM - Posted by Anonymous (not verified)

We all know that Nvidia's Maxwell has a design based more towards the mobile market, as Nvidia stated that in Maxwell’s marketing material when Maxwell was first introduced/released. And it's not about showing Nvidia's already known power usage metrics from Maxwell that were better than AMD's power usage metrics. But really Kanter could not be bothered to even list in HTML/text format what AMD GPU he was using in the form of using the full model number and the generation of micro-architecture the AMD GPU was based on. Even Kanter in his blog admits that the video was put together quickly. AMD’s GPUs available in the same time frame as Maxwell and Pascal should have been tested or Kanter should have explained why he was using a 6000 series TeraScale AMD GPU and not any GCN SKUs just to see. It’s that omission from Kanter that is most suspicious and the many websites that referenced Kanter’s article should have listed that fact and not just only referenced the Kanter video.

I look for any omissions of information as suspect and listing just a video, without any readable hardware specifications of GPUs used for folks to see without having to go over the entire video to get at the information would have been a more proper way of doing this. There will be a need for more fact checking of any newer AMD GPU SKUs to remove any suspicions that many are expressing, and both AMD and Nvidia need to supply more detailed whitepapers on their respective new hardware. Any articles concerning GPU “tiling” and rendering technologies need sufficient background checking especially in light of the new technologies that are out there from both AMD and Nvidia.
Kanter should have mentioned any new features from AMD’s Polaris micro-architecture as well as the same for Nvidia’s Pascal. AMD's Polaris, and previous GCN micro-architectures, GPUs were not tested.

Even the 6000 series number is not enough as TeraScale 2(HD5000-HD6870), or TeraScale 3(HD6930-to whatever), can not be determined, even among the rebrands there is some overlapping of numbers between TeraScale 3 and 2 based SKUs rebrand marketing numbers of AMD’s GPUs for more confusion. Kanter’s video only expresses Kanter’s theory and without proper fact checking more research is necessary to verify that theory’s validity. And still I’ll stress that there should have been tests using the Maxwell generation’s nearest AMD GCN GPU SKU to the actual Nvidia SKUs used in the video. Kanter’s article is very interesting none the less, but is still not complete enough because it uses a very obsolete AMD technology and does not account for any of Kanter’s reasoning as to why he chose such an old AMD technology.

AMD’s SKUs were always more used for some of AMD’s extra compute in AMD’s processors so there is bound to be more power used if there are more 32 bit, or 64 bit, ALUs and other extra hardware features now being made use of by Vulkan/DX12 optimized games(for GCN based SKUs). AMD’s GPUs were popular among the bitcoin miners before the dedicated Bitcoin ASICs were available and those AMD GPU supplies shortages happened around the same time that the Maxwell competing products were on the market. Nvidia tunes its consumer SKUs more towards gaming only workloads while AMD’s consumer SKUs have always had a little more compute resources available for other compute usage as well as gaming. In the end it's more about the other websites' usage of Kanter’s video without any necessary background and information omissions that is the main issue, and some websites are always pushing the AMD and Nvidia competition.

August 3, 2016 | 04:29 AM - Posted by kenjo

It's not important what AMD chip was used; he used it as an example of what is normally done.

He could have used an earlier nvidia part or an intel part it was not important so he probably only used what he had available.

What he did was showing that nvidia has started to use some tile based algorithms and he tried to explain the difference.

The information given was more than enough for anyone to repeat the test themselves if they want to.

August 3, 2016 | 11:10 AM - Posted by Anonymous (not verified)

And that retesting needs to be done and published by as many websites that published links to Kanter's video. Or there will be more mistrust of the online news sources. There are probably some benchmarking software packages that have similar code to test a GPU's cache and memory subsystems and the handling of caching/tiling on GPUs, there are plenty for CPUs that test such caching things. There will be plenty of user testing also, but that video needs to have some testing information published in readable form with the associated testing rig info, and driver info, and GPU's do have some micro-code loaded from firmware abilities for some of their GPU functional blocks/units, so firmware, and driver versions are needed. Proper testing methods transparency is needed at all times with any published testing online.

He should have used earlier Nvidia parts, and newer AMD parts and made a more academic/peer review focused sort of presentation. So maybe in the future he will look at more generations of both Nvidia's and AMD's GPUs with the same software to look at the evolutions in GPU design. Tiled rendering is not a new GPU technology by any means, but some implementations may be new and have some potential and some drawbacks. Kanter should have mentioned ARM Holdings Mali/Bifrost GPU micro-architecture as well as any of the newer/latest GPU micro-architectures from the mobile and PC market to provide more background for the video presentation.

The press is also ignoring Nvidia’s Denver custom ARMv8A running micro-architecture as well as the entire custom ARM market that only licenses the ARMv8A ISA from ARM Holdings with each respective custom ARMv8A licensee creating their own custom micro-architecture that is engineered to run the ARMv8A ISA. Nvidia’s Denver and AMD’s K12 need to be followed more closely. Nvidia will be at Hot Chips this year with its latest Tegra Denver/Pascal design.

August 3, 2016 | 01:00 PM - Posted by kenjo

What on earth are you going on about ?

It was not a benchmark. It was a small test program used to show a small detail in how graphics was rendered on newer nvidia parts. That is all it was.

August 4, 2016 | 09:36 PM - Posted by David Kanter (not verified)

1. Unlike Ryan, I don't have many GPUs lying around. But others have now run those tests.

2. I said what GPUs I used, I showed it in the video. Maybe you should watch the whole thing before you cast aspersions? It's a 6670.


August 5, 2016 | 04:10 PM - Posted by pdjblum

Thanks Josh. Well said.

August 4, 2016 | 09:45 PM - Posted by David Kanter (not verified)

FWIW, I now named the GPUs in the text.
