Realworldtech with Compelling Evidence
Yesterday David Kanter of Realworldtech posted a fascinating article and video exploring the two latest NVIDIA architectures and how they have branched away from traditional immediate mode rasterization. His testing revealed that with Maxwell and Pascal, NVIDIA has moved to a tiled approach to rasterization. This is a fairly significant departure for the company, considering it has used the same basic immediate mode rasterization model since the 90s.
The Videologic Apocalypse 3Dx based on the PowerVR PCX2.
(photo courtesy of Wikipedia)
Tiling is an interesting subject and we can harken back to the PowerVR days to see where it was first implemented. There are many advantages to tiling and deferred rendering when it comes to overall efficiency in power and memory bandwidth. These first TBDRs (Tile Based Deferred Renderers) offered great performance per clock and could utilize slower memory than other offerings of the day (namely Voodoo Graphics). There were some significant drawbacks to the technology, however. Essentially, a lot of work had to be done by the CPU and driver in scene setup and geometry sorting. On fast CPU systems the PowerVR boards could provide very good performance, but they suffered on lower end CPUs as compared to the competition. This is a very simple explanation of what is going on, but the long and short of it is that TBDR did not take over the world due to limitations in its initial implementations. Traditional immediate mode rasterizers would improve in efficiency and performance with aggressive Z checks and other optimizations borrowed from the TBDR playbook.
Tiling is also present in a lot of mobile parts. Imagination’s PowerVR graphics technologies have been implemented by companies such as Intel, Apple, and Mediatek. Qualcomm (Adreno) and ARM (Mali) both implement tilers to improve power consumption and performance while increasing bandwidth efficiency. Perhaps most interestingly, we can remember back to the Gigapixel days and the GP-1 chip, which implemented a tiling method that seemed to work very well without the CPU hit and driver overhead that had plagued the PowerVR chips up to that point. 3dfx bought Gigapixel for some $150 million at the time. 3dfx then went on to file for bankruptcy a year later and its IP was acquired by NVIDIA.
Screenshot of the program used to uncover the tiling behavior of the rasterizer.
It now appears as though NVIDIA has evolved its raster units to embrace tiling. This is not a full TBDR implementation, but rather an immediate mode tiler that still breaks the scene up into tiles but does not implement deferred rendering. This change should improve bandwidth efficiency when it comes to rasterization, but it does not force the rest of the graphics pipeline to be deferred (tessellation, geometry setup, shaders, etc. are not impacted). NVIDIA has not done a deep dive on this change for editors, so we do not know the exact implementation or what advantages we can expect. We can look at the evidence we have and speculate where those advantages exist.
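To make the idea of breaking a scene into tiles a little more concrete, here is a minimal Python sketch that assigns triangles to screen tiles by bounding box. This is only an illustration of the general concept: the tile size, the function name `bin_triangles`, and the data layout are all assumptions, and none of it reflects how NVIDIA's hardware actually works, since those details have not been disclosed.

```python
from collections import defaultdict

TILE = 16  # hypothetical tile edge in pixels; real hardware tile sizes are not public


def bin_triangles(triangles, width, height, tile=TILE):
    """Map each (tx, ty) tile to the indices of triangles whose bounding box touches it."""
    bins = defaultdict(list)
    max_tx = (width - 1) // tile
    max_ty = (height - 1) // tile
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Skip triangles that lie entirely off screen.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        # Clamp the bounding box to the screen, then convert to tile coordinates.
        tx0 = max(int(min(xs)) // tile, 0)
        ty0 = max(int(min(ys)) // tile, 0)
        tx1 = min(int(max(xs)) // tile, max_tx)
        ty1 = min(int(max(ys)) // tile, max_ty)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


# Two triangles on a 64x64 screen: one small, one spanning four tiles.
triangles = [
    [(2, 2), (10, 3), (5, 12)],
    [(10, 10), (30, 12), (20, 28)],
]
bins = bin_triangles(triangles, 64, 64)
```

A bounding box test is conservative, so a tile can list a triangle that does not actually cover any of its pixels. The point of grouping work this way is that each tile's pixels can be kept in fast on-chip storage while the triangles touching it are rasterized.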
The video where David Kanter explains his findings
Bandwidth and Power
Tilers have typically taken the tiled regions and buffered them on the chip. This is a big improvement in both performance and power efficiency as the raster data does not have to be cached and written out to the frame buffer and then swapped back. This makes quite a bit of sense considering the overall lack of big jumps in memory technologies over the past five years. We have had GDDR-5 since 2007/2008. The speeds have increased over time, but the basic technology is still much the same. We have seen HBM introduced with AMD’s Fury series, but large scale production of HBM 2 is still to come. Samsung has released small amounts of HBM 2 to the market, but not nearly enough to handle the needs of a mass produced card. GDDR-5X is an extension of GDDR-5 that does offer more bandwidth, but it is still not a next generation memory technology like HBM 2.
By utilizing a tiler NVIDIA is able to lower memory bandwidth needs for the rasterization stage. Considering that both the Maxwell and Pascal architectures are based on GDDR-5 and GDDR-5X technologies, it makes sense to save as much bandwidth as possible. This is probably one of the many reasons that we saw a much larger L2 cache in Maxwell vs. Kepler (2048 KB vs. 256 KB respectively). Every little bit helps when we are looking at hard, real world bandwidth limits for a modern GPU.
The area of power efficiency has also come up in discussion when going to a tiler. Tilers have traditionally been more power efficient as well due to how the raster data is tiled and cached, requiring fewer reads and writes to main memory. The first impulse is to say, “Hey, this is the reason why NVIDIA’s Maxwell was so much more power efficient than Kepler and AMD’s latest parts!” Sadly, this is not exactly true. The tiler is more power efficient, but it is a small part of the power savings on a GPU.
The second fastest Pascal based card...
A modern GPU is very complex. There are some 7.2 billion transistors on the latest Pascal GP104 that powers the GTX 1080. The vast majority of those transistors are implemented in the shader units of the chip. While the raster units are very important, they are but a fraction of that transistor budget. The rest is taken up by power regulation, PCI-E controllers, and memory controllers. In the big scheme of things the raster portion is going to be dwarfed in power consumption by the shader units. This does not mean that the raster units are not important though. Going back to the hated car analogy, one does not achieve weight savings by focusing on one aspect alone. It takes going over every single part of the car and shaving ounces here and there, and in the end achieving significant savings by addressing every single piece of a complex product.
This does appear to be the long and short of it. This is one piece of a very complex ASIC that improves upon memory bandwidth utilization and power efficiency. It is not the whole story, but it is an important part. I find it interesting that NVIDIA did not disclose this change to editors with the introduction of Maxwell and Pascal, but if it is transparent to users and developers alike then there is no need. There is a lot of “secret sauce” that goes into each architecture, and this is merely one aspect. The one question that I do have is how much of the technology is based upon the Gigapixel IP that 3dfx bought at such a premium? I believe that particular tiler was an immediate mode renderer as well, as it did not have as many of the driver and overhead issues that PowerVR exhibited back in the day. Obviously it would not be a copy/paste of the technology that was developed back in the 90s, but it would be interesting to see if it was the basis for this current implementation.
Subject: Graphics Cards | August 1, 2016 - 03:39 PM | Sebastian Peak
Tagged: pascal, nvidia, notebooks, mobile gpu, mobile gaming, laptops, GTX 1080M, GTX 1070M, GTX 1060M, discrete gpu
VideoCardz is reporting that an official announcement of the rumored mobile GPUs might be coming at Gamescom later this month.
"Mobile Pascal may arrive at Gamescom in Europe. According to DigiTimes, NVIDIA would allow its notebook partners to unveil mobile Pascal between August 17th to 21st, so just when Gamescom is hosted in Germany."
We had previously reported on the rumors of a mobile GTX 1070 and 1060, and we can only assume a 1080 will also be available (though VideoCardz is not speculating on the specs of this high-end mobile card just yet).
Rumored NVIDIA Mobile Pascal GPU specs (Image credit: VideoCardz)
Gamescom runs from August 17 - 21 in Germany, so we only have to wait about three weeks to know for sure.
Subject: Graphics Cards, Systems, Mobile | July 27, 2016 - 07:58 PM | Scott Michaud
Tagged: nvidia, Nintendo, nintendo nx, tegra, Tegra X1, tegra x2, pascal, maxwell
Okay so there are a few rumors going around, mostly from Eurogamer / DigitalFoundry, that claim the Nintendo NX is going to be powered by an NVIDIA Tegra system on a chip (SoC). DigitalFoundry, specifically, cites multiple sources who claim that their Nintendo NX development kits integrate the Tegra X1 design, as seen in the Google Pixel C. That said, the Nintendo NX release date, March 2017, does provide enough time for them to switch to NVIDIA's upcoming Pascal Tegra design, rumored to be called the Tegra X2, which uses NVIDIA's custom-designed Denver CPU cores.
Preamble aside, here's what I think about the whole situation.
First, the Tegra X1 would be quite a small jump in performance over the WiiU. The WiiU's GPU, “Latte”, has 320 shaders clocked at 550 MHz, and it was based on AMD's TeraScale 1 architecture. Because these stream processors have single-cycle multiply-add for floating point values, you can get its FLOP rating by multiplying 320 shaders, 550,000,000 cycles per second, and 2 operations per clock (one multiply and one add). This yields 352 GFLOPs. The Tegra X1 is rated at 512 GFLOPs, which is just 45% more than the previous generation.
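The arithmetic above can be checked with a few lines of Python. This is just a sketch of the calculation described in the paragraph; the function name `peak_gflops` is made up for illustration, and the 512 GFLOPs figure for the Tegra X1 is the rating quoted above rather than something derived here.

```python
def peak_gflops(shaders, clock_mhz, ops_per_clock=2):
    """Peak throughput in GFLOPs: shaders x cycles per second x operations per clock."""
    return shaders * clock_mhz * 1e6 * ops_per_clock / 1e9


# WiiU "Latte": 320 shaders at 550 MHz, one multiply and one add per clock.
wiiu = peak_gflops(320, 550)

# Tegra X1 rating as quoted in the text (not computed from shader counts).
tegra_x1 = 512.0

# Percentage improvement of the Tegra X1 over the WiiU.
uplift = (tegra_x1 / wiiu - 1) * 100

print(f"WiiU: {wiiu:.0f} GFLOPs, Tegra X1 uplift: {uplift:.0f}%")
```

Keep in mind that peak FLOPs only describe theoretical shader throughput, so this says nothing about memory bandwidth or how well either architecture keeps its shaders fed.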
This is a very tiny jump, unless they indeed use Pascal-based graphics. If this is the case, you will likely see a launch selection of games ported from WiiU and a few games that use whatever new feature Nintendo has. One rumor is that the console will be kind-of like the WiiU controller, with detachable controllers. If this is true, it's a bit unclear how this will affect games in a revolutionary way, but we might be missing a key bit of info that ties it all together.
As for the choice of ARM over x86... well. First, this obviously allows Nintendo to choose from a wider selection of manufacturers than AMD, Intel, and VIA, and certainly more than IBM with their previous, Power-based chips. That said, it also jibes with Nintendo's interest in the mobile market. They joined The Khronos Group and I'm pretty sure they've said they are interested in Vulkan, which is becoming the high-end graphics API for Android, supported by Google and others. That said, I'm not sure how many engineers exist that specialize in ARM optimization, as most mobile platforms try to abstract this as much as possible, but this could be Nintendo's attempt to settle on a standardized instruction set, and they opted for mobile over PC (versus Sony and especially Microsoft, who want consoles to follow high-end gaming on the desktop).
Why? Well that would just be speculating on speculation about speculation. I'll stop here.
Subject: Graphics Cards | July 26, 2016 - 12:36 AM | Tim Verry
Tagged: windforce, pascal, gigabyte, GeForce GTX 1060
In a recent press release, Gigabyte announced that it will soon be adding four new GTX 1060 graphics cards to its lineup. The new cards feature Windforce series coolers and custom PCBs. At the high end is the GTX 1060 G1 Gaming followed by the GTX 1060 Windforce OC, small form factor friendly GTX 1060 Mini ITX OC, and the budget minded GTX 1060 D5. While the company has yet to divulge pricing or availability, the cards should be out within the next month or two.
All of the upcoming cards use a custom PCB and power phase setup paired with Gigabyte's dual fan (or, in the case of the Mini ITX card, single fan) Windforce air cooler. Unfortunately, exact specifications, including core and memory clocks, are unknown for all of the cards except the high end model. The coolers use a dual composite heatpipe that directly touches the GPU to pull heat away, which is then dissipated by an aluminum fin stack. The fans are 90mm on all of the cards, with the dual fan models using a design that has each fan spinning in the opposite direction of the other. The cards feature 6GB of GDDR5 memory as well as DVI, HDMI, and DisplayPort video outputs. For example, the Mini ITX OC graphics card (which is only 17cm long) features two DVI, one HDMI, and one DP output.
More information is available on the GTX 1060 G1 Gaming. This card is a dual slot dual fan design with a 6+1 power phase (reference is 3+1) powered by a single 8-pin power connector. The fans are shrouded and there is a metal backplate to aid in stability and cooling. Gigabyte claims that its "GPU Gauntlet" technology ensures users get heavily overclockable chips thanks to sorting and using the most promising chips.
The 16nm Pascal GPU is factory overclocked to 1847 MHz boost and 1620 MHz base clockspeeds in OC mode and 1809 MHz boost and 1594 MHz base in gaming mode. Users will be able to use the company's Xtreme Engine software to dial up the overclocks further as well as mess with the RGB LEDs. For comparison, the reference clockspeeds are 1708 MHz boost and 1506 MHz base. Gigabyte has left the 6GB of GDDR5 memory untouched at 8008 MHz.
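To put those factory overclocks in perspective, the short Python sketch below works out how far each mode sits above the reference clocks. The clock values are the ones quoted above; the dictionary layout and the `overclock_percent` helper are just illustrative.

```python
# Clock speeds in MHz, as quoted in the text.
REFERENCE = {"base": 1506, "boost": 1708}
G1_GAMING = {
    "OC mode": {"base": 1620, "boost": 1847},
    "Gaming mode": {"base": 1594, "boost": 1809},
}


def overclock_percent(clock_mhz, reference_mhz):
    """Percentage increase of a clock speed over its reference value."""
    return (clock_mhz / reference_mhz - 1) * 100


for mode, clocks in G1_GAMING.items():
    for kind, mhz in clocks.items():
        gain = overclock_percent(mhz, REFERENCE[kind])
        print(f"{mode} {kind}: {mhz} MHz (+{gain:.1f}% over reference)")
```

Both modes land in the single digits percentage-wise, roughly 6% for gaming mode and roughly 8% for OC mode.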
The other cards should have similarly decent factory overclocks, but it is hard to say exactly what they will be out of the box. While I am not a big fan of the aesthetics, the Windforce coolers should let users push Pascal fairly far (for air cooling).
I would guess that the Gigabyte GTX 1060 G1 Gaming will carry an MSRP just above $300 while the lower end cards will be around $260 (the Mini ITX OC may be at a slight premium above that).
What do you think about Gigabyte's new cards?
Subject: Graphics Cards | July 22, 2016 - 05:51 PM | Scott Michaud
Tagged: pascal, nvidia, graphics drivers
Turns out the Pascal-based GPUs suffered from DPC latency issues, and there's been an ongoing discussion about it for a little over a month. This is not an area that I know a lot about, but it's a system that schedules workloads by priority, which provides regular windows of time for sound and video devices to update. It can be stalled by long-running driver code, though, which could manifest as stutter, audio hitches, and other performance issues. With a 10-series GeForce device installed, users have reported that this latency increases about 10-20x, from ~20us to ~300-400us. This can increase to 1000us or more under load. (8333us is ~1 whole frame at 120FPS.)
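For a sense of scale, this small Python sketch compares the reported latency figures against the time available per frame. The helper names are made up for illustration, and treating a latency spike as a straight slice of the frame time is a simplification.

```python
def frame_budget_us(fps):
    """Time available to produce one frame, in microseconds."""
    return 1_000_000 / fps


def share_of_frame(latency_us, fps):
    """A latency figure expressed as a percentage of one frame's time budget."""
    return latency_us / frame_budget_us(fps) * 100


# Latency figures (in microseconds) taken from the reports described above.
for latency in (20, 400, 1000):
    print(f"{latency}us is {share_of_frame(latency, 120):.1f}% of a frame at 120FPS")
```

By this rough measure, a 1000us stall eats about an eighth of a frame at 120FPS, while the typical ~20us figure is negligible.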
NVIDIA has acknowledged the issue and, just yesterday, released an optional hotfix. Upon installing the driver, while it could just be psychosomatic, the system felt a lot more responsive. I ran LatencyMon (DPCLat isn't compatible with Windows 8.x or Windows 10) before and after, and the latency measurement did drop significantly. It was consistently the largest source of latency, spiking in the thousands of microseconds, before the update. After the update, it was hidden by other drivers for the first night, although today it seems to have a few spikes again. That said, Microsoft's networking driver is also spiking in the ~200-300us range, so a good portion of it might be the sad state of my current OS install. I've been meaning to do a good system wipe for a while...
Measurement taken after the hotfix, while running Spotify.
That said, my computer's a mess right now.
That said, some of the post-hotfix driver spikes are reaching ~570us (mostly when I play music on Spotify through my Blue Yeti Pro). Also, Photoshop CC 2015 started complaining about graphics acceleration issues after installing the hotfix, so only install it if you're experiencing problems. About the latency, if it's not just my machine, NVIDIA might still have some work to do.
It does feel a lot better, though.
Subject: Graphics Cards | July 21, 2016 - 10:21 PM | Ryan Shrout
Tagged: titan x, titan, pascal, nvidia, gp102
Donning the leather jacket he goes very few places without, NVIDIA CEO Jen-Hsun Huang showed up at an AI meet-up at Stanford this evening to show, for the very first time, a graphics card based on a never before seen Pascal GP102 GPU.
Source: Twitter (NVIDIA)
Rehashing an old name, NVIDIA will call this new graphics card the Titan X. You know, like the "new iPad" this is the "new TitanX." Here is the data we know about thus far:
| | Titan X (Pascal) | GTX 1080 | GTX 980 Ti | TITAN X | GTX 980 | R9 Fury X | R9 Fury | R9 Nano | R9 390X |
|---|---|---|---|---|---|---|---|---|---|
| GPU | GP102 | GP104 | GM200 | GM200 | GM204 | Fiji XT | Fiji Pro | Fiji XT | Hawaii XT |
| Rated Clock | 1417 MHz | 1607 MHz | 1000 MHz | 1000 MHz | 1126 MHz | 1050 MHz | 1000 MHz | up to 1000 MHz | 1050 MHz |
| Texture Units | 224 (?) | 160 | 176 | 192 | 128 | 256 | 224 | 256 | 176 |
| ROP Units | 96 (?) | 64 | 96 | 96 | 64 | 64 | 64 | 64 | 64 |
| Memory Clock | 10000 MHz | 10000 MHz | 7000 MHz | 7000 MHz | 7000 MHz | 500 MHz | 500 MHz | 500 MHz | 6000 MHz |
| Memory Interface | 384-bit G5X | 256-bit G5X | 384-bit | 384-bit | 256-bit | 4096-bit (HBM) | 4096-bit (HBM) | 4096-bit (HBM) | 512-bit |
| Memory Bandwidth | 480 GB/s | 320 GB/s | 336 GB/s | 336 GB/s | 224 GB/s | 512 GB/s | 512 GB/s | 512 GB/s | 320 GB/s |
| TDP | 250 watts | 180 watts | 250 watts | 250 watts | 165 watts | 275 watts | 275 watts | 175 watts | 275 watts |
| Peak Compute | 11.0 TFLOPS | 8.2 TFLOPS | 5.63 TFLOPS | 6.14 TFLOPS | 4.61 TFLOPS | 8.60 TFLOPS | 7.20 TFLOPS | 8.19 TFLOPS | 5.63 TFLOPS |

Note: everything marked with a ? is an educated guess on our part.
Obviously there is a lot for us to still learn about this new GPU and graphics card, including why in the WORLD it is still being called Titan X, rather than...just about anything else. That aside, GP102 will feature 40% more CUDA cores than the GP104 at slightly lower clock speeds. The rated 11 TFLOPS of single precision compute of the new Titan X is 34% better than that of the GeForce GTX 1080 and I would expect gaming performance to scale in line with that difference.
The new Titan X will feature 12GB of GDDR5X memory, not HBM as the GP100 chip has, so this is clearly a new chip with a new memory interface. NVIDIA claims it will have 480 GB/s of bandwidth, and I am guessing it is built on a 384-bit memory controller interface running at the same 10 Gbps as the GTX 1080. It's truly amazing hardware.
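The 384-bit guess is easy to sanity check, since peak memory bandwidth is just the bus width multiplied by the per-pin data rate, divided by eight to convert bits to bytes. Here is a quick Python sketch of that calculation; the function name is made up for illustration, and the 384-bit width for the new Titan X remains a guess rather than a confirmed spec.

```python
def memory_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s from bus width (bits) and per-pin data rate (Gbps)."""
    return bus_width_bits * data_rate_gbps / 8


titan_x_pascal = memory_bandwidth_gbs(384, 10)  # guessed 384-bit bus at 10 Gbps
gtx_1080 = memory_bandwidth_gbs(256, 10)        # 256-bit GDDR5X at 10 Gbps
gtx_980_ti = memory_bandwidth_gbs(384, 7)       # 384-bit GDDR5 at 7 Gbps

print(titan_x_pascal, gtx_1080, gtx_980_ti)
```

A 384-bit bus at 10 Gbps works out to exactly the 480 GB/s that NVIDIA claims, which is why that configuration seems the most likely.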
What will you be asked to pay? $1200, going on sale on August 2nd, and only on NVIDIA.com, at least for now. Considering the prices of GeForce GTX 1080 cards with such limited availability, the $1200 price tag MIGHT NOT seem so insane. That's higher than the $999 starting price of the Titan X based on Maxwell in March of 2015 - the claims that NVIDIA is artificially raising prices of cards in each segment will continue, it seems.
I am curious about the TDP on the new Titan X - will it hit the 250 watt mark of the previous version? Yes, apparently it will hit that 250 watt TDP - specs above updated. Does this also mean we'll see a GeForce GTX 1080 Ti that falls between the GTX 1080 and this new Titan X? Maybe, but we are likely looking at an $899 or higher SEP - so get those wallets ready.
That's it for now; we'll have a briefing where we can get more details soon, and hopefully a review ready for you on August 2nd when the cards go on sale!
Subject: Graphics Cards | July 21, 2016 - 02:04 PM | Jeremy Hellstrom
Tagged: gtx 460, gtx 760, gtx 960, gtx 1060, fermi, kepler, maxwell, pascal
Phoronix took a look at how the Linux performance of NVIDIA's mid range cards has changed over the past four generations of GPU, from Fermi, through Kepler and Maxwell, and finally to Pascal. CS:GO was run at 4k to push the newer GPUs, as was DOTA, much to the dismay of the GTX 460. The scaling is rather interesting: there is a very large delta between Fermi and Kepler which comes close to being replicated when comparing Maxwell to Pascal. From the looks of the vast majority of the tests, the GTX 1060 will be a noticeable upgrade for Linux users no matter which previous mid range card they are currently using. We will likely see a similar article covering AMD in the near future.
"To complement yesterday's launch-day GeForce GTX 1060 Linux review, here are some more benchmark results with the various NVIDIA x60 graphics cards I have available for testing going back to the GeForce GTX 460 Fermi. If you are curious about the raw OpenGL/OpenCL/CUDA performance and performance-per-Watt for these mid-range x60 graphics cards from Fermi, Kepler, Maxwell, and Pascal, here are these benchmarks from Ubuntu 16.04 Linux."
Here are some more Graphics Card articles from around the web:
- ASUS ROG STRIX-GTX1070-O8G-GAMING: GTX 1070, Strix Style! @ Bjorn3d
- MSI GeForce GTX 1060 Gaming X Review @HiTech Legion
- EVGA GeForce GTX 1070 SC Gaming ACX 3.0 Review - Affordable Enthusiast Gaming @HiTech Legion
- Radeon RX 480 performance revisited with AMD's 16.7.1 driver @ The Tech Report
- AMD Radeon RX 480 8GB CrossFire @ [H]ard|OCP
Subject: Graphics Cards | July 19, 2016 - 01:54 PM | Jeremy Hellstrom
Tagged: pascal, nvidia, gtx 1060, gp106, geforce, founders edition
The GTX 1060 Founders Edition has arrived and also happens to be our first look at the 16nm FinFET GP106 silicon; the GTX 1080 and 1070 used GP104. This card features 10 SMs, 1280 CUDA cores, 48 ROPs and 80 texture units, so in many ways it is half of a GTX 1080. The GPU is clocked at a base of 1506MHz with a boost of 1708MHz, with the 6GB of VRAM at 8GHz. [H]ard|OCP took this card through its paces, contrasting it with the RX480 and the GTX 980 at resolutions of 1440p as well as the more common 1080p. As they do not use the frame rating tools which are the basis of our graphics testing of all cards, including the GTX 1060 of course, they included the new DOOM in their test suite. Read on to see how they felt the card compared to the competition ... just don't expect to see a follow up article on SLI performance.
"NVIDIA's GeForce GTX 1060 video card is launched today in the $249 and $299 price point for the Founders Edition. We will find out how it performs in comparison to AMD Radeon RX 480 in DOOM with the Vulkan API as well as DX12 and DX11 games. We'll also see how a GeForce GTX 980 compares in real world gaming."
Here are some more Graphics Card articles from around the web:
- The NVIDIA GTX 1060 6GB Review @ Hardware Canucks
- A quick look at Nvidia's GeForce GTX 1060 @ The Tech Report
- NVIDIA GeForce GTX 1060 Founders Edition Review @ OCC
- NVIDIA GeForce GTX 1060 Founder’s Edition @ Tech ARP
- NVIDIA GeForce GTX 1060 6GB Graphics Card Review @ Techgage
- GeForce GTX 1060 @ Hardwareheaven
- Nvidia GTX 1060 6GB Founders Edition @ Kitguru
- MSI GeForce GTX 1060 Gaming X 6 GB @ techPowerUp
- NVIDIA GeForce GTX 1060 6 GB @ techPowerUp
- NVIDIA GeForce GTX 1060 Review - Enthusiast Gaming at a Mainstream Price @ HiTech Legion
- NVIDIA GeForce GTX 1060 Offers Great Performance On Linux @ Phoronix
Twelve days ago, NVIDIA announced its competitor to the AMD Radeon RX 480, the GeForce GTX 1060, based on a new Pascal GPU: GP106. Though that story was just a brief preview of the product, and a pictorial of the GTX 1060 Founders Edition card we were initially sent, it set the community ablaze with discussion around which mainstream enthusiast platform was going to be the best for gamers this summer.
Today we are allowed to show you our full review: benchmarks of the new GeForce GTX 1060 against the likes of the Radeon RX 480, the GTX 970 and GTX 980, and more. Starting at $250, the GTX 1060 has the potential to be the best bargain in the market today, though much of that will be decided based on product availability and our results on the following pages.
Does NVIDIA’s third consumer product based on Pascal make enough of an impact to dissuade gamers from buying into AMD Polaris?
All signs point to a bloody battle this July and August and the retail cards based on the GTX 1060 are making their way to our offices sooner than even those based around the RX 480. It is those cards, and not the reference/Founders Edition option, that will be the real competition that AMD has to go up against.
First, however, it’s important to find our baseline: where does the GeForce GTX 1060 find itself in the wide range of GPUs?
Subject: Graphics Cards | July 16, 2016 - 06:37 PM | Scott Michaud
Tagged: Volta, pascal, nvidia, maxwell, 16nm
For the past few generations, NVIDIA has roughly been trying to release a new architecture with a new process node, and release a refresh the following year. This ran into a hitch as Maxwell was delayed a year, apart from the GTX 750 Ti, and then pushed back to the same 28nm process that Kepler utilized. Pascal caught up with 16nm, although we know that some hard, physical limitations are right around the corner. The lattice spacing for silicon at room temperature is around 0.5nm, so we're talking about features that are somewhere in the low 30s of atoms in width.
This rumor claims that NVIDIA is not trying to go with 10nm for Volta. Instead, it will be built on the same 16nm node that Pascal currently occupies. This is quite interesting, because GPUs scale quite well with complexity changes, as they have many features with a relatively low clock rate, so on the same node the only real ways to increase performance are to make the existing architecture more efficient, or make a larger chip.
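The back-of-the-envelope math behind that atom count looks like this in Python. It is only a rough sketch: it treats the node name as a literal feature width, which is a simplification, and it uses the rounded 0.5nm lattice spacing from the paragraph above. The function name is made up for illustration.

```python
def lattice_cells_across(feature_nm, lattice_nm=0.5):
    """Roughly how many lattice spacings fit across a feature of the given width."""
    return feature_nm / lattice_nm


# The 28nm node that Kepler and Maxwell used, and the 16nm node used by Pascal.
for node_nm in (28, 16):
    print(f"{node_nm}nm is roughly {lattice_cells_across(node_nm):.0f} lattice spacings wide")
```

Plugging in a slightly larger lattice spacing pulls the 16nm result down toward 30, which is why the figure is best read as a ballpark rather than an exact count.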
That said, GP100 leaves a lot of room on the table for an FP32-optimized, ~600mm2 part to crush its performance at the high end, similar to how GM200 replaced GK110. The rumored GP102, expected in the ~450mm2 range for Titan or GTX 1080 Ti-style parts, has some room to grow. Like GM200, however, it would also be unappealing to GPU compute users who need FP64. If this is what is going on, and we're totally just speculating at the moment, it would signal that enterprise customers should expect a new GPGPU card every second gaming generation.
That is, of course, unless NVIDIA recognized ways to make the Maxwell-based architecture significantly more die-space efficient in Volta. Clocks could get higher, or the circuits themselves could get simpler. You would think that, especially in the latter case, they would have integrated those ideas into Maxwell and Pascal, though; but, like HBM2 memory, there might have been a reason why they couldn't.
We'll need to wait and see. The entire rumor could be crap, who knows?