All | Editorial | General Tech | Graphics Cards | Networking | Motherboards | Cases and Cooling | Processors | Chipsets | Memory | Displays | Systems | Storage | Mobile | Shows and Expos
Why Two 4GB GPUs Isn't Necessarily 8GB
We're trying something new here at PC Perspective. Some topics are fairly difficult to explain cleanly without accompanying images. We also like to go fairly deep into specific topics, so we're hoping that we can provide educational cartoons that explain these issues.
This pilot episode is about load-balancing and memory management in multi-GPU configurations. There seems to be a lot of confusion around what was (and was not) possible with DirectX 11 and OpenGL, and even more confusion about what DirectX 12, Mantle, and Vulkan allow developers to do. It highlights three different load-balancing algorithms, and even briefly mentions what LucidLogix was attempting to accomplish almost ten years ago.
If you like it, and want to see more, please share and support us on Patreon. We're putting this out not knowing if it's popular enough to be sustainable. The best way to see more of this is to share!
Is Enterprise Ascending Outside of Consumer Viability?
So a couple of weeks have gone by since the Quadro P6000 (update: was announced) and the new Titan X launched. With them, we received a new chip: GP102. Since Fermi, NVIDIA has labeled their GPU designs with a G, followed by a single letter for the architecture (F, K, M, or P for Fermi, Kepler, Maxwell, and Pascal, respectively), which is then followed by a three digit number. The last digit is the most relevant one, however, as it separates designs by their intended size.
Typically, 0 corresponds to a ~550-600mm2 design, which is about as larger of a design that fabrication labs can create without error-prone techniques, like
multiple exposures (update for clarity: trying to precisely overlap multiple designs to form a larger integrated circuit). 4 corresponds to ~300mm2, although GM204 was pretty large at 398mm2, which was likely to increase the core count while remaining on a 28nm process. Higher numbers, like 6 or 7, fill back the lower-end SKUs until NVIDIA essentially stops caring for that generation. So when we moved to Pascal, jumping two whole process nodes, NVIDIA looked at their wristwatches and said “about time to make another 300mm2 part, I guess?”
The GTX 1080 and the GTX 1070 (GP104, 314mm2) were born.
NVIDIA already announced a 600mm2 part, though. The GP100 had 3840 CUDA cores, HBM2 memory, and an ideal ratio of 1:2:4 between FP64:FP32:FP16 performance. (A 64-bit chunk of memory can store one 64-bit value, two 32-bit values, or four 16-bit values, unless the register is attached to logic circuits that, while smaller, don't know how to operate on the data.) This increased ratio, even over Kepler's 1:6 FP64:FP32, is great for GPU compute, but wasted die area for today's (and tomorrow's) games. I'm predicting that it takes the wind out of Intel's sales, as Xeon Phi's 1:2 FP64:FP32 performance ratio is one of its major selling points, leading to its inclusion in many supercomputers.
Despite the HBM2 memory controller supposedly being actually smaller than GDDR5(X), NVIDIA could still save die space while still providing 3840 CUDA cores (despite disabling a few on Titan X). The trade-off is that FP64 and FP16 performance had to decrease dramatically, from 1:2 and 2:1 relative to FP32, all the way down to 1:32 and 1:64. This new design comes in at 471mm2, although it's $200 more expensive than what the 600mm2 products, GK110 and GM200, launched at. Smaller dies provide more products per wafer, and, better, the number of defective chips should be relatively constant.
Anyway, that aside, it puts NVIDIA in an interesting position. Splitting the xx0-class chip into xx0 and xx2 designs allows NVIDIA to lower the cost of their high-end gaming parts, although it cuts out hobbyists who buy a Titan for double-precision compute. More interestingly, it leaves around 150mm2 for AMD to sneak in a design that's FP32-centric, leaving them a potential performance crown.
Image Credit: ExtremeTech
On the other hand, as fabrication node changes are becoming less frequent, it's possible that NVIDIA could be leaving itself room for Volta, too. Last month, it was rumored that NVIDIA would release two architectures at 16nm, in the same way that Maxwell shared 28nm with Kepler. In this case, Volta, on top of whatever other architectural advancements NVIDIA rolls into that design, can also grow a little in size. At that time, TSMC would have better yields, making a 600mm2 design less costly in terms of waste and recovery.
If this is the case, we could see the GPGPU folks receiving a new architecture once every second gaming (and professional graphics) architecture. That is, unless you are a hobbyist. If you are? I would need to be wrong, or NVIDIA would need to somehow bring their enterprise SKU into an affordable price point. The xx0 class seems to have been pushed up and out of viability for consumers.
Or, again, I could just be wrong.
Take your Pascal on the go
Easily the strongest growth segment in PC hardware today is in the adoption of gaming notebooks. Ask companies like MSI and ASUS, even Gigabyte, as they now make more models and sell more units of notebooks with a dedicated GPU than ever before. Both AMD and NVIDIA agree on this point and it’s something that AMD was adamant in discussing during the launch of the Polaris architecture.
Both AMD and NVIDIA predict massive annual growth in this market – somewhere on the order of 25-30%. For an overall culture that continues to believe the PC is dying, seeing projected growth this strong in any segment is not only amazing, but welcome to those of us that depend on it. AMD and NVIDIA have different goals here: GeForce products already have 90-95% market share in discrete gaming notebooks. In order for NVIDIA to see growth in sales, the total market needs to grow. For AMD, simply taking back a portion of those users and design wins would help its bottom line.
But despite AMD’s early talk about getting Polaris 10 and 11 in mobile platforms, it’s NVIDIA again striking first. Gaming notebooks with Pascal GPUs in them will be available today, from nearly every system vendor you would consider buying from: ASUS, MSI, Gigabyte, Alienware, Razer, etc. NVIDIA claims to have quicker adoption of this product family in notebooks than in any previous generation. That’s great news for NVIDIA, but might leave AMD looking in from the outside yet again.
Technologically speaking though, this makes sense. Despite the improvement that Polaris made on the GCN architecture, Pascal is still more powerful and more power efficient than anything AMD has been able to product. Looking solely at performance per watt, which is really the defining trait of mobile designs, Pascal is as dominant over Polaris as Maxwell was to Fiji. And this time around NVIDIA isn’t messing with cut back parts that have brand changes – GeForce is diving directly into gaming notebooks in a way we have only seen with one release.
The ASUS G752VS OC Edition with GTX 1070
Do you remember our initial look at the mobile variant of the GeForce GTX 980? Not the GTX 980M mind you, the full GM204 operating in notebooks. That was basically a dry run for what we see today: NVIDIA will be releasing the GeForce GTX 1080, GTX 1070 and GTX 1060 to notebooks.
A Beautiful Graphics Card
As a surprise to nearly everyone, on July 21st NVIDIA announced the existence of the new Titan X graphics cards, which are based on the brand new GP102 Pascal GPU. Though it shares a name, for some unexplained reason, with the Maxwell-based Titan X graphics card launched in March of 2015, this is card is a significant performance upgrade. Using the largest consumer-facing Pascal GPU to date (with only the GP100 used in the Tesla P100 exceeding it), the new Titan X is going to be a very expensive, and very fast gaming card.
As has been the case since the introduction of the Titan brand, NVIDIA claims that this card is for gamers that want the very best in graphics hardware as well as for developers and need an ultra-powerful GPGPU device. GP102 does not integrate improved FP64 / double precision compute cores, so we are basically looking at an upgraded and improved GP104 Pascal chip. That’s nothing to sneeze at, of course, and you can see in the specifications below that we expect (and can now show you) Titan X (Pascal) is a gaming monster.
|Titan X (Pascal)||GTX 1080||GTX 980 Ti||TITAN X||GTX 980||R9 Fury X||R9 Fury||R9 Nano||R9 390X|
|GPU||GP102||GP104||GM200||GM200||GM204||Fiji XT||Fiji Pro||Fiji XT||Hawaii XT|
|Rated Clock||1417 MHz||1607 MHz||1000 MHz||1000 MHz||1126 MHz||1050 MHz||1000 MHz||up to 1000 MHz||1050 MHz|
|Memory Clock||10000 MHz||10000 MHz||7000 MHz||7000 MHz||7000 MHz||500 MHz||500 MHz||500 MHz||6000 MHz|
|Memory Interface||384-bit G5X||256-bit G5X||384-bit||384-bit||256-bit||4096-bit (HBM)||4096-bit (HBM)||4096-bit (HBM)||512-bit|
|Memory Bandwidth||480 GB/s||320 GB/s||336 GB/s||336 GB/s||224 GB/s||512 GB/s||512 GB/s||512 GB/s||320 GB/s|
|TDP||250 watts||180 watts||250 watts||250 watts||165 watts||275 watts||275 watts||175 watts||275 watts|
|Peak Compute||11.0 TFLOPS||8.2 TFLOPS||5.63 TFLOPS||6.14 TFLOPS||4.61 TFLOPS||8.60 TFLOPS||7.20 TFLOPS||8.19 TFLOPS||5.63 TFLOPS|
GP102 features 40% more CUDA cores than the GP104 at slightly lower clock speeds. The rated 11 TFLOPS of single precision compute of the new Titan X is 34% higher than that of the GeForce GTX 1080 and I would expect gaming performance to scale in line with that difference.
Titan X (Pascal) does not utilize the full GP102 GPU; the recently announced Pascal P6000 does, however, which gives it a CUDA core count of 3,840 (256 more than Titan X).
A full GP102 GPU
The complete GPU effectively loses 7% of its compute capability with the new Titan X, although that is likely to help increase available clock headroom and yield.
The new Titan X will feature 12GB of GDDR5X memory, not HBM as the GP100 chip has, so this is clearly a unique chip with a new memory interface. NVIDIA claims it has 480 GB/s of bandwidth on a 384-bit memory controller interface running at the same 10 Gbps as the GTX 1080.
Realworldtech with Compelling Evidence
Yesterday David Kanter of Realworldtech posted a pretty fascinating article and video that explored the two latest NVIDIA architectures and how they have branched away from the traditional immediate mode rasterization units. It has revealed through testing that with Maxwell and Pascal NVIDIA has gone to a tiling method with rasterization. This is a somewhat significant departure for the company considering they have utilized the same basic immediate mode rasterization model since the 90s.
The Videologic Apocolypse 3Dx based on the PowerVR PCX2.
(photo courtesy of Wikipedia)
Tiling is an interesting subject and we can harken back to the PowerVR days to see where it was first implemented. There are many advantages to tiling and deferred rendering when it comes to overall efficiency in power and memory bandwidth. These first TBDR (Tile Based Deferred Renderers) offered great performance per clock and could utilize slower memory as compared to other offerings of the day (namely Voodoo Graphics). There were some significant drawbacks to the technology. Essentially a lot of work had to be done by the CPU and driver in scene setup and geometry sorting. On fast CPU systems the PowerVR boards could provide very good performance, but it suffered on lower end parts as compared to the competition. This is a very simple explanation of what is going on, but the long and short of it is that TBDR did not take over the world due to limitations in its initial implementations. Traditional immediate mode rasters would improve in efficiency and performance with aggressive Z checks and other optimizations that borrow from the TBDR playbook.
Tiling is also present in a lot of mobile parts. Imagination’s PowerVR graphics technologies have been implemented by others such as Intel, Apple, Mediatek, and others. Qualcomm (Adreno) and ARM (Mali) both implement tiler technologies to improve power consumption and performance while increasing bandwidth efficiency. Perhaps most interestingly we can remember back to the Gigapixel days with the GP-1 chip that implemented a tiling method that seemed to work very well without the CPU hit and driver overhead that had plagued the PowerVR chips up to that point. 3dfx bought Gigapixel for some $150 million at the time. That company then went on to file bankruptcy a year later and their IP was acquired by NVIDIA.
Screenshot of the program used to uncover the tiling behavior of the rasterizer.
It now appears as though NVIDIA has evolved their raster units to embrace tiling. This is not a full TBDR implementation, but rather an immediate mode tiler that will still break up the scene in tiles but does not implement deferred rendering. This change should improve bandwidth efficiency when it comes to rasterization, but it does not affect the rest of the graphics pipeline by forcing it to be deferred (tessellation, geometry setup and shaders, etc. are not impacted). NVIDIA has not done a deep dive on this change for editors, so we do not know the exact implementation and what advantages we can expect. We can look at the evidence we have and speculate where those advantages exist.
The video where David Kanter explains his findings
Bandwidth and Power
Tilers have typically taken the tiled regions and buffered them on the chip. This is a big improvement in both performance and power efficiency as the raster data does not have to be cached and written out to the frame buffer and then swapped back. This makes quite a bit of sense considering the overall lack of big jumps in memory technologies over the past five years. We have had GDDR-5 since 2007/2008. The speeds have increased over time, but the basic technology is still much the same. We have seen HBM introduced with AMD’s Fury series, but large scale production of HBM 2 is still to come. Samsung has released small amounts of HBM 2 to the market, but not nearly enough to handle the needs of a mass produced card. GDDR-5X is an extension of GDDR-5 that does offer more bandwidth, but it is still not a next generation memory technology like HBM 2.
By utilizing a tiler NVIDIA is able to lower memory bandwidth needs for the rasterization stage. Considering that both Maxwell and Pascal architectures are based on GDDR-5 and 5x technologies, it makes sense to save as much bandwidth as possible where they can. This is again probably one, among many, of the reasons that we saw a much larger L2 cache in Maxwell vs. Kepler (2048 KB vs. 256KB respectively). Every little bit helps when we are looking at hard, real world bandwidth limits for a modern GPU.
The area of power efficiency has also come up in discussion when going to a tiler. Tilers have traditionally been more power efficient as well due to how the raster data is tiled and cached, requiring fewer reads and writes to main memory. The first impulse is to say, “Hey, this is the reason why NVIDIA’s Maxwell was so much more power efficient than Kepler and AMD’s latest parts!” Sadly, this is not exactly true. The tiler is more power efficient, but it is a small part to the power savings on a GPU.
The second fastest Pascal based card...
A modern GPU is very complex. There are some 7.2 billion transistors on the latest Pascal GP-104 that powers the GTX 1080. The vast majority of those transistors are implemented in the shader units of the chip. While the raster units are very important, they are but a fraction of that transistor budget. The rest is taken up by power regulation, PCI-E controllers, and memory controllers. In the big scheme of things the raster portion is going to be dwarfed in power consumption by the shader units. This does not mean that they are not important though. Going back to the hated car analogy, one does not achieve weight savings by focusing on one aspect alone. It is going over every single part of the car and shaving ounces here and there, and in the end achieving significant savings by addressing every single piece of a complex product.
This does appear to be the long and short of it. This is one piece of a very complex ASIC that improves upon memory bandwidth utilization and power efficiency. It is not the whole story, but it is an important part. I find it interesting that NVIDIA did not disclose this change to editors with the introduction of Maxwell and Pascal, but if it is transparent to users and developers alike then there is no need. There is a lot of “secret sauce” that goes into each architecture, and this is merely one aspect. The one question that I do have is how much of the technology is based upon the Gigapixel IP that 3dfx bought at such a premium? I believe that particular tiler was an immediate mode renderer as well due to it not having as many driver and overhead issues that PowerVR exhibited back in the day. Obviously it would not be a copy/paste of the technology that was developed back in the 90s, it would be interesting to see if it was the basis for this current implementation.
Yes, We're Writing About a Forum Post
Update - July 19th @ 7:15pm EDT: Well that was fast. Futuremark published their statement today. I haven't read it through yet, but there's no reason to wait to link it until I do.
Update 2 - July 20th @ 6:50pm EDT: We interviewed Jani Joki, Futuremark's Director of Engineering, on our YouTube page. The interview is embed just below this update.
Original post below
The comments of a previous post notified us of an Overclock.net thread, whose author claims that 3DMark's implementation of asynchronous compute is designed to show NVIDIA in the best possible light. At the end of the linked post, they note that asynchronous compute is a general blanket, and that we should better understand what is actually going on.
So, before we address the controversy, let's actually explain what asynchronous compute is. The main problem is that it actually is a broad term. Asynchronous compute could describe any optimization that allows tasks to execute when it is most convenient, rather than just blindly doing them in a row.
This is asynchronous computing.
Twelve days ago, NVIDIA announced its competitor to the AMD Radeon RX 480, the GeForce GTX 1060, based on a new Pascal GPU; GP 106. Though that story was just a brief preview of the product, and a pictorial of the GTX 1060 Founders Edition card we were initially sent, it set the community ablaze with discussion around which mainstream enthusiast platform was going to be the best for gamers this summer.
Today we are allowed to show you our full review: benchmarks of the new GeForce GTX 1060 against the likes of the Radeon RX 480, the GTX 970 and GTX 980, and more. Starting at $250, the GTX 1060 has the potential to be the best bargain in the market today, though much of that will be decided based on product availability and our results on the following pages.
Does NVIDIA’s third consumer product based on Pascal make enough of an impact to dissuade gamers from buying into AMD Polaris?
All signs point to a bloody battle this July and August and the retail cards based on the GTX 1060 are making their way to our offices sooner than even those based around the RX 480. It is those cards, and not the reference/Founders Edition option, that will be the real competition that AMD has to go up against.
First, however, it’s important to find our baseline: where does the GeForce GTX 1060 find itself in the wide range of GPUs?
Through the looking glass
Futuremark has been the most consistent and most utilized benchmark company for PCs for quite a long time. While other companies have faltered and faded, Futuremark continues to push forward with new benchmarks and capabilities in an attempt to maintain a modern way to compare performance across platforms with standardized tests.
Back in March of 2015, 3DMark added support for an API Overhead test to help gamers and editors understand the performance advantages of Mantle and DirectX 12 compared to existing APIs. Though the results were purely “peak theoretical” numbers, the data helped showcase to consumers and developers what low levels APIs brought to the table.
Today Futuremark is releasing a new benchmark that focuses on DX12 gaming. No longer just a feature test, Time Spy is a fully baked benchmark with its own rendering engine and scenarios for evaluating the performance of graphics cards and platforms. It requires Windows 10 and a DX12-capable graphics card, and includes two different graphics tests and a CPU test. Oh, and of course, there is a stunningly gorgeous demo mode to go along with it.
I’m not going to spend much time here dissecting the benchmark itself, but it does make sense to have an idea of what kind of technologies are built into the game engine and tests. The engine is based purely on DX12, and integrates technologies like asynchronous compute, explicit multi-adapter and multi-threaded workloads. These are highly topical ideas and will be the focus of my testing today.
Futuremark provides an interesting diagram to demonstrate the advantages DX12 has over DX11. Below you will find a listing of the average number of vertices, triangles, patches and shader calls in 3DMark Fire Strike compared with 3DMark Time Spy.
It’s not even close here – the new Time Spy engine has more than a factor of 10 more processing calls for some of these items. As Futuremark states, however, this kind of capability isn’t free.
With DirectX 12, developers can significantly improve the multi-thread scaling and hardware utilization of their titles. But it requires a considerable amount of graphics expertise and memory-level programming skill. The programming investment is significant and must be considered from the start of a project.
Radeon Software 16.7.1 Adjustments
Last week we posted a story that looked at a problem found with the new AMD Radeon RX 480 graphics card’s power consumption. The short version of the issue was that AMD’s new Polaris 10-based reference card was drawing more power than its stated 150 watt TDP and that it was drawing more power through the motherboard PCI Express slot that the connection was rated for. And sometimes that added power draw was significant, both at stock settings and overclocked. Seeing current draw over a connection rated at just 5.5A peaking over 7A at stock settings raised an alarm (validly) and our initial report detailed the problem very specifically.
AMD responded initially that “everything was fine here” but the company eventually saw the writing on the wall and started to work on potential solutions. The Radeon RX 480 is a very important product for the future of Radeon graphics and this was a launch that needs to be as perfect as it can be. Though the risk to users’ hardware with the higher than expected current draw is muted somewhat by motherboard-based over-current protection, it’s crazy to think that AMD actually believed that was the ideal scenario. Depending on the “circuit breaker” in any system to save you when standards exists for exactly that purpose is nuts.
Today AMD has released a new driver, version 16.7.1, that actually introduces a pair of fixes for the problem. One of them is hard coded into the software and adjusts power draw from the different +12V sources (PCI Express slot and 6-pin connector) while the other is an optional flag in the software that is disabled by default.
Reconfiguring the power phase controller
The Radeon RX 480 uses a very common power controller (IR3567B) on its PCB to cycle through the 6 power phases providing electricity to the GPU itself. Allyn did some simple multimeter trace work to tell us which phases were connected to which sources and the result is seen below.
The power controller is responsible for pacing the power coming in from the PCI Express slot and the 6-pin power connection to the GPU, in phases. Phases 1-3 come in from the power supply via the 6-pin connection, while phases 4-6 source power from the motherboard directly. At launch, the RX 480 drew nearly identical amounts of power from both the PEG slot and the 6-pin connection, essentially giving each of the 6 phases at work equal time.
That might seem okay, but it’s far from the standard of what we have seen in the past. In no other case have we measured a graphics card drawing equal power from the PEG slot as from an external power connector on the card. (Obviously for cards without external power connections, that’s a different discussion.) In general, with other AMD and NVIDIA based graphics cards, the motherboard slot would provide no more than 50-60 watts of power, while any above that would come from the 6/8-pin connections on the card. In many cases I saw that power draw through the PEG slot was as low as 20-30 watts if the external power connections provided a lot of overage for the target TDP of the product.
It’s probably not going to come as a surprise to anyone that reads the internet, but NVIDIA is officially taking the covers off its latest GeForce card in the Pascal family today, the GeForce GTX 1060. As the number scheme would suggest, this is a more budget-friendly version of NVIDIA’s latest architecture, lowering performance in line with expectations. The GP106-based GPU will still offer impressive specifications and capabilities and will probably push AMD’s new Radeon RX 480 to its limits.
Let’s take a quick look at the card’s details.
|GTX 1060||RX 480||R9 390||R9 380||GTX 980||GTX 970||GTX 960||R9 Nano||GTX 1070|
|GPU||GP106||Polaris 10||Grenada||Tonga||GM204||GM204||GM206||Fiji XT||GP104|
|Rated Clock||1506 MHz||1120 MHz||1000 MHz||970 MHz||1126 MHz||1050 MHz||1126 MHz||up to 1000 MHz||1506 MHz|
|Texture Units||80 (?)||144||160||112||128||104||64||256||120|
|ROP Units||48 (?)||32||64||32||64||56||32||64||64|
|Memory Clock||8000 MHz||7000 MHz
|6000 MHz||5700 MHz||7000 MHz||7000 MHz||7000 MHz||500 MHz||8000 MHz|
|Memory Interface||192-bit||256-bit||512-bit||256-bit||256-bit||256-bit||128-bit||4096-bit (HBM)||256-bit|
|Memory Bandwidth||192 GB/s||224 GB/s
|384 GB/s||182.4 GB/s||224 GB/s||196 GB/s||112 GB/s||512 GB/s||256 GB/s|
|TDP||120 watts||150 watts||275 watts||190 watts||165 watts||145 watts||120 watts||275 watts||150 watts|
|Peak Compute||3.85 TFLOPS||5.1 TFLOPS||5.1 TFLOPS||3.48 TFLOPS||4.61 TFLOPS||3.4 TFLOPS||2.3 TFLOPS||8.19 TFLOPS||5.7 TFLOPS|
The GeForce GTX 1060 will sport 1280 CUDA cores with a GPU Boost clock speed rated at 1.7 GHz. Though the card will be available in only 6GB varieties, the reference / Founders Edition will ship with 6GB of GDDR5 memory running at 8.0 GHz / 8 Gbps. With 1280 CUDA cores, the GP106 GPU is essentially one half of a GP104 in terms of compute capability. NVIDIA decided not to cut the memory interface in half though, instead going with a 192-bit design compared to the GP104 and its 256-bit option.
The rated GPU clock speeds paint an interesting picture for peak performance of the new card. At the rated boost clock speed, the GeForce GTX 1070 produces 6.46 TFLOPS of performance. The GTX 1060 by comparison will hit 4.35 TFLOPS, a 48% difference. The GTX 1080 offers nearly the same delta of performance above the GTX 1070; clearly NVIDIA has set the scale Pascal and product deviation.
NVIDIA wants us to compare the new GeForce GTX 1060 to the GeForce GTX 980 in gaming performance, but the peak theoretical performance results don’t really match up. The GeForce GTX 980 is rated at 4.61 TFLOPS at BASE clock speed, while the GTX 1060 doesn’t hit that number at its Boost clock. Obviously Pascal improves on performance with memory compression advancements, but the 192-bit memory bus is only able to run at 192 GB/s, compared to the 224 GB/s of the GTX 980. Obviously we’ll have to wait for performance result from our own testing to be sure, but it seems possible that NVIDIA’s performance claims might depend on technology like Simultaneous Multi-Projection and VR gaming to be validated.
Too much power to the people?
UPDATE (7/1/16): I have added a third page to this story that looks at the power consumption and power draw of the ASUS GeForce GTX 960 Strix card. This card was pointed out by many readers on our site and on reddit as having the same problem as the Radeon RX 480. As it turns out...not so much. Check it out!
UPDATE 2 (7/2/16): We have an official statement from AMD this morning.
As you know, we continuously tune our GPUs in order to maximize their performance within their given power envelopes and the speed of the memory interface, which in this case is an unprecedented 8Gbps for GDDR5. Recently, we identified select scenarios where the tuning of some RX 480 boards was not optimal. Fortunately, we can adjust the GPU's tuning via software in order to resolve this issue. We are already testing a driver that implements a fix, and we will provide an update to the community on our progress on Tuesday (July 5, 2016).
Honestly, that doesn't tell us much. And AMD appears to be deflecting slightly by using words like "some RX 480 boards". I don't believe this is limited to a subset of cards, or review samples only. AMD does indicate that the 8 Gbps memory on the 8GB variant might be partially to blame - which is an interesting correlation to test out later. The company does promise a fix for the problem via a driver update on Tuesday - we'll be sure to give that a test and see what changes are measured in both performance and in power consumption.
The launch of the AMD Radeon RX 480 has generally been considered a success. Our review of the new reference card shows impressive gains in architectural efficiency, improved positioning against NVIDIA’s competing parts in the same price range as well as VR-ready gaming performance starting at $199 for the 4GB model. AMD has every right to be proud of the new product and should have this lone position until the GeForce product line brings a Pascal card down into the same price category.
If you read carefully through my review, there was some interesting data that cropped up around the power consumption and delivery on the new RX 480. Looking at our power consumption numbers, measured directly from the card, not from the wall, it was using slightly more than the 150 watt TDP it was advertised as. This was done at 1920x1080 and tested in both Rise of the Tomb Raider and The Witcher 3.
When overclocked, the results were even higher, approaching the 200 watt mark in Rise of the Tomb Raider!
A portion of the review over at Tom’s Hardware produced similar results but detailed the power consumption from the motherboard PCI Express connection versus the power provided by the 6-pin PCIe power cable. There has been a considerable amount of discussion in the community about the amount of power the RX 480 draws through the motherboard, whether it is out of spec and what kind of impact it might have on the stability or life of the PC the RX 480 is installed in.
As it turns out, we have the ability to measure the exact same kind of data, albeit through a different method than Tom’s, and wanted to see if the result we saw broke down in the same way.
Our Testing Methods
This is a complex topic so it makes sense to detail the methodology of our advanced power testing capability up front.
How do we do it? Simple in theory but surprisingly difficult in practice, we are intercepting the power being sent through the PCI Express bus as well as the ATX power connectors before they go to the graphics card and are directly measuring power draw with a 10 kHz DAQ (data acquisition) device. A huge thanks goes to Allyn for getting the setup up and running. We built a PCI Express bridge that is tapped to measure both 12V and 3.3V power and built some Corsair power cables that measure the 12V coming through those as well.
The result is data that looks like this.
What you are looking at here is the power measured from the GTX 1080. From time 0 to time 8 seconds or so, the system is idle, from 8 seconds to about 18 seconds Steam is starting up the title. From 18-26 seconds the game is at the menus, we load the game from 26-39 seconds and then we play through our benchmark run after that.
There are four lines drawn in the graph, the 12V and 3.3V results are from the PCI Express bus interface, while the one labeled PCIE is from the PCIE power connection from the power supply to the card. We have the ability to measure two power inputs there but because the GTX 1080 only uses a single 8-pin connector, there is only one shown here. Finally, the blue line is labeled total and is simply that: a total of the other measurements to get combined power draw and usage by the graphics card in question.
From this we can see a couple of interesting data points. First, the idle power of the GTX 1080 Founders Edition is only about 7.5 watts. Second, under a gaming load of Rise of the Tomb Raider, the card is pulling about 165-170 watts on average, though there are plenty of intermittent, spikes. Keep in mind we are sampling the power at 1000/s so this kind of behavior is more or less expected.
Different games and applications impose different loads on the GPU and can cause it to draw drastically different power. Even if a game runs slowly, it may not be drawing maximum power from the card if a certain system on the GPU (memory, shaders, ROPs) is bottlenecking other systems.
One interesting note on our data compared to what Tom’s Hardware presents – we are using a second order low pass filter to smooth out the data to make it more readable and more indicative of how power draw is handled by the components on the PCB. Tom’s story reported “maximum” power draw at 300 watts for the RX 480 and while that is technically accurate, those figures represent instantaneous power draw. That is interesting data in some circumstances, and may actually indicate other potential issues with excessively noisy power circuitry, but to us, it makes more sense to sample data at a high rate (10 kHz) but to filter it and present it more readable way that better meshes with the continuous power delivery capabilities of the system.
Image source: E2E Texas Instruments
An example of instantaneous voltage spikes on power supply phase changes
Some gamers have expressed concern over that “maximum” power draw of 300 watts on the RX 480 that Tom’s Hardware reported. While that power measurement is technically accurate, it doesn’t represent the continuous power draw of the hardware. Instead, that measure is a result of a high frequency data acquisition system that may take a reading at the exact moment that a power phase on the card switches. Any DC switching power supply that is riding close to a certain power level is going to exceed that on the leading edges of phase switches for some minute amount of time. This is another reason why our low pass filter on power data can help represent real-world power consumption accurately. That doesn’t mean the spikes they measure are not a potential cause for concern, that’s just not what we are focused on with our testing.
Polaris 10 Specifications
It would be hard at this point to NOT know about the Radeon RX 480 graphics card. AMD and the Radeon Technologies Group has been talking publicly about the Polaris architecture since December of 2015 with lofty ambitions. In the precarious position that the company rests, being well behind in market share and struggling to compete with the dominant player in the market (NVIDIA), the team was willing to sacrifice sales of current generation parts (300-series) in order to excite the user base for the upcoming move to Polaris. It is a risky bet and one that will play out over the next few months in the market.
Since then AMD continued to release bits of information at a time. First there were details on the new display support, then information about the 14nm process technology advantages. We then saw demos of working silicon at CES with targeted form factors and then at events in Macau, showed press the full details and architecture. At Computex they announced rough performance metrics and a price point. Finally, at E3, AMD discussed the RX 460 and RX 470 cousins and the release date of…today. It’s been quite a whirlwind.
Today the rubber meets the road: is the Radeon RX 480 the groundbreaking and stunning graphics card that we have been promised? Or does it struggle again to keep up with the behemoth that is NVIDIA’s GeForce product line? AMD’s marketing team would have you believe that the RX 480 is the start of some kind of graphics revolution – but will the coup be successful?
Join us for our second major graphics architecture release of the summer and learn for yourself if the Radeon RX 480 is your next GPU.
AMD gets aggressive
At its Computex 2016 press conference in Taipei today, AMD has announced the branding and pricing, along with basic specifications, for one of its upcoming Polaris GPUs shipping later this June. The Radeon RX 480, based on Polaris 10, will cost just $199 and will offer more than 5 TFLOPS of compute capability. This is an incredibly aggressive move obviously aimed at continuing to gain market share at NVIDIA's expense. Details of the product are listed below.
|RX 480||GTX 1070||GTX 980||GTX 970||R9 Fury||R9 Nano||R9 390X||R9 390|
|GPU||Polaris 10||GP104||GM204||GM204||Fiji Pro||Fiji XT||Hawaii XT||Grenada Pro|
|Rated Clock||?||1506 MHz||1126 MHz||1050 MHz||1000 MHz||up to 1000 MHz||1050 MHz||1000 MHz|
|Memory Clock||8000 MHz||8000 MHz||7000 MHz||7000 MHz||500 MHz||500 MHz||6000 MHz||6000 MHz|
|Memory Interface||256-bit||256-bit||256-bit||256-bit||4096-bit (HBM)||4096-bit (HBM)||512-bit||512-bit|
|Memory Bandwidth||256 GB/s||256 GB/s||224 GB/s||196 GB/s||512 GB/s||512 GB/s||384 GB/s||384 GB/s|
|TDP||150 watts||150 watts||165 watts||145 watts||275 watts||175 watts||275 watts||230 watts|
|Peak Compute||> 5.0 TFLOPS||5.7 TFLOPS||4.61 TFLOPS||3.4 TFLOPS||7.20 TFLOPS||8.19 TFLOPS||5.63 TFLOPS||5.12 TFLOPS|
The RX 480 will ship with 36 CUs totaling 2304 stream processors based on the current GCN breakdown of 64 stream processors per CU. AMD didn't list clock speeds and instead is only telling us that the performance offered will exceed 5 TFLOPS of compute; how much is still a mystery and will likely change based on final clocks.
The memory system is powered by a 256-bit GDDR5 memory controller running at 8 Gbps and hitting 256 GB/s of throughput. This is the same resulting memory bandwidth as NVIDIA's new GeForce GTX 1070 graphics card.
AMD also tells us that the TDP of the card is 150 watts, again matching the GTX 1070, though without more accurate performance data it's hard to assume anything about the new architectural efficiency of the Polaris GPUs built on the 14nm Global Foundries process.
Obviously the card will support FreeSync and all of AMD's VR features, in addition to being DP 1.3 and 1.4 ready.
AMD stated that the RX 480 will launch on June 29th.
I know that many of you will want us to start guessing at what performance level the new RX 480 will actually fall, and trust me, I've been trying to figure it out. Based on TFLOPS rating and memory bandwidth alone, it seems possible that the RX 480 could compete with the GTX 1070. But if that were the case, I don't think even AMD is crazy enough to set the price this far below where the GTX 1070 launched, $379.
I would expect the configuration of the GCN architecture to remain mostly unchanged on Polaris, compared to Hawaii, for the same reasons that we saw NVIDIA leave Pascal's basic compute architecture unchanged compared to Maxwell. Moving to the new process node was the primary goal and adding to that with drastic shifts in compute design might overly complicate product development.
In the past, we have observed that AMD's GCN architecture tends to operate slightly less efficiently in terms of rated maximum compute capability versus realized gaming performance, at least compared to Maxwell and now Pascal. With that in mind, the >5 TFLOPS offered by the RX 480 likely lies somewhere between the Radeon R9 390 and R9 390X in realized gaming output. If that is the case, the Radeon RX 480 should have performance somewhere between the GeForce GTX 970 and the GeForce GTX 980.
AMD claims that the RX 480 at $199 is set to offer a "premium VR experience" that has previously be limited to $500 graphics cards (another reference to the original price of the GTX 980 perhaps...). AMD claims this should have a dramatic impact on increasing the TAM (total addressable market) for VR.
In a notable market survey, price was a leading barrier to adoption of VR. The $199 SEP for select Radeon™ RX Series GPUs is an integral part of AMD’s strategy to dramatically accelerate VR adoption and unleash the VR software ecosystem. AMD expects that its aggressive pricing will jumpstart the growth of the addressable market for PC VR and accelerate the rate at which VR headsets drop in price:
- More affordable VR-ready desktops and notebooks
- Making VR accessible to consumers in retail
- Unleashing VR developers on a larger audience
- Reducing the cost of entry to VR
AMD calls this strategy of starting with the mid-range product its "Water Drop" strategy with the goal "at releasing new graphics architectures in high volume segments first to support continued market share growth for Radeon GPUs."
So what do you guys think? Are you impressed with what Polaris looks like its going to be now?
GP104 Strikes Again
It’s only been three weeks since NVIDIA unveiled the GeForce GTX 1080 and GTX 1070 graphics cards at a live streaming event in Austin, TX. But it feels like those two GPUs, one of which hasn't even been reviewed until today, have already drastically shifted the landscape of graphics, VR and PC gaming.
Half of the “new GPU” stories are told, with AMD due to follow up soon with Polaris, but it was clear to anyone watching the enthusiast segment with a hint of history that a line was drawn in the sand that day. There is THEN, and there is NOW. Today’s detailed review of the GeForce GTX 1070 completes NVIDIA’s first wave of NOW products, following closely behind the GeForce GTX 1080.
Interestingly, and in a move that is very uncharacteristic of NVIDIA, detailed specifications of the GeForce GTX 1070 were released on GeForce.com well before today’s reviews. With information on the CUDA core count, clock speeds, and memory bandwidth it was possible to get a solid sense of where the GTX 1070 performed; and I imagine that many of you already did the napkin math to figure that out. There is no more guessing though - reviews and testing are all done, and I think you'll find that the GTX 1070 is as exciting, if not more so, than the GTX 1080 due to the performance and pricing combination that it provides.
Let’s dive in.
First, Some Background
NVIDIA's Rumored GP102
When GP100 was announced, Josh and I were discussing, internally, how it would make sense in the gaming industry. Recently, an article on WCCFTech cited anonymous sources, which should always be taken with a dash of salt, that claimed NVIDIA was planning a second architecture, GP102, between GP104 and GP100. As I was writing this editorial about it, relating it to our own speculation about the physics of Pascal, VideoCardz claims to have been contacted by the developers of AIDA64, seemingly on-the-record, also citing a GP102 design.
I will retell chunks of the rumor, but also add my opinion to it.
In the last few generations, each architecture had a flagship chip that was released in both gaming and professional SKUs. Neither audience had access to a chip that was larger than the other's largest of that generation. Clock rates and disabled portions varied by specific product, with gaming usually getting the more aggressive performance for slightly better benchmarks. Fermi had GF100/GF110, Kepler had GK110/GK210, and Maxwell had GM200. Each of these were available in Tesla, Quadro, and GeForce cards, especially Titans.
Maxwell was interesting, though. NVIDIA was unable to leave 28nm, which Kepler launched on, so they created a second architecture at that node. To increase performance without having access to more feature density, you need to make your designs bigger, more optimized, or more simple. GM200 was giant and optimized, but, to get the performance levels it achieved, also needed to be more simple. Something needed to go, and double-precision (FP64) performance was the big omission. NVIDIA was upfront about it at the Titan X launch, and told their GPU compute customers to keep purchasing Kepler if they valued FP64.
A new architecture with GP104
Table of Contents
- Asynchronous compute discussion
- Is only 2-Way SLI supported?
- Overclocking over 2.0 GHz
- Dissecting the Founders Edition
- Benchmarks begin
- VR Testing
- Impressive power efficiency
- Performance per dollar discussion
- Ansel screenshot tool
The summer of change for GPUs has begun with today’s review of the GeForce GTX 1080. NVIDIA has endured leaks, speculation and criticism for months now, with enthusiasts calling out NVIDIA for not including HBM technology or for not having asynchronous compute capability. Last week NVIDIA’s CEO Jen-Hsun Huang went on stage and officially announced the GTX 1080 and GTX 1070 graphics cards with a healthy amount of information about their supposed performance and price points. Issues around cost and what exactly a Founders Edition is aside, the event was well received and clearly showed a performance and efficiency improvement that we were not expecting.
The question is, does the actual product live up to the hype? Can NVIDIA overcome some users’ negative view of the Founders Edition to create a product message that will get the wide range of PC gamers looking for an upgrade path an option they’ll take?
I’ll let you know through the course of this review, but what I can tell you definitively is that the GeForce GTX 1080 clearly sits alone at the top of the GPU world.
NVIDIA's Ansel Technology
“In-game photography” is an interesting concept. Not too long ago, it was difficult to just capture the user's direct experience with a title. Print screen could only hold a single screenshot at a time, which allowed Steam and FRAPS to provide a better user experience. FRAPS also made video more accessible to the end-user, but it output huge files and, while it wasn't too expensive, it needed to be purchased online, which was a big issue ten-or-so years ago.
Seeing that their audience would enjoy video captures, NVIDIA introduced ShadowPlay a couple of years ago. The feature allowed users to, not only record video, but also capture the last few minutes. It did this with hardware acceleration, and it did this for free (for compatible GPUs). While I don't use ShadowPlay, preferring the control of OBS, it's a good example of how NVIDIA wants to support their users. They see these features as a value-add, which draw people to their hardware.
History and Specifications
The Radeon Pro Duo had an interesting history. Originally shown as an unbranded, dual-GPU PCB during E3 2015, which took place last June, AMD touted it as the ultimate graphics card for both gamers and professionals. At that time, the company thought that an October launch was feasible, but that clearly didn’t work out. When pressed for information in the Oct/Nov timeframe, AMD said that they had delayed the product into Q2 2016 to better correlate with the launch of the VR systems from Oculus and HTC/Valve.
During a GDC press event in March, AMD finally unveiled the Radeon Pro Duo brand, but they were also walking back on the idea of the dual-Fiji beast being aimed at the gaming crowd, even partially. Instead, the company talked up the benefits for game developers and content creators, such as its 8192 stream processors for offline rendering, or even to aid game devs in the implementation and improvement of multi-GPU for upcoming games.
Anyone that pays attention to the graphics card market can see why AMD would make the positional shift with the Radeon Pro Duo. The Fiji architecture is on the way out, with Polaris due out in June by AMD’s own proclamation. At $1500, the Radeon Pro Duo will be a stark contrast to the prices of the Polaris GPUs this summer, and it is well above any NVIDIA-priced part in the GeForce line. And, though CrossFire has made drastic improvements over the last several years thanks to new testing techniques, the ecosystem for multi-GPU is going through a major shift with both DX12 and VR bearing down on it.
So yes, the Radeon Pro Duo has both RADEON and PRO right there in the name. What’s a respectable PC Perspective graphics reviewer supposed to do with a card like that if it finds its way into your office? Test it of course! I’ll take a look at a handful of recent games as well as a new feature that AMD has integrated with 3DS Max called FireRender to showcase some of the professional chops of the new card.
The Dual-Fiji Card Finally Arrives
This weekend, leaks of information on both WCCFTech and VideoCardz.com have revealed all the information about the pending release of AMD’s dual-GPU giant, the Radeon Pro Duo. While no one at PC Perspective has been briefed on the product officially, all of the interesting data surrounding the product is clearly outlined in the slides on those websites, minus some independent benchmark testing that we are hoping to get to next week. Based on the report from both sites, the Radeon Pro Duo will be released on April 26th.
AMD actually revealed the product and branding for the Radeon Pro Duo back in March, during its live streamed Capsaicin event surrounding GDC. At that point we were given the following information:
- Dual Fiji XT GPUs
- 8GB of total HBM memory
- 4x DisplayPort (this has since been modified)
- 16 TFLOPS of compute
- $1499 price tag
The design of the card follows the same industrial design as the reference designs of the Radeon Fury X, and integrates a dual-pump cooler and external fan/radiator to keep both GPUs running cool.
Based on the slides leaked out today, AMD has revised the Radeon Pro Duo design to include a set of three DisplayPort connections and one HDMI port. This was a necessary change as the Oculus Rift requires an HDMI port to work; only the HTC Vive has built in support for a DisplayPort connection and even in that case you would need a full-size to mini-DisplayPort cable.
The 8GB of HBM (high bandwidth memory) on the card is split between the two Fiji XT GPUs on the card, just like other multi-GPU options on the market. The 350 watts power draw mark is exceptionally high, exceeded only by AMD’s previous dual-GPU beast, the Radeon 295X2 that used 500+ watts and the NVIDIA GeForce GTX Titan Z that draws 375 watts!
Here is the specification breakdown of the Radeon Pro Duo. The card has 8192 total stream processors and 128 Compute Units, split evenly between the two GPUs. You are getting two full Fiji XT GPUs in this card, an impressive feat made possible in part by the use of High Bandwidth Memory and its smaller physical footprint.
|Radeon Pro Duo||R9 Nano||R9 Fury||R9 Fury X||GTX 980 Ti||TITAN X||GTX 980||R9 290X|
|GPU||Fiji XT x 2||Fiji XT||Fiji Pro||Fiji XT||GM200||GM200||GM204||Hawaii XT|
|Rated Clock||up to 1000 MHz||up to 1000 MHz||1000 MHz||1050 MHz||1000 MHz||1000 MHz||1126 MHz||1000 MHz|
|Memory||8GB (4GB x 2)||4GB||4GB||4GB||6GB||12GB||4GB||4GB|
|Memory Clock||500 MHz||500 MHz||500 MHz||500 MHz||7000 MHz||7000 MHz||7000 MHz||5000 MHz|
|Memory Interface||4096-bit (HMB) x 2||4096-bit (HBM)||4096-bit (HBM)||4096-bit (HBM)||384-bit||384-bit||256-bit||512-bit|
|Memory Bandwidth||1024 GB/s||512 GB/s||512 GB/s||512 GB/s||336 GB/s||336 GB/s||224 GB/s||320 GB/s|
|TDP||350 watts||175 watts||275 watts||275 watts||250 watts||250 watts||165 watts||290 watts|
|Peak Compute||16.38 TFLOPS||8.19 TFLOPS||7.20 TFLOPS||8.60 TFLOPS||5.63 TFLOPS||6.14 TFLOPS||4.61 TFLOPS||5.63 TFLOPS|
|Transistor Count||8.9B x 2||8.9B||8.9B||8.9B||8.0B||8.0B||5.2B||6.2B|
The Radeon Pro Duo has a rated clock speed of up to 1000 MHz. That’s the same clock speed as the R9 Fury and the rated “up to” frequency on the R9 Nano. It’s worth noting that we did see a handful of instances where the R9 Nano’s power limiting capability resulted in some extremely variable clock speeds in practice. AMD recently added a feature to its Crimson driver to disable power metering on the Nano, at the expense of more power draw, and I would assume the same option would work for the Pro Duo.
93% of a GP100 at least...
NVIDIA has announced the Tesla P100, the company's newest (and most powerful) accelerator for HPC. Based on the Pascal GP100 GPU, the Tesla P100 is built on 16nm FinFET and uses HBM2.
NVIDIA provided a comparison table, which we added what we know about a full GP100 to:
|Tesla K40||Tesla M40||Tesla P100||Full GP100|
|GPU||GK110 (Kepler)||GM200 (Maxwell)||GP100 (Pascal)||GP100 (Pascal)|
|FP32 CUDA Cores / SM||192||128||64||64|
|FP32 CUDA Cores / GPU||2880||3072||3584||3840|
|FP64 CUDA Cores / SM||64||4||32||32|
|FP64 CUDA Cores / GPU||960||96||1792||1920|
|Base Clock||745 MHz||948 MHz||1328 MHz||TBD|
|GPU Boost Clock||810/875 MHz||1114 MHz||1480 MHz||TBD|
|Memory Interface||384-bit GDDR5||384-bit GDDR5||4096-bit HBM2||4096-bit HBM2|
|Memory Size||Up to 12 GB||Up to 24 GB||16 GB||TBD|
|L2 Cache Size||1536 KB||3072 KB||4096 KB||TBD|
|Register File Size / SM||256 KB||256 KB||256 KB||256 KB|
|Register File Size / GPU||3840 KB||6144 KB||14336 KB||15360 KB|
|TDP||235 W||250 W||300 W||TBD|
|Transistors||7.1 billion||8 billion||15.3 billion||15.3 billion|
|GPU Die Size||551 mm2||601 mm2||610 mm2||610mm2|
|Manufacturing Process||28 nm||28 nm||16 nm||16nm|
This table is designed for developers that are interested in GPU compute, so a few variables (like ROPs) are still unknown, but it still gives us a huge insight into the “big Pascal” architecture. The jump to 16nm allows for about twice the number of transistors, 15.3 billion, up from 8 billion with GM200, with roughly the same die area, 610 mm2, up from 601 mm2.
A full GP100 processor will have 60 shader modules, compared to GM200's 24, although Pascal stores half of the shaders per SM. The GP100 part that is listed in the table above is actually partially disabled, cutting off four of the sixty total. This leads to 3584 single-precision (32-bit) CUDA cores, which is up from 3072 in GM200. (The full GP100 architecture will have 3840 of these FP32 CUDA cores -- but we don't know when or where we'll see that.) The base clock is also significantly higher than Maxwell, 1328 MHz versus ~1000 MHz for the Titan X and 980 Ti, although Ryan has overclocked those GPUs to ~1390 MHz with relative ease. This is interesting, because even though 10.6 TeraFLOPs is amazing, it's only about 20% more than what GM200 could pull off with an overclock.