Twelve days ago, NVIDIA announced its competitor to the AMD Radeon RX 480, the GeForce GTX 1060, based on a new Pascal GPU: GP106. Though that story was just a brief preview of the product, and a pictorial of the GTX 1060 Founders Edition card we were initially sent, it set the community ablaze with discussion around which mainstream enthusiast platform was going to be the best for gamers this summer.
Today we are allowed to show you our full review: benchmarks of the new GeForce GTX 1060 against the likes of the Radeon RX 480, the GTX 970 and GTX 980, and more. Starting at $250, the GTX 1060 has the potential to be the best bargain in the market today, though much of that will be decided based on product availability and our results on the following pages.
Does NVIDIA’s third consumer product based on Pascal make enough of an impact to dissuade gamers from buying into AMD Polaris?
All signs point to a bloody battle this July and August and the retail cards based on the GTX 1060 are making their way to our offices sooner than even those based around the RX 480. It is those cards, and not the reference/Founders Edition option, that will be the real competition that AMD has to go up against.
First, however, it’s important to find our baseline: where does the GeForce GTX 1060 find itself in the wide range of GPUs?
Through the looking glass
Futuremark has been the most consistent and most utilized benchmark company for PCs for quite a long time. While other companies have faltered and faded, Futuremark continues to push forward with new benchmarks and capabilities in an attempt to maintain a modern way to compare performance across platforms with standardized tests.
Back in March of 2015, 3DMark added support for an API Overhead test to help gamers and editors understand the performance advantages of Mantle and DirectX 12 compared to existing APIs. Though the results were purely “peak theoretical” numbers, the data helped showcase to consumers and developers what low-level APIs brought to the table.
Today Futuremark is releasing a new benchmark that focuses on DX12 gaming. No longer just a feature test, Time Spy is a fully baked benchmark with its own rendering engine and scenarios for evaluating the performance of graphics cards and platforms. It requires Windows 10 and a DX12-capable graphics card, and includes two different graphics tests and a CPU test. Oh, and of course, there is a stunningly gorgeous demo mode to go along with it.
I’m not going to spend much time here dissecting the benchmark itself, but it does make sense to have an idea of what kind of technologies are built into the game engine and tests. The engine is based purely on DX12, and integrates technologies like asynchronous compute, explicit multi-adapter and multi-threaded workloads. These are highly topical ideas and will be the focus of my testing today.
Futuremark provides an interesting diagram to demonstrate the advantages DX12 has over DX11. Below you will find a listing of the average number of vertices, triangles, patches and shader calls in 3DMark Fire Strike compared with 3DMark Time Spy.
It’s not even close here – the new Time Spy engine has more than a factor of 10 more processing calls for some of these items. As Futuremark states, however, this kind of capability isn’t free.
With DirectX 12, developers can significantly improve the multi-thread scaling and hardware utilization of their titles. But it requires a considerable amount of graphics expertise and memory-level programming skill. The programming investment is significant and must be considered from the start of a project.
Radeon Software 16.7.1 Adjustments
Last week we posted a story that looked at a problem found with the new AMD Radeon RX 480 graphics card’s power consumption. The short version of the issue was that AMD’s new Polaris 10-based reference card was drawing more power than its stated 150 watt TDP and that it was drawing more power through the motherboard PCI Express slot than the connection was rated for. And sometimes that added power draw was significant, both at stock settings and overclocked. Seeing current draw over a connection rated at just 5.5A peaking over 7A at stock settings raised an alarm (validly) and our initial report detailed the problem very specifically.
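To put those current figures in more familiar terms, here is a minimal sketch that converts amps to watts. It assumes a nominal 12.0 V rail, which is an assumption of this example; the actual voltage sags and rises slightly under load, so the wattages are approximate.

```python
# Convert the slot current figures quoted above into watts.
# Assumption: a nominal 12.0 V rail; the real voltage varies slightly under load.
NOMINAL_RAIL_VOLTS = 12.0


def watts_from_amps(amps: float, volts: float = NOMINAL_RAIL_VOLTS) -> float:
    """Electrical power for a DC rail: P = V * I."""
    return volts * amps


rated_watts = watts_from_amps(5.5)  # slot rating quoted in the story
peak_watts = watts_from_amps(7.0)   # approximate stock peak quoted in the story
overage_pct = (peak_watts / rated_watts - 1.0) * 100.0

print(f"rated: {rated_watts:.0f} W, peak: {peak_watts:.0f} W, "
      f"overage: {overage_pct:.0f}%")
```

Under that assumption, a 7A peak is roughly a quarter above what the 5.5A rating allows on the +12V pins.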
AMD responded initially that “everything was fine here” but the company eventually saw the writing on the wall and started to work on potential solutions. The Radeon RX 480 is a very important product for the future of Radeon graphics and this was a launch that needed to be as perfect as it could be. Though the risk to users’ hardware with the higher than expected current draw is muted somewhat by motherboard-based over-current protection, it’s crazy to think that AMD actually believed that was the ideal scenario. Depending on the “circuit breaker” in any system to save you when standards exist for exactly that purpose is nuts.
Today AMD has released a new driver, version 16.7.1, that actually introduces a pair of fixes for the problem. One of them is hard coded into the software and adjusts power draw from the different +12V sources (PCI Express slot and 6-pin connector) while the other is an optional flag in the software that is disabled by default.
Reconfiguring the power phase controller
The Radeon RX 480 uses a very common power controller (IR3567B) on its PCB to cycle through the 6 power phases providing electricity to the GPU itself. Allyn did some simple multimeter trace work to tell us which phases were connected to which sources and the result is seen below.
The power controller is responsible for pacing the power coming in from the PCI Express slot and the 6-pin power connection to the GPU, in phases. Phases 1-3 come in from the power supply via the 6-pin connection, while phases 4-6 source power from the motherboard directly. At launch, the RX 480 drew nearly identical amounts of power from both the PEG slot and the 6-pin connection, essentially giving each of the 6 phases at work equal time.
That might seem okay, but it’s far from the standard of what we have seen in the past. In no other case have we measured a graphics card drawing as much power from the PEG slot as from an external power connector on the card. (Obviously for cards without external power connections, that’s a different discussion.) In general, with other AMD and NVIDIA based graphics cards, the motherboard slot would provide no more than 50-60 watts of power, while any above that would come from the 6/8-pin connections on the card. In many cases I saw that power draw through the PEG slot was as low as 20-30 watts if the external power connections provided a lot of overage for the target TDP of the product.
It’s probably not going to come as a surprise to anyone that reads the internet, but NVIDIA is officially taking the covers off its latest GeForce card in the Pascal family today, the GeForce GTX 1060. As the number scheme would suggest, this is a more budget-friendly version of NVIDIA’s latest architecture, lowering performance in line with expectations. The GP106-based GPU will still offer impressive specifications and capabilities and will probably push AMD’s new Radeon RX 480 to its limits.
Let’s take a quick look at the card’s details.
|GTX 1060||RX 480||R9 390||R9 380||GTX 980||GTX 970||GTX 960||R9 Nano||GTX 1070|
|GPU||GP106||Polaris 10||Grenada||Tonga||GM204||GM204||GM206||Fiji XT||GP104|
|Rated Clock||1506 MHz||1120 MHz||1000 MHz||970 MHz||1126 MHz||1050 MHz||1126 MHz||up to 1000 MHz||1506 MHz|
|Texture Units||80 (?)||144||160||112||128||104||64||256||120|
|ROP Units||48 (?)||32||64||32||64||56||32||64||64|
|Memory Clock||8000 MHz||7000 MHz||6000 MHz||5700 MHz||7000 MHz||7000 MHz||7000 MHz||500 MHz||8000 MHz|
|Memory Interface||192-bit||256-bit||512-bit||256-bit||256-bit||256-bit||128-bit||4096-bit (HBM)||256-bit|
|Memory Bandwidth||192 GB/s||224 GB/s||384 GB/s||182.4 GB/s||224 GB/s||196 GB/s||112 GB/s||512 GB/s||256 GB/s|
|TDP||120 watts||150 watts||275 watts||190 watts||165 watts||145 watts||120 watts||275 watts||150 watts|
|Peak Compute||3.85 TFLOPS||5.1 TFLOPS||5.1 TFLOPS||3.48 TFLOPS||4.61 TFLOPS||3.4 TFLOPS||2.3 TFLOPS||8.19 TFLOPS||5.7 TFLOPS|
The GeForce GTX 1060 will sport 1280 CUDA cores with a GPU Boost clock speed rated at 1.7 GHz. The card will be available only in 6GB varieties, and the reference / Founders Edition will ship with 6GB of GDDR5 memory running at 8.0 GHz / 8 Gbps. With 1280 CUDA cores, the GP106 GPU is essentially one half of a GP104 in terms of compute capability. NVIDIA decided not to cut the memory interface in half though, instead going with a 192-bit design compared to the GP104 and its 256-bit option.
The rated GPU clock speeds paint an interesting picture for peak performance of the new card. At the rated boost clock speed, the GeForce GTX 1070 produces 6.46 TFLOPS of performance. The GTX 1060 by comparison will hit 4.35 TFLOPS, a 48% difference. The GTX 1080 offers nearly the same delta of performance above the GTX 1070; clearly NVIDIA has set the scale for Pascal and the gaps between its products.
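The peak compute figures in this story can be reproduced with simple arithmetic. The sketch below uses the usual convention that each shader core retires one fused multiply-add (counted as 2 floating point operations) per clock. The function name is made up for this example, and the GTX 1070 boost figure is taken from the text rather than recomputed.

```python
def peak_tflops(shader_cores: int, clock_ghz: float) -> float:
    """Peak FP32 rate, assuming one FMA (2 FLOPs) per core per clock."""
    return shader_cores * 2 * clock_ghz / 1000.0


# GTX 1060 figures quoted in the story: 1280 CUDA cores,
# 1506 MHz rated clock, 1.7 GHz GPU Boost clock.
gtx1060_base = peak_tflops(1280, 1.506)   # ~3.855 (the table lists 3.85)
gtx1060_boost = peak_tflops(1280, 1.7)    # ~4.35

# GTX 1070 boost figure as stated in the text, not recomputed here.
GTX1070_BOOST_TFLOPS = 6.46
gap_pct = (GTX1070_BOOST_TFLOPS / gtx1060_boost - 1.0) * 100.0

print(f"GTX 1060 base:  {gtx1060_base:.3f} TFLOPS")
print(f"GTX 1060 boost: {gtx1060_boost:.3f} TFLOPS")
print(f"GTX 1070 lead at boost: {gap_pct:.1f}%")
```

These are peak theoretical numbers only; they say nothing about how efficiently a given architecture turns that throughput into frame rates.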
NVIDIA wants us to compare the new GeForce GTX 1060 to the GeForce GTX 980 in gaming performance, but the peak theoretical performance results don’t really match up. The GeForce GTX 980 is rated at 4.61 TFLOPS at BASE clock speed, while the GTX 1060 doesn’t hit that number at its Boost clock. Obviously Pascal improves on performance with memory compression advancements, but the 192-bit memory bus is only able to run at 192 GB/s, compared to the 224 GB/s of the GTX 980. Obviously we’ll have to wait for performance results from our own testing to be sure, but it seems possible that NVIDIA’s performance claims might depend on technology like Simultaneous Multi-Projection and VR gaming to be validated.
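The bandwidth numbers follow directly from bus width and per-pin data rate. As a rough sketch (the helper name is hypothetical, and the 7 Gbps rate for the GTX 980 comes from the 7000 MHz memory clock in the table above):

```python
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s: bus width (bits) x per-pin rate (Gbps) / 8."""
    return bus_width_bits * data_rate_gbps / 8.0


gtx1060_bw = memory_bandwidth_gbs(192, 8.0)  # 192-bit bus at 8 Gbps
gtx980_bw = memory_bandwidth_gbs(256, 7.0)   # 256-bit bus at 7 Gbps
deficit_pct = (1.0 - gtx1060_bw / gtx980_bw) * 100.0

print(f"GTX 1060: {gtx1060_bw:.0f} GB/s, GTX 980: {gtx980_bw:.0f} GB/s")
print(f"GTX 1060 raw bandwidth deficit: {deficit_pct:.1f}%")
```

The faster memory on the GTX 1060 recovers part of what the narrower bus gives up, but not all of it, which is why memory compression matters for the comparison.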
Too much power to the people?
UPDATE (7/1/16): I have added a third page to this story that looks at the power consumption and power draw of the ASUS GeForce GTX 960 Strix card. This card was pointed out by many readers on our site and on reddit as having the same problem as the Radeon RX 480. As it turns out...not so much. Check it out!
UPDATE 2 (7/2/16): We have an official statement from AMD this morning.
As you know, we continuously tune our GPUs in order to maximize their performance within their given power envelopes and the speed of the memory interface, which in this case is an unprecedented 8Gbps for GDDR5. Recently, we identified select scenarios where the tuning of some RX 480 boards was not optimal. Fortunately, we can adjust the GPU's tuning via software in order to resolve this issue. We are already testing a driver that implements a fix, and we will provide an update to the community on our progress on Tuesday (July 5, 2016).
Honestly, that doesn't tell us much. And AMD appears to be deflecting slightly by using words like "some RX 480 boards". I don't believe this is limited to a subset of cards, or review samples only. AMD does indicate that the 8 Gbps memory on the 8GB variant might be partially to blame - which is an interesting correlation to test out later. The company does promise a fix for the problem via a driver update on Tuesday - we'll be sure to give that a test and see what changes are measured in both performance and in power consumption.
The launch of the AMD Radeon RX 480 has generally been considered a success. Our review of the new reference card shows impressive gains in architectural efficiency, improved positioning against NVIDIA’s competing parts in the same price range as well as VR-ready gaming performance starting at $199 for the 4GB model. AMD has every right to be proud of the new product and should have this lone position until the GeForce product line brings a Pascal card down into the same price category.
If you read carefully through my review, there was some interesting data that cropped up around the power consumption and delivery on the new RX 480. Looking at our power consumption numbers, measured directly from the card, not from the wall, it was using slightly more than the 150 watt TDP it was advertised as. This was done at 1920x1080 and tested in both Rise of the Tomb Raider and The Witcher 3.
When overclocked, the results were even higher, approaching the 200 watt mark in Rise of the Tomb Raider!
A portion of the review over at Tom’s Hardware produced similar results but detailed the power consumption from the motherboard PCI Express connection versus the power provided by the 6-pin PCIe power cable. There has been a considerable amount of discussion in the community about the amount of power the RX 480 draws through the motherboard, whether it is out of spec and what kind of impact it might have on the stability or life of the PC the RX 480 is installed in.
As it turns out, we have the ability to measure the exact same kind of data, albeit through a different method than Tom’s, and wanted to see if the result we saw broke down in the same way.
Our Testing Methods
This is a complex topic so it makes sense to detail the methodology of our advanced power testing capability up front.
How do we do it? The approach is simple in theory but surprisingly difficult in practice: we are intercepting the power being sent through the PCI Express bus as well as the ATX power connectors before they go to the graphics card and are directly measuring power draw with a 10 kHz DAQ (data acquisition) device. A huge thanks goes to Allyn for getting the setup up and running. We built a PCI Express bridge that is tapped to measure both 12V and 3.3V power and built some Corsair power cables that measure the 12V coming through those as well.
The result is data that looks like this.
What you are looking at here is the power measured from the GTX 1080. From time 0 to time 8 seconds or so, the system is idle; from 8 seconds to about 18 seconds, Steam is starting up the title. From 18-26 seconds the game is at the menus, we load the game from 26-39 seconds, and then we play through our benchmark run after that.
There are four lines drawn in the graph, the 12V and 3.3V results are from the PCI Express bus interface, while the one labeled PCIE is from the PCIE power connection from the power supply to the card. We have the ability to measure two power inputs there but because the GTX 1080 only uses a single 8-pin connector, there is only one shown here. Finally, the blue line is labeled total and is simply that: a total of the other measurements to get combined power draw and usage by the graphics card in question.
From this we can see a couple of interesting data points. First, the idle power of the GTX 1080 Founders Edition is only about 7.5 watts. Second, under a gaming load of Rise of the Tomb Raider, the card is pulling about 165-170 watts on average, though there are plenty of intermittent spikes. Keep in mind we are sampling the power at 1000/s so this kind of behavior is more or less expected.
Different games and applications impose different loads on the GPU and can cause it to draw drastically different power. Even if a game runs slowly, it may not be drawing maximum power from the card if a certain system on the GPU (memory, shaders, ROPs) is bottlenecking other systems.
One interesting note on our data compared to what Tom’s Hardware presents – we are using a second order low pass filter to smooth out the data to make it more readable and more indicative of how power draw is handled by the components on the PCB. Tom’s story reported “maximum” power draw at 300 watts for the RX 480 and while that is technically accurate, those figures represent instantaneous power draw. That is interesting data in some circumstances, and may actually indicate other potential issues with excessively noisy power circuitry, but to us, it makes more sense to sample data at a high rate (10 kHz) but to filter it and present it in a more readable way that better meshes with the continuous power delivery capabilities of the system.
Image source: E2E Texas Instruments
An example of instantaneous voltage spikes on power supply phase changes
Some gamers have expressed concern over that “maximum” power draw of 300 watts on the RX 480 that Tom’s Hardware reported. While that power measurement is technically accurate, it doesn’t represent the continuous power draw of the hardware. Instead, that measure is a result of a high frequency data acquisition system that may take a reading at the exact moment that a power phase on the card switches. Any DC switching power supply that is riding close to a certain power level is going to exceed that on the leading edges of phase switches for some minute amount of time. This is another reason why our low pass filter on power data can help represent real-world power consumption accurately. That doesn’t mean the spikes they measure are not a potential cause for concern; that’s just not what we are focused on with our testing.
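For readers curious what a second order low pass filter does to spiky power data, here is a minimal sketch. It is not our actual processing code: the Butterworth response, the 100 Hz cutoff and the synthetic 150 watt trace with 300 watt single-sample spikes are all assumptions made up for this illustration. Only the 10 kHz sample rate comes from the text above.

```python
import math


def butterworth_lowpass(samples, sample_rate_hz, cutoff_hz):
    """Second-order (biquad) Butterworth low-pass filter.

    Coefficients follow the widely used bilinear-transform biquad
    formulas with Q = 1/sqrt(2).
    """
    if not samples:
        return []
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate_hz
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)  # sin(w0) / (2 * Q)
    a0 = 1.0 + alpha
    b0 = (1.0 - cos_w0) / 2.0 / a0
    b1 = (1.0 - cos_w0) / a0
    b2 = b0
    a1 = -2.0 * cos_w0 / a0
    a2 = (1.0 - alpha) / a0

    # Start "settled" at the first sample so the output does not ramp from 0.
    x1 = x2 = y1 = y2 = samples[0]
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


# Synthetic trace (made-up data): steady 150 W with a one-sample
# 300 W spike every 100 samples, i.e. every 10 ms at 10 kHz.
raw = [150.0] * 1000
for i in range(50, 1000, 100):
    raw[i] = 300.0

filtered = butterworth_lowpass(raw, sample_rate_hz=10_000, cutoff_hz=100)

print(f"raw max: {max(raw):.1f} W, filtered max: {max(filtered):.1f} W")
print(f"raw mean: {sum(raw) / len(raw):.2f} W, "
      f"filtered mean: {sum(filtered) / len(filtered):.2f} W")
```

The point of the exercise: the filter has unity gain at DC, so the average power is preserved almost exactly, while an isolated one-sample spike barely moves the filtered trace. That is the difference between an instantaneous "maximum" reading and the continuous draw the power delivery hardware has to sustain.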
Polaris 10 Specifications
It would be hard at this point to NOT know about the Radeon RX 480 graphics card. AMD and the Radeon Technologies Group have been talking publicly about the Polaris architecture since December of 2015 with lofty ambitions. In the precarious position in which the company rests, being well behind in market share and struggling to compete with the dominant player in the market (NVIDIA), the team was willing to sacrifice sales of current generation parts (300-series) in order to excite the user base for the upcoming move to Polaris. It is a risky bet and one that will play out over the next few months in the market.
Since then AMD continued to release bits of information at a time. First there were details on the new display support, then information about the 14nm process technology advantages. We then saw demos of working silicon at CES with targeted form factors and then, at events in Macau, AMD showed press the full details and architecture. At Computex they announced rough performance metrics and a price point. Finally, at E3, AMD discussed the RX 460 and RX 470 cousins and the release date of…today. It’s been quite a whirlwind.
Today the rubber meets the road: is the Radeon RX 480 the groundbreaking and stunning graphics card that we have been promised? Or does it struggle again to keep up with the behemoth that is NVIDIA’s GeForce product line? AMD’s marketing team would have you believe that the RX 480 is the start of some kind of graphics revolution – but will the coup be successful?
Join us for our second major graphics architecture release of the summer and learn for yourself if the Radeon RX 480 is your next GPU.
AMD gets aggressive
At its Computex 2016 press conference in Taipei today, AMD has announced the branding and pricing, along with basic specifications, for one of its upcoming Polaris GPUs shipping later this June. The Radeon RX 480, based on Polaris 10, will cost just $199 and will offer more than 5 TFLOPS of compute capability. This is an incredibly aggressive move obviously aimed at continuing to gain market share at NVIDIA's expense. Details of the product are listed below.
|RX 480||GTX 1070||GTX 980||GTX 970||R9 Fury||R9 Nano||R9 390X||R9 390|
|GPU||Polaris 10||GP104||GM204||GM204||Fiji Pro||Fiji XT||Hawaii XT||Grenada Pro|
|Rated Clock||?||1506 MHz||1126 MHz||1050 MHz||1000 MHz||up to 1000 MHz||1050 MHz||1000 MHz|
|Memory Clock||8000 MHz||8000 MHz||7000 MHz||7000 MHz||500 MHz||500 MHz||6000 MHz||6000 MHz|
|Memory Interface||256-bit||256-bit||256-bit||256-bit||4096-bit (HBM)||4096-bit (HBM)||512-bit||512-bit|
|Memory Bandwidth||256 GB/s||256 GB/s||224 GB/s||196 GB/s||512 GB/s||512 GB/s||384 GB/s||384 GB/s|
|TDP||150 watts||150 watts||165 watts||145 watts||275 watts||175 watts||275 watts||230 watts|
|Peak Compute||> 5.0 TFLOPS||5.7 TFLOPS||4.61 TFLOPS||3.4 TFLOPS||7.20 TFLOPS||8.19 TFLOPS||5.63 TFLOPS||5.12 TFLOPS|
The RX 480 will ship with 36 CUs totaling 2304 stream processors based on the current GCN breakdown of 64 stream processors per CU. AMD didn't list clock speeds and instead is only telling us that the performance offered will exceed 5 TFLOPS of compute; how much is still a mystery and will likely change based on final clocks.
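The "more than 5 TFLOPS" claim does put a floor under the clock speed. A quick back-of-the-envelope sketch, assuming the usual 2 floating point operations (one fused multiply-add) per stream processor per clock; the helper name is made up for this example:

```python
COMPUTE_UNITS = 36
STREAM_PROCESSORS_PER_CU = 64  # current GCN breakdown, per the text above

stream_processors = COMPUTE_UNITS * STREAM_PROCESSORS_PER_CU


def min_clock_mhz(target_tflops: float, shader_count: int) -> float:
    """Lowest clock (MHz) that reaches target_tflops at 2 FLOPs/clock/shader."""
    return target_tflops * 1e12 / (shader_count * 2) / 1e6


floor_mhz = min_clock_mhz(5.0, stream_processors)

print(f"{stream_processors} stream processors")
print(f"clock needed for 5.0 TFLOPS: about {floor_mhz:.0f} MHz")
```

Anything above that floor pushes the rating past 5 TFLOPS, so the final number depends entirely on where AMD sets shipping clocks.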
The memory system is powered by a 256-bit GDDR5 memory controller running at 8 Gbps and hitting 256 GB/s of throughput. This is the same resulting memory bandwidth as NVIDIA's new GeForce GTX 1070 graphics card.
AMD also tells us that the TDP of the card is 150 watts, again matching the GTX 1070, though without more accurate performance data it's hard to assume anything about the new architectural efficiency of the Polaris GPUs built on the 14nm GlobalFoundries process.
Obviously the card will support FreeSync and all of AMD's VR features, in addition to being DP 1.3 and 1.4 ready.
AMD stated that the RX 480 will launch on June 29th.
I know that many of you will want us to start guessing at what performance level the new RX 480 will actually fall, and trust me, I've been trying to figure it out. Based on TFLOPS rating and memory bandwidth alone, it seems possible that the RX 480 could compete with the GTX 1070. But if that were the case, I don't think even AMD is crazy enough to set the price this far below where the GTX 1070 launched, $379.
I would expect the configuration of the GCN architecture to remain mostly unchanged on Polaris, compared to Hawaii, for the same reasons that we saw NVIDIA leave Pascal's basic compute architecture unchanged compared to Maxwell. Moving to the new process node was the primary goal and adding to that with drastic shifts in compute design might overly complicate product development.
In the past, we have observed that AMD's GCN architecture tends to operate slightly less efficiently in terms of rated maximum compute capability versus realized gaming performance, at least compared to Maxwell and now Pascal. With that in mind, the >5 TFLOPS offered by the RX 480 likely lies somewhere between the Radeon R9 390 and R9 390X in realized gaming output. If that is the case, the Radeon RX 480 should have performance somewhere between the GeForce GTX 970 and the GeForce GTX 980.
AMD claims that the RX 480 at $199 is set to offer a "premium VR experience" that has previously been limited to $500 graphics cards (another reference to the original price of the GTX 980 perhaps...). AMD claims this should have a dramatic impact on increasing the TAM (total addressable market) for VR.
In a notable market survey, price was a leading barrier to adoption of VR. The $199 SEP for select Radeon™ RX Series GPUs is an integral part of AMD’s strategy to dramatically accelerate VR adoption and unleash the VR software ecosystem. AMD expects that its aggressive pricing will jumpstart the growth of the addressable market for PC VR and accelerate the rate at which VR headsets drop in price:
- More affordable VR-ready desktops and notebooks
- Making VR accessible to consumers in retail
- Unleashing VR developers on a larger audience
- Reducing the cost of entry to VR
AMD calls this strategy of starting with the mid-range product its "Water Drop" strategy, aimed "at releasing new graphics architectures in high volume segments first to support continued market share growth for Radeon GPUs."
So what do you guys think? Are you impressed with what Polaris looks like it's going to be now?
GP104 Strikes Again
It’s only been three weeks since NVIDIA unveiled the GeForce GTX 1080 and GTX 1070 graphics cards at a live streaming event in Austin, TX. But it feels like those two GPUs, one of which hasn't even been reviewed until today, have already drastically shifted the landscape of graphics, VR and PC gaming.
Half of the “new GPU” stories are told, with AMD due to follow up soon with Polaris, but it was clear to anyone watching the enthusiast segment with a hint of history that a line was drawn in the sand that day. There is THEN, and there is NOW. Today’s detailed review of the GeForce GTX 1070 completes NVIDIA’s first wave of NOW products, following closely behind the GeForce GTX 1080.
Interestingly, and in a move that is very uncharacteristic of NVIDIA, detailed specifications of the GeForce GTX 1070 were released on GeForce.com well before today’s reviews. With information on the CUDA core count, clock speeds, and memory bandwidth it was possible to get a solid sense of where the GTX 1070 performed; and I imagine that many of you already did the napkin math to figure that out. There is no more guessing though - reviews and testing are all done, and I think you'll find that the GTX 1070 is as exciting, if not more so, than the GTX 1080 due to the performance and pricing combination that it provides.
Let’s dive in.
First, Some Background
NVIDIA's Rumored GP102
When GP100 was announced, Josh and I were discussing, internally, how it would make sense in the gaming industry. Recently, an article on WCCFTech cited anonymous sources, which should always be taken with a dash of salt, that claimed NVIDIA was planning a second architecture, GP102, between GP104 and GP100. As I was writing this editorial about it, relating it to our own speculation about the physics of Pascal, VideoCardz claims to have been contacted by the developers of AIDA64, seemingly on-the-record, also citing a GP102 design.
I will retell chunks of the rumor, but also add my opinion to it.
In the last few generations, each architecture had a flagship chip that was released in both gaming and professional SKUs. Neither audience had access to a chip that was larger than the other's largest of that generation. Clock rates and disabled portions varied by specific product, with gaming usually getting the more aggressive performance for slightly better benchmarks. Fermi had GF100/GF110, Kepler had GK110/GK210, and Maxwell had GM200. Each of these was available in Tesla, Quadro, and GeForce cards, especially Titans.
Maxwell was interesting, though. NVIDIA was unable to leave 28nm, which Kepler launched on, so they created a second architecture at that node. To increase performance without having access to more feature density, you need to make your designs bigger, more optimized, or more simple. GM200 was giant and optimized, but, to get the performance levels it achieved, also needed to be more simple. Something needed to go, and double-precision (FP64) performance was the big omission. NVIDIA was upfront about it at the Titan X launch, and told their GPU compute customers to keep purchasing Kepler if they valued FP64.
A new architecture with GP104
Table of Contents
- Asynchronous compute discussion
- Is only 2-Way SLI supported?
- Overclocking over 2.0 GHz
- Dissecting the Founders Edition
- Benchmarks begin
- VR Testing
- Impressive power efficiency
- Performance per dollar discussion
- Ansel screenshot tool
The summer of change for GPUs has begun with today’s review of the GeForce GTX 1080. NVIDIA has endured leaks, speculation and criticism for months now, with enthusiasts calling out NVIDIA for not including HBM technology or for not having asynchronous compute capability. Last week NVIDIA’s CEO Jen-Hsun Huang went on stage and officially announced the GTX 1080 and GTX 1070 graphics cards with a healthy amount of information about their supposed performance and price points. Issues around cost and what exactly a Founders Edition is aside, the event was well received and clearly showed a performance and efficiency improvement that we were not expecting.
The question is, does the actual product live up to the hype? Can NVIDIA overcome some users’ negative view of the Founders Edition to create a product message that gives the wide range of PC gamers looking for an upgrade path an option they’ll take?
I’ll let you know through the course of this review, but what I can tell you definitively is that the GeForce GTX 1080 clearly sits alone at the top of the GPU world.
NVIDIA's Ansel Technology
“In-game photography” is an interesting concept. Not too long ago, it was difficult to just capture the user's direct experience with a title. Print screen could only hold a single screenshot at a time, which allowed Steam and FRAPS to provide a better user experience. FRAPS also made video more accessible to the end-user, but it output huge files and, while it wasn't too expensive, it needed to be purchased online, which was a big issue ten-or-so years ago.
Seeing that their audience would enjoy video captures, NVIDIA introduced ShadowPlay a couple of years ago. The feature allowed users not only to record video, but also to capture the last few minutes. It did this with hardware acceleration, and it did this for free (for compatible GPUs). While I don't use ShadowPlay, preferring the control of OBS, it's a good example of how NVIDIA wants to support their users. They see these features as a value-add, which draws people to their hardware.
History and Specifications
The Radeon Pro Duo had an interesting history. Originally shown as an unbranded, dual-GPU PCB during E3 2015, which took place last June, AMD touted it as the ultimate graphics card for both gamers and professionals. At that time, the company thought that an October launch was feasible, but that clearly didn’t work out. When pressed for information in the Oct/Nov timeframe, AMD said that they had delayed the product into Q2 2016 to better correlate with the launch of the VR systems from Oculus and HTC/Valve.
During a GDC press event in March, AMD finally unveiled the Radeon Pro Duo brand, but they were also walking back on the idea of the dual-Fiji beast being aimed at the gaming crowd, even partially. Instead, the company talked up the benefits for game developers and content creators, such as its 8192 stream processors for offline rendering, or even to aid game devs in the implementation and improvement of multi-GPU for upcoming games.
Anyone that pays attention to the graphics card market can see why AMD would make the positional shift with the Radeon Pro Duo. The Fiji architecture is on the way out, with Polaris due out in June by AMD’s own proclamation. At $1500, the Radeon Pro Duo will be a stark contrast to the prices of the Polaris GPUs this summer, and it is priced well above any NVIDIA part in the GeForce line. And, though CrossFire has made drastic improvements over the last several years thanks to new testing techniques, the ecosystem for multi-GPU is going through a major shift with both DX12 and VR bearing down on it.
So yes, the Radeon Pro Duo has both RADEON and PRO right there in the name. What’s a respectable PC Perspective graphics reviewer supposed to do with a card like that if it finds its way into your office? Test it of course! I’ll take a look at a handful of recent games as well as a new feature that AMD has integrated with 3DS Max called FireRender to showcase some of the professional chops of the new card.
The Dual-Fiji Card Finally Arrives
This weekend, leaks of information on both WCCFTech and VideoCardz.com have revealed all the information about the pending release of AMD’s dual-GPU giant, the Radeon Pro Duo. While no one at PC Perspective has been briefed on the product officially, all of the interesting data surrounding the product is clearly outlined in the slides on those websites, minus some independent benchmark testing that we are hoping to get to next week. Based on the report from both sites, the Radeon Pro Duo will be released on April 26th.
AMD actually revealed the product and branding for the Radeon Pro Duo back in March, during its live streamed Capsaicin event surrounding GDC. At that point we were given the following information:
- Dual Fiji XT GPUs
- 8GB of total HBM memory
- 4x DisplayPort (this has since been modified)
- 16 TFLOPS of compute
- $1499 price tag
The design of the card follows the same industrial design as the reference designs of the Radeon Fury X, and integrates a dual-pump cooler and external fan/radiator to keep both GPUs running cool.
Based on the slides leaked out today, AMD has revised the Radeon Pro Duo design to include a set of three DisplayPort connections and one HDMI port. This was a necessary change as the Oculus Rift requires an HDMI port to work; only the HTC Vive has built-in support for a DisplayPort connection and even in that case you would need a full-size to mini-DisplayPort cable.
The 8GB of HBM (high bandwidth memory) on the card is split between the two Fiji XT GPUs on the card, just like other multi-GPU options on the market. The 350 watt power draw is exceptionally high, exceeded only by AMD’s previous dual-GPU beast, the Radeon 295X2, which used 500+ watts, and the NVIDIA GeForce GTX Titan Z, which draws 375 watts!
Here is the specification breakdown of the Radeon Pro Duo. The card has 8192 total stream processors and 128 Compute Units, split evenly between the two GPUs. You are getting two full Fiji XT GPUs in this card, an impressive feat made possible in part by the use of High Bandwidth Memory and its smaller physical footprint.
| | Radeon Pro Duo | R9 Nano | R9 Fury | R9 Fury X | GTX 980 Ti | TITAN X | GTX 980 | R9 290X |
|---|---|---|---|---|---|---|---|---|
| GPU | Fiji XT x 2 | Fiji XT | Fiji Pro | Fiji XT | GM200 | GM200 | GM204 | Hawaii XT |
| Rated Clock | up to 1000 MHz | up to 1000 MHz | 1000 MHz | 1050 MHz | 1000 MHz | 1000 MHz | 1126 MHz | 1000 MHz |
| Memory | 8GB (4GB x 2) | 4GB | 4GB | 4GB | 6GB | 12GB | 4GB | 4GB |
| Memory Clock | 500 MHz | 500 MHz | 500 MHz | 500 MHz | 7000 MHz | 7000 MHz | 7000 MHz | 5000 MHz |
| Memory Interface | 4096-bit (HBM) x 2 | 4096-bit (HBM) | 4096-bit (HBM) | 4096-bit (HBM) | 384-bit | 384-bit | 256-bit | 512-bit |
| Memory Bandwidth | 1024 GB/s | 512 GB/s | 512 GB/s | 512 GB/s | 336 GB/s | 336 GB/s | 224 GB/s | 320 GB/s |
| TDP | 350 watts | 175 watts | 275 watts | 275 watts | 250 watts | 250 watts | 165 watts | 290 watts |
| Peak Compute | 16.38 TFLOPS | 8.19 TFLOPS | 7.20 TFLOPS | 8.60 TFLOPS | 5.63 TFLOPS | 6.14 TFLOPS | 4.61 TFLOPS | 5.63 TFLOPS |
| Transistor Count | 8.9B x 2 | 8.9B | 8.9B | 8.9B | 8.0B | 8.0B | 5.2B | 6.2B |
The Radeon Pro Duo has a rated clock speed of up to 1000 MHz. That’s the same clock speed as the R9 Fury and the rated “up to” frequency on the R9 Nano. It’s worth noting that we did see a handful of instances where the R9 Nano’s power limiting capability resulted in some extremely variable clock speeds in practice. AMD recently added a feature to its Crimson driver to disable power metering on the Nano, at the expense of more power draw, and I would assume the same option would work for the Pro Duo.
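For readers who want to check the headline numbers themselves, the peak compute and memory bandwidth figures quoted for the Pro Duo can be reproduced with a little arithmetic. This is a rough sketch, not anything AMD supplied: it assumes two floating point operations per stream processor per clock (one fused multiply-add), assumes HBM moves data on both clock edges, and treats the 7000 MHz GDDR5 figure as an effective data rate rather than a physical clock. The function names are my own.

```python
# Back-of-the-envelope check of the spec figures quoted for these cards.
# Assumptions (not stated in the article): 2 FLOPs per stream processor per
# clock, double data rate HBM, and GDDR5 clocks quoted as effective rates.

def peak_tflops(stream_processors: int, clock_mhz: float) -> float:
    """Theoretical single-precision throughput in TFLOPS."""
    return stream_processors * 2 * clock_mhz * 1e6 / 1e12


def bandwidth_gbs(bus_width_bits: int, data_rate_mtps: float) -> float:
    """Theoretical bandwidth in GB/s; data rate is mega-transfers/s per pin."""
    return bus_width_bits / 8 * data_rate_mtps * 1e6 / 1e9


# Radeon Pro Duo: 8192 stream processors in total, rated at up to 1000 MHz.
pro_duo_tflops = peak_tflops(8192, 1000)

# One Fiji GPU: 4096-bit HBM at 500 MHz, two transfers per clock.
fiji_bandwidth = bandwidth_gbs(4096, 500 * 2)

# GTX 980 Ti: 384-bit GDDR5 at an effective 7000 MT/s.
gtx_980_ti_bandwidth = bandwidth_gbs(384, 7000)

print(f"Pro Duo peak compute: {pro_duo_tflops:.2f} TFLOPS")
print(f"Fiji bandwidth: {fiji_bandwidth:.0f} GB/s per GPU, "
      f"{2 * fiji_bandwidth:.0f} GB/s for the pair")
print(f"GTX 980 Ti bandwidth: {gtx_980_ti_bandwidth:.0f} GB/s")
```

Keep in mind that these are theoretical peaks. The "up to" wording on the clock speed means the real compute number will move with whatever frequency the card actually sustains under load.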
93% of a GP100 at least...
NVIDIA has announced the Tesla P100, the company's newest (and most powerful) accelerator for HPC. Based on the Pascal GP100 GPU, the Tesla P100 is built on 16nm FinFET and uses HBM2.
NVIDIA provided a comparison table, to which we added what we know about a full GP100:

| | Tesla K40 | Tesla M40 | Tesla P100 | Full GP100 |
|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) |
| FP32 CUDA Cores / SM | 192 | 128 | 64 | 64 |
| FP32 CUDA Cores / GPU | 2880 | 3072 | 3584 | 3840 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1920 |
| Base Clock | 745 MHz | 948 MHz | 1328 MHz | TBD |
| GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz | TBD |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size | Up to 12 GB | Up to 24 GB | 16 GB | TBD |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | TBD |
| Register File Size / SM | 256 KB | 256 KB | 256 KB | 256 KB |
| Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB | 15360 KB |
| TDP | 235 W | 250 W | 300 W | TBD |
| Transistors | 7.1 billion | 8 billion | 15.3 billion | 15.3 billion |
| GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 |
| Manufacturing Process | 28 nm | 28 nm | 16 nm | 16 nm |
This table is designed for developers that are interested in GPU compute, so a few variables (like ROPs) are still unknown, but it still gives us a huge insight into the “big Pascal” architecture. The jump to 16nm allows for about twice the number of transistors, 15.3 billion, up from 8 billion with GM200, with roughly the same die area, 610 mm2, up from 601 mm2.
A full GP100 processor will have 60 streaming multiprocessors (SMs), compared to GM200's 24, although Pascal packs half as many FP32 shaders into each SM (64 versus 128). The GP100 part that is listed in the table above is actually partially disabled, cutting off four of the sixty total. This leads to 3584 single-precision (32-bit) CUDA cores, which is up from 3072 in GM200. (The full GP100 architecture will have 3840 of these FP32 CUDA cores -- but we don't know when or where we'll see that.) The base clock is also significantly higher than Maxwell, 1328 MHz versus ~1000 MHz for the Titan X and 980 Ti, although Ryan has overclocked those GPUs to ~1390 MHz with relative ease. This is interesting, because even though 10.6 TeraFLOPs is amazing, it's only about 20% more than what GM200 could pull off with an overclock.
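The arithmetic behind those figures is simple enough to sketch out. The snippet below is only an illustration with names of my own choosing. It assumes two floating point operations per CUDA core per clock (one fused multiply-add) and uses the Tesla P100's 1480 MHz boost clock for the peak figure.

```python
# Rough check of the GP100 numbers discussed above.
# Assumption: 2 FLOPs per FP32 CUDA core per clock (one fused multiply-add).

FP32_CORES_PER_SM = 64  # Pascal GP100, per NVIDIA's comparison table


def fp32_cores(sm_count: int) -> int:
    """FP32 CUDA cores for a GP100 with the given number of enabled SMs."""
    return sm_count * FP32_CORES_PER_SM


def peak_tflops(cores: int, clock_mhz: float) -> float:
    """Theoretical single-precision throughput in TFLOPS."""
    return cores * 2 * clock_mhz * 1e6 / 1e12


tesla_p100_cores = fp32_cores(60 - 4)  # four of the sixty SMs disabled
full_gp100_cores = fp32_cores(60)
enabled_fraction = tesla_p100_cores / full_gp100_cores

p100_peak = peak_tflops(tesla_p100_cores, 1480)   # P100 boost clock
gm200_overclocked = peak_tflops(3072, 1390)       # GM200 at ~1390 MHz

print(f"Tesla P100 cores: {tesla_p100_cores} of {full_gp100_cores} "
      f"({enabled_fraction:.1%} of a full GP100)")
print(f"Tesla P100 peak: {p100_peak:.1f} TFLOPS")
print(f"Overclocked GM200 peak: {gm200_overclocked:.1f} TFLOPS")
print(f"P100 advantage: {p100_peak / gm200_overclocked - 1:.0%}")
```

With these inputs the advantage over an overclocked GM200 lands a little above the 20% mark, and the enabled fraction is where the 93% in the headline comes from.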
Why things are different in VR performance testing
It has been an interesting past several weeks and I find myself in an interesting spot. Clearly, and without a shred of doubt, virtual reality, more than any other gaming platform that has come before it, needs an accurate measure of performance and experience. With traditional PC gaming, if you dropped a couple of frames, or saw a slightly out of sync animation, you might notice and get annoyed. But in VR, with a head-mounted display just inches from your face taking up your entire field of view, a hitched frame or a stutter in motion can completely ruin the immersive experience that the game developer is aiming to provide. Even worse, it could cause dizziness and nausea and define your VR experience negatively, likely killing the excitement of the platform.
My conundrum, and the one that I think most of our industry rests in, is that we don’t yet have the tools and ability to properly quantify the performance of VR. In a market and a platform that so desperately needs to get this RIGHT, we are at a point where we are just trying to get it AT ALL. I have read and seen some other glances at performance of VR headsets like the Oculus Rift and the HTC Vive released today, but honestly all are missing the mark at some level. Using tools built for traditional PC gaming environments just doesn’t work, and experiential reviews talk about what the gamer can expect to “feel” but lack the data and analysis to back it up and to help point the industry in the right direction to improve in the long run.
With final hardware from both Oculus and HTC / Valve in my hands for the last three weeks, I have, with the help of Ken and Allyn, been diving into the important question of HOW do we properly test VR? I will be upfront: we don’t have a final answer yet. But we have a direction. And we have some interesting results to show you that should prove we are on the right track. But we’ll need help from the likes of Valve, Oculus, AMD, NVIDIA, Intel and Microsoft to get it right. Based on a lot of discussion I’ve had in just the last 2-3 days, I think we are moving in the correct direction.
Why things are different in VR performance testing
So why don’t our existing tools work for testing performance in VR? Things like Fraps, Frame Rating and FCAT have revolutionized performance evaluation for PCs – so why not VR? The short answer is that the gaming pipeline changes in VR with the introduction of two new SDKs: Oculus and OpenVR.
Though both have differences, the key is that they intercept the draw path from the GPU to the screen. When you attach an Oculus Rift or an HTC Vive to your PC it does not show up as a display in your system; this is a change from the first developer kits from Oculus years ago. Now they are driven by what’s known as “direct mode.” This mode offers improved user experiences and the ability for the Oculus and OpenVR systems to help with quite a bit of functionality for game developers. It also means there are actions being taken on the rendered frames after the last point at which we can monitor them. At least for today.
A system worthy of VR!
Early this year I started getting request after request for hardware suggestions for upcoming PC builds for VR. The excitement surrounding the Oculus Rift and the HTC Vive has caught fire across all spectrums of technology, from PC enthusiasts to gaming enthusiasts to just those of you interested in a technology that has been "right around the corner" for decades. The requests for build suggestions spanned our normal readership as well as those that had previously only focused on console gaming, and thus the need for a selection of build guides began.
I launched build guides for $900 and $1500 price points earlier in the week, but today we look at the flagship option, targeting a budget of $2500. Though this is a pricey system that should not be undertaken lightly, it is far from a "crazy expensive" build with multiple GPUs, multiple CPUs or high dollar items unnecessary for gaming and VR.
With that in mind, let's jump right into the information you are looking for: the components we recommend.
VR Build Guide: $2500 Spring 2016

| Component | | Amazon.com Link | B&H Photo Link |
|---|---|---|---|
| Processor | Intel Core i7-5930K | $527 | $578 |
| Motherboard | ASUS X99-A USB 3.1 | $264 | $259 |
| Memory | Corsair Dominator Platinum 16GB DDR4-3000 | $169 | |
| Graphics Card | ASUS GeForce GTX 980 Ti STRIX | $659 | $669 |
| Storage | 512GB Samsung 950 Pro, Western Digital Red 4TB | | |
| Power Supply | Corsair HX750i Platinum | $144 | $149 |
| CPU Cooler | Corsair H100i v2 | $107 | $107 |
| Case | Corsair Carbide 600C | $149 | $141 |
| Total Price | Full cart - $2,519 | | |
For those of you interested in a bit more detail on the why of the parts selection, rather than just the what, I have some additional information for you.
Unlike the previous two builds that used Intel's consumer Skylake processors, our $2500 build moves to the Haswell-E platform, an enthusiast design that comes from the realm of workstation products. The Core i7-5930K is a 6-core processor with HyperThreading, allowing for 12 addressable threads. Though we are targeting this machine for VR gaming, the move to this processor will mean better performance for other tasks as well including video encoding, photo editing and more. It's unlocked too - so if you want to stretch that clock speed up via overclocking, you have the flexibility for that.
Update: Several people have pointed out that the Core i7-5820K is a very similar processor to the 5930K, with a $100-150 price advantage. It's another great option if you are looking to save a bit more money, and you don't expect to want/need the additional PCI Express lanes the 5930K offers (40 lanes versus 28 lanes).
With the transition to Haswell-E we have an ASUS X99-A USB 3.1 motherboard. This board is the first in our VR builds to support not just 2-Way SLI and CrossFire but 3-Way as well if we find that VR games and engines are able to consistently and properly integrate support for multi-GPU. This recently updated board from ASUS includes USB 3.1 support as you can tell from the name, includes 8 slots for DDR4 memory and offers enough PCIe lanes for expansion in all directions.
For our graphics card we have gone with the ASUS GeForce GTX 980 Ti Strix. The 980 Ti is the fastest single GPU solution on the market today and with 6GB of memory on-board should be able to handle anything that VR can toss at it. In terms of compute performance the 980 Ti is more than 40% faster than the GTX 980, the GPU used in our $1500 solution. The Strix integration uses a custom cooler that performs much better than the stock solution and is quieter.
Some Hints as to What Comes Next
On March 14 at the Capsaicin event at GDC AMD disclosed their roadmap for GPU architectures through 2018. There were two new names in attendance as well as some hints at what technology will be implemented in these products. It was only one slide, but some interesting information can be inferred from what we have seen and what was said in the event and afterwards during interviews.
Polaris is the next generation of GCN products from AMD that have been shown off for the past few months. Previously in December and at CES we saw the Polaris 11 GPU on display. Very little is known about this product except that it is small and extremely power efficient. Last night we saw the Polaris 10 being run and we only know that it is competitive with current mainstream performance and is larger than the Polaris 11. These products are purportedly based on Samsung/GLOBALFOUNDRIES 14nm LPP.
The source of near endless speculation online.
In the slide AMD showed it listed Polaris as having 2.5X the performance per watt over the previous 28 nm products in AMD’s lineup. This is impressive, but not terribly surprising. AMD and NVIDIA both skipped the 20 nm planar node because it just did not offer up the type of performance and scaling to make sense economically. Simply put, the expense was not worth the results in terms of die size improvements and more importantly power scaling. 20 nm planar just could not offer the type of performance overall that GPU manufacturers could achieve with 2nd and 3rd generation 28nm processes.
What was missing from the slide is any mention of whether Polaris will integrate HBM1 or HBM2. Vega, the architecture after Polaris, does in fact list HBM2 as the memory technology it will be packaged with. It promises another tick up in terms of performance per watt, but that is going to come more from aggressive design optimizations and likely improvements on FinFET process technologies. Vega will be a 2017 product.
Beyond that we see Navi. It again boasts an improvement in perf per watt as well as the inclusion of a new memory technology beyond HBM. Current conjecture is that this could be HMC (hybrid memory cube). I am not entirely certain of that particular conjecture as it does not necessarily improve upon the advantages of current generation HBM and upcoming HBM2 implementations. Navi will not show up until 2018 at the earliest. This *could* be a 10 nm part, but considering the struggle that the industry has had getting to 14/16nm FinFET I am not holding my breath.
AMD provided few details about these products other than what we see here. From here on out is conjecture based upon industry trends, analysis of known roadmaps, and the limitations of the process and memory technologies that are already well known.
Shedding a little light on Monday's announcement
Most of our readers should have some familiarity with GameWorks, which is a series of libraries and utilities that help game developers (and others) create software. While many hardware and platform vendors provide samples and frameworks, taking the brunt of the work required to solve complex problems, this is NVIDIA's branding for their suite of technologies. Their hope is that it pushes the industry forward, which in turn drives GPU sales as users see the benefits of upgrading.
This release, GameWorks SDK 3.1, contains three complete features and two “beta” ones. We will start with the first three, each of which targets a portion of the lighting and shadowing problem. The last two, which we will discuss at the end, are the experimental ones and fall under the blanket of physics and visual effects.
The first technology is Volumetric Lighting, which simulates the way light scatters off dust in the atmosphere. Game developers have been approximating this effect for a long time. In fact, I remember a particular section of Resident Evil 4 where you walk down a dim hallway that has light rays spilling in from the windows. Gamecube-era graphics could only do so much, though, and certain camera positions show that the effect was just a translucent, one-sided, decorative plane. It was a cheat that was hand-placed by a clever artist.
GameWorks' Volumetric Lighting goes after the same effect, but with a much different implementation. It looks at the generated shadow maps and, using hardware tessellation, extrudes geometry from the unshadowed portions toward the light. These little bits of geometry sum, depending on how deep the volume is, which translates into the required highlight. Also, since it's hardware tessellated, it probably has a smaller impact on performance because the GPU only needs to store enough information to generate the geometry, not store (and update) the geometry data for all possible light shafts themselves -- and it needs to store those shadow maps anyway.
Even though it seemed like this effect was independent of render method, since it basically just adds geometry to the scene, I asked whether it was locked to deferred rendering methods. NVIDIA said that it should be unrelated, as I suspected, which is good for VR. Forward rendering is easier to anti-alias, which makes the uneven pixel distribution (after lens distortion) appear more smooth.
A start to proper testing
During all the commotion last week surrounding the release of a new Ashes of the Singularity DX12 benchmark, Microsoft's launch of Gears of War Ultimate Edition on the Windows Store and the company's supposed desire to merge Xbox and PC gaming, a constant source of insight for me was one Andrew Lauritzen. Andrew is a graphics guru at Intel and has extensive knowledge of DirectX, rendering, engines, etc. and has always been willing to teach and educate me on areas that crop up. The entire DirectX 12 and Universal Windows Platform discussion was definitely one such instance.
Yesterday morning Andrew pointed me to a GitHub release for a tool called PresentMon, a small sample of code written by a colleague of Andrew's that might be the beginnings of being able to properly monitor performance of DX12 games and even UWP games.
The idea is simple and its implementation even simpler: PresentMon monitors the Windows event tracing stack for present commands and records data about them to a CSV file. Anyone familiar with the kind of ETW data you can gather will appreciate that PresentMon culls out nearly all of the headache of data gathering by simplifying the results into application name/ID, Present call deltas and a bit more.
Gears of War Ultimate Edition - the debated UWP version
The "Present" method in Windows is what produces a frame and shows it to the user. PresentMon looks at the Windows events running through the system, takes note of when those present commands are received by the OS for any given application, and records the time between them. Because this tool runs at the OS level, it can capture Present data from all kinds of APIs including DX12, DX11, OpenGL, Vulkan and more. It does have limitations though - it is read only so producing an overlay on the game/application being tested isn't possible today. (Or maybe ever in the case of UWP games.)
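To give a feel for what you can do with that kind of output, here is a minimal sketch that turns a log of Present timestamps into frame times and an average frame rate. The CSV below is made up for illustration and the column names (`Application`, `ProcessID`, `TimeInSeconds`) are assumptions on my part, so check the header of the file your copy of the tool actually writes before adapting it.

```python
import csv
import io
import statistics

# Hypothetical PresentMon-style log. The data and column names are
# assumptions for this sketch, not a guaranteed file format.
SAMPLE_CSV = """Application,ProcessID,TimeInSeconds
game.exe,4242,0.0000
game.exe,4242,0.0167
game.exe,4242,0.0333
game.exe,4242,0.0500
game.exe,4242,0.0900
game.exe,4242,0.1067
"""


def present_deltas_ms(csv_text: str, application: str) -> list[float]:
    """Return the time between consecutive Present calls in milliseconds."""
    rows = csv.DictReader(io.StringIO(csv_text))
    times = [float(row["TimeInSeconds"]) for row in rows
             if row["Application"] == application]
    # Each delta is the gap between one Present call and the next.
    return [(later - earlier) * 1000.0
            for earlier, later in zip(times, times[1:])]


deltas = present_deltas_ms(SAMPLE_CSV, "game.exe")
average_fps = 1000.0 / statistics.mean(deltas)
worst_frame_ms = max(deltas)

print(f"Frames measured: {len(deltas)}")
print(f"Average frame rate: {average_fps:.1f} FPS")
print(f"Longest gap between Presents: {worst_frame_ms:.1f} ms")
```

The long gap in the sample data is the kind of hitch an average frame rate hides, which is why looking at the individual deltas matters more than the single FPS number.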
What PresentMon offers us at this stage is an early look at a Fraps-like performance monitoring tool. In the same way that Fraps was looking for Present commands from Windows and recording them, PresentMon does the same thing, at a very similar point in the rendering pipeline as well. What is important and unique about PresentMon is that it is API independent and useful for all types of games and programs.
PresentMon at work
The first and obvious question for our readers is how this performance monitoring tool compares with Frame Rating, our FCAT-based capture benchmarking platform we have used on GPUs and CPUs for years now. To be honest, it's not the same and should not be considered an analog to it. Frame Rating and capture-based testing look for smoothness, dropped frames and performance at the display, while Fraps and PresentMon look at performance closer to the OS level, before the graphics driver really gets the final say in things. I am still targeting universal DX12 Frame Rating testing with exclusive full screen capable applications and expect that to be ready sooner rather than later. However, what PresentMon does give us is at least an early universal look at DX12 performance including games that are locked behind the Windows Store rules.
Things are about to get...complicated
Earlier this week, the team behind Ashes of the Singularity released an updated version of its early access game, which updated its features and capabilities. With support for DirectX 11 and DirectX 12, and adding in multiple graphics card support, the game featured a benchmark mode that got quite a lot of attention. We saw stories based on that software posted by Anandtech, Guru3D and ExtremeTech, all of which had varying views on the advantages of one GPU or another.
That isn’t the focus of my editorial here today, though.
Shortly after the initial release, a discussion began around results from the Guru3D story that measured frame time consistency and smoothness with FCAT, a capture based testing methodology much like the Frame Rating process we have here at PC Perspective. In that post on ExtremeTech, Joel Hruska claims that the results and conclusion from Guru3D are wrong because the FCAT capture methods make assumptions on the output matching what the user experience feels like. Maybe everyone is wrong?
First a bit of background: I have been working with Oxide and the Ashes of the Singularity benchmark for a couple of weeks, hoping to get a story that I was happy with and felt was complete, before having to head out the door to Barcelona for the Mobile World Congress. That didn’t happen – such is life with an 8-month old. But, in my time with the benchmark, I found a couple of things that were very interesting, even concerning, that I was working through with the developers.
FCAT overlay as part of the Ashes benchmark
First, the initial FCAT overlay, which Oxide should be PRAISED for including since we don’t have and likely won’t have a universal DX12 variant of it, was implemented incorrectly, with duplication of color swatches that made the results from capture-based testing inaccurate. I don’t know if Guru3D used that version to do its FCAT testing, but I was able to get some updated EXEs of the game through the developer in order to get the overlay working correctly. Once that was corrected, I found yet another problem: an issue of frame presentation order on NVIDIA GPUs that likely has to do with asynchronous shaders. Whether that issue is on the NVIDIA driver side or the game engine side is still being investigated by Oxide, but it’s interesting to note that this problem couldn’t have been found without a proper FCAT implementation.
With all of that under the bridge, I set out to benchmark this latest version of Ashes and DX12 to measure performance across a range of AMD and NVIDIA hardware. The data showed some abnormalities, though. Some results just didn’t make sense in the context of what I was seeing in the game and what the overlay results were indicating. It appeared that Vsync (vertical sync) was working differently than I had seen with any other game on the PC.
For the NVIDIA platform, tested using a GTX 980 Ti, the game seemingly randomly starts up with Vsync on or off, with no clear indicator of what was causing it, despite the in-game settings being set how I wanted them. But the Frame Rating capture data was still working as I expected – just because Vsync is enabled doesn’t mean you can’t look at the results in capture formats. I have written stories on what Vsync enabled captured data looks like and what it means as far back as April 2013. Obviously, to get the best and most relevant data from Frame Rating, setting vertical sync off is ideal. Running into more frustration than answers, I moved over to an AMD platform.