The Dual-Fiji Card Finally Arrives
This weekend, leaks of information on both WCCFTech and VideoCardz.com have revealed all the information about the pending release of AMD’s dual-GPU giant, the Radeon Pro Duo. While no one at PC Perspective has been briefed on the product officially, all of the interesting data surrounding the product is clearly outlined in the slides on those websites, minus some independent benchmark testing that we are hoping to get to next week. Based on the report from both sites, the Radeon Pro Duo will be released on April 26th.
AMD actually revealed the product and branding for the Radeon Pro Duo back in March, during its live streamed Capsaicin event surrounding GDC. At that point we were given the following information:
- Dual Fiji XT GPUs
- 8GB of total HBM memory
- 4x DisplayPort (this has since been modified)
- 16 TFLOPS of compute
- $1499 price tag
The design of the card follows the same industrial design as the reference designs of the Radeon Fury X, and integrates a dual-pump cooler and external fan/radiator to keep both GPUs running cool.
Based on the slides leaked out today, AMD has revised the Radeon Pro Duo design to include a set of three DisplayPort connections and one HDMI port. This was a necessary change as the Oculus Rift requires an HDMI port to work; only the HTC Vive has built-in support for a DisplayPort connection and even in that case you would need a full-size to mini-DisplayPort cable.
The 8GB of HBM (high bandwidth memory) on the card is split between the two Fiji XT GPUs on the card, just like other multi-GPU options on the market. The 350 watt power draw is exceptionally high, exceeded only by AMD’s previous dual-GPU beast, the Radeon 295X2, which used 500+ watts, and the NVIDIA GeForce GTX Titan Z, which draws 375 watts!
Here is the specification breakdown of the Radeon Pro Duo. The card has 8192 total stream processors and 128 Compute Units, split evenly between the two GPUs. You are getting two full Fiji XT GPUs in this card, an impressive feat made possible in part by the use of High Bandwidth Memory and its smaller physical footprint.
| ||Radeon Pro Duo||R9 Nano||R9 Fury||R9 Fury X||GTX 980 Ti||TITAN X||GTX 980||R9 290X|
|GPU||Fiji XT x 2||Fiji XT||Fiji Pro||Fiji XT||GM200||GM200||GM204||Hawaii XT|
|Rated Clock||up to 1000 MHz||up to 1000 MHz||1000 MHz||1050 MHz||1000 MHz||1000 MHz||1126 MHz||1000 MHz|
|Memory||8GB (4GB x 2)||4GB||4GB||4GB||6GB||12GB||4GB||4GB|
|Memory Clock||500 MHz||500 MHz||500 MHz||500 MHz||7000 MHz||7000 MHz||7000 MHz||5000 MHz|
|Memory Interface||4096-bit (HBM) x 2||4096-bit (HBM)||4096-bit (HBM)||4096-bit (HBM)||384-bit||384-bit||256-bit||512-bit|
|Memory Bandwidth||1024 GB/s||512 GB/s||512 GB/s||512 GB/s||336 GB/s||336 GB/s||224 GB/s||320 GB/s|
|TDP||350 watts||175 watts||275 watts||275 watts||250 watts||250 watts||165 watts||290 watts|
|Peak Compute||16.38 TFLOPS||8.19 TFLOPS||7.20 TFLOPS||8.60 TFLOPS||5.63 TFLOPS||6.14 TFLOPS||4.61 TFLOPS||5.63 TFLOPS|
|Transistor Count||8.9B x 2||8.9B||8.9B||8.9B||8.0B||8.0B||5.2B||6.2B|
The Radeon Pro Duo has a rated clock speed of up to 1000 MHz. That’s the same clock speed as the R9 Fury and the rated “up to” frequency on the R9 Nano. It’s worth noting that we did see a handful of instances where the R9 Nano’s power limiting capability resulted in some extremely variable clock speeds in practice. AMD recently added a feature to its Crimson driver to disable power metering on the Nano, at the expense of more power draw, and I would assume the same option would work for the Pro Duo.
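For readers who want to check the headline numbers in the table, here is a minimal sketch of the arithmetic. It assumes each stream processor retires one fused multiply-add (2 FLOPs) per clock and that HBM transfers data twice per memory clock; the function names are mine and purely illustrative, and the 4096 stream processors per Fiji XT come from splitting the Pro Duo's 8192 evenly.

```python
def peak_tflops(stream_processors, clock_mhz):
    """Peak FP32 throughput in TFLOPS.

    Assumes 2 FLOPs (one fused multiply-add) per stream processor per clock.
    """
    return stream_processors * 2 * clock_mhz * 1e6 / 1e12


def memory_bandwidth_gbs(bus_width_bits, memory_clock_mhz, transfers_per_clock=2):
    """Memory bandwidth in GB/s, assuming a double data rate interface."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * memory_clock_mhz * 1e6 * transfers_per_clock / 1e9


# Radeon Pro Duo: 8192 stream processors at up to 1000 MHz
pro_duo_tflops = peak_tflops(8192, 1000)

# R9 Fury X: one Fiji XT (4096 stream processors) at 1050 MHz
fury_x_tflops = peak_tflops(4096, 1050)

# One Fiji GPU: 4096-bit HBM interface at 500 MHz
per_gpu_bandwidth = memory_bandwidth_gbs(4096, 500)

print(f"Pro Duo peak compute: {pro_duo_tflops:.2f} TFLOPS")
print(f"Fury X peak compute:  {fury_x_tflops:.2f} TFLOPS")
print(f"Bandwidth per GPU:    {per_gpu_bandwidth:.0f} GB/s")
```

Keep in mind that the Pro Duo figure only holds when both GPUs sustain the full "up to" clock, which the Nano experience suggests is not guaranteed.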
93% of a GP100 at least...
NVIDIA has announced the Tesla P100, the company's newest (and most powerful) accelerator for HPC. Based on the Pascal GP100 GPU, the Tesla P100 is built on 16nm FinFET and uses HBM2.
NVIDIA provided a comparison table, to which we added what we know about a full GP100:
| ||Tesla K40||Tesla M40||Tesla P100||Full GP100|
|GPU||GK110 (Kepler)||GM200 (Maxwell)||GP100 (Pascal)||GP100 (Pascal)|
|FP32 CUDA Cores / SM||192||128||64||64|
|FP32 CUDA Cores / GPU||2880||3072||3584||3840|
|FP64 CUDA Cores / SM||64||4||32||32|
|FP64 CUDA Cores / GPU||960||96||1792||1920|
|Base Clock||745 MHz||948 MHz||1328 MHz||TBD|
|GPU Boost Clock||810/875 MHz||1114 MHz||1480 MHz||TBD|
|Memory Interface||384-bit GDDR5||384-bit GDDR5||4096-bit HBM2||4096-bit HBM2|
|Memory Size||Up to 12 GB||Up to 24 GB||16 GB||TBD|
|L2 Cache Size||1536 KB||3072 KB||4096 KB||TBD|
|Register File Size / SM||256 KB||256 KB||256 KB||256 KB|
|Register File Size / GPU||3840 KB||6144 KB||14336 KB||15360 KB|
|TDP||235 W||250 W||300 W||TBD|
|Transistors||7.1 billion||8 billion||15.3 billion||15.3 billion|
|GPU Die Size||551 mm2||601 mm2||610 mm2||610 mm2|
|Manufacturing Process||28 nm||28 nm||16 nm||16 nm|
This table is designed for developers that are interested in GPU compute, so a few variables (like ROPs) are still unknown, but it still gives us a huge insight into the “big Pascal” architecture. The jump to 16nm allows for about twice the number of transistors, 15.3 billion, up from 8 billion with GM200, with roughly the same die area, 610 mm2, up from 601 mm2.
A full GP100 processor will have 60 shader modules, compared to GM200's 24, although Pascal stores half of the shaders per SM. The GP100 part that is listed in the table above is actually partially disabled, cutting off four of the sixty total. This leads to 3584 single-precision (32-bit) CUDA cores, which is up from 3072 in GM200. (The full GP100 architecture will have 3840 of these FP32 CUDA cores -- but we don't know when or where we'll see that.) The base clock is also significantly higher than Maxwell, 1328 MHz versus ~1000 MHz for the Titan X and 980 Ti, although Ryan has overclocked those GPUs to ~1390 MHz with relative ease. This is interesting, because even though 10.6 TeraFLOPs is amazing, it's only about 20% more than what GM200 could pull off with an overclock.
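The core counts and throughput comparison above can be reproduced with a few lines of arithmetic. This is a rough sketch, assuming 2 FLOPs per FP32 CUDA core per clock; under that assumption the 10.6 TFLOPS figure lines up with the 1480 MHz boost clock from the table. The helper names are hypothetical.

```python
CORES_PER_SM = 64  # FP32 CUDA cores per Pascal SM, from the table above


def fp32_cores(active_sms):
    """Total FP32 CUDA cores for a given number of enabled SMs."""
    return active_sms * CORES_PER_SM


def peak_tflops(cores, clock_mhz):
    """Peak FP32 TFLOPS, assuming 2 FLOPs per core per clock."""
    return cores * 2 * clock_mhz / 1e6


tesla_p100_cores = fp32_cores(56)   # four of sixty SMs disabled
full_gp100_cores = fp32_cores(60)
enabled_fraction = 56 / 60          # the "93% of a GP100" in the headline

p100_tflops = peak_tflops(tesla_p100_cores, 1480)   # Tesla P100 at boost clock
gm200_oc_tflops = peak_tflops(3072, 1390)           # GM200 overclocked to ~1390 MHz
advantage = p100_tflops / gm200_oc_tflops - 1

print(f"Tesla P100: {tesla_p100_cores} cores, {p100_tflops:.2f} TFLOPS")
print(f"GM200 overclocked: {gm200_oc_tflops:.2f} TFLOPS")
print(f"P100 advantage: {advantage:.0%}")
```

With these inputs the gap works out to a bit under 25 percent, which is in the same ballpark as the "about 20%" estimate.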
Why things are different in VR performance testing
It has been an interesting past several weeks and I find myself in an interesting spot. Clearly, and without a shred of doubt, virtual reality, more than any other gaming platform that has come before it, needs an accurate measure of performance and experience. With traditional PC gaming, if you dropped a couple of frames, or saw a slightly out of sync animation, you might notice and get annoyed. But in VR, with a head-mounted display just inches from your face taking up your entire field of view, a hitch in frame or a stutter in motion can completely ruin the immersive experience that the game developer is aiming to provide. Even worse, it could cause dizziness, nausea and define your VR experience negatively, likely killing the excitement of the platform.
My conundrum, and the one that I think most of our industry rests in, is that we don’t yet have the tools and ability to properly quantify the performance of VR. In a market and a platform that so desperately needs to get this RIGHT, we are at a point where we are just trying to get it AT ALL. I have read and seen some other glances at performance of VR headsets like the Oculus Rift and the HTC Vive released today, but honestly all are missing the mark at some level. Using tools built for traditional PC gaming environments just doesn’t work, and experiential reviews talk about what the gamer can expect to “feel” but lack the data and analysis to back it up and to help point the industry in the right direction to improve in the long run.
With final hardware from both Oculus and HTC / Valve in my hands for the last three weeks, I have, with the help of Ken and Allyn, been diving into the important question of HOW do we properly test VR? I will be upfront: we don’t have a final answer yet. But we have a direction. And we have some interesting results to show you that should prove we are on the right track. But we’ll need help from the likes of Valve, Oculus, AMD, NVIDIA, Intel and Microsoft to get it right. Based on a lot of discussion I’ve had in just the last 2-3 days, I think we are moving in the correct direction.
Why things are different in VR performance testing
So why don’t our existing tools work for testing performance in VR? Things like Fraps, Frame Rating and FCAT have revolutionized performance evaluation for PCs – so why not VR? The short answer is that the gaming pipeline changes in VR with the introduction of two new SDKs: Oculus and OpenVR.
Though both have differences, the key is that they are intercepting the draw ability from the GPU to the screen. When you attach an Oculus Rift or an HTC Vive to your PC it does not show up as a display in your system; this is a change from the first developer kits from Oculus years ago. Now they are driven by what’s known as “direct mode.” This mode offers improved user experiences and the ability for the Oculus and OpenVR systems to help with quite a bit of functionality for game developers. It also means there are actions being taken on the rendered frames after we can last monitor them. At least for today.
A system worthy of VR!
Early this year I started getting request after request for hardware suggestions for upcoming PC builds for VR. The excitement surrounding the Oculus Rift and the HTC Vive has caught fire across all spectrums of technology, from PC enthusiasts to gaming enthusiasts to just those of you interested in a technology that has been "right around the corner" for decades. The requests for build suggestions spanned our normal readership as well as those that had previously only focused on console gaming, and thus the need for a selection of build guides began.
I launched build guides for $900 and $1500 price points earlier in the week, but today we look at the flagship option, targeting a budget of $2500. Though this is a pricey system that should not be undertaken lightly, it is far from a "crazy expensive" build with multiple GPUs, multiple CPUs or high dollar items unnecessary for gaming and VR.
With that in mind, let's jump right into the information you are looking for: the components we recommend.
|VR Build Guide: $2500, Spring 2016|
|Component||Part||Amazon.com Link||B&H Photo Link|
|Processor||Intel Core i7-5930K||$527||$578|
|Motherboard||ASUS X99-A USB 3.1||$264||$259|
|Memory||Corsair Dominator Platinum 16GB DDR4-3000||$169|
|Graphics Card||ASUS GeForce GTX 980 Ti STRIX||$659||$669|
|Storage||512GB Samsung 950 Pro, Western Digital Red 4TB|
|Power Supply||Corsair HX750i Platinum||$144||$149|
|CPU Cooler||Corsair H100i v2||$107||$107|
|Case||Corsair Carbide 600C||$149||$141|
|Total Price||Full cart - $2,519|
For those of you interested in a bit more detail on the why of the parts selection, rather than just the what, I have some additional information for you.
Unlike the previous two builds that used Intel's consumer Skylake processors, our $2500 build moves to the Haswell-E platform, an enthusiast design that comes from the realm of workstation products. The Core i7-5930K is a 6-core processor with Hyper-Threading, allowing for 12 addressable threads. Though we are targeting this machine for VR gaming, the move to this processor will mean better performance for other tasks as well, including video encoding, photo editing and more. It's unlocked too, so if you want to stretch that clock speed up via overclocking, you have the flexibility for that.
Update: Several people have pointed out that the Core i7-5820K is a very similar processor to the 5930K, with a $100-150 price advantage. It's another great option if you are looking to save a bit more money, and you don't expect to want/need the additional PCI Express lanes the 5930K offers (40 lanes versus 28 lanes).
With the transition to Haswell-E we have an ASUS X99-A USB 3.1 motherboard. This board is the first in our VR builds to support not just 2-Way SLI and CrossFire but 3-Way as well if we find that VR games and engines are able to consistently and properly integrate support for multi-GPU. This recently updated board from ASUS includes USB 3.1 support as you can tell from the name, includes 8 slots for DDR4 memory and offers enough PCIe lanes for expansion in all directions.
Looking to build a PC for the very first time, or need a refresher? You can find our recent step-by-step build videos to help you through the process right here!!
For our graphics card we have gone with the ASUS GeForce GTX 980 Ti Strix. The 980 Ti is the fastest single GPU solution on the market today and with 6GB of memory on-board should be able to handle anything that VR can toss at it. In terms of compute performance the 980 Ti is more than 40% faster than the GTX 980, the GPU used in our $1500 solution. The Strix integration uses a custom cooler that performs much better than the stock solution and is quieter.
Some Hints as to What Comes Next
On March 14 at the Capsaicin event at GDC AMD disclosed their roadmap for GPU architectures through 2018. There were two new names in attendance as well as some hints at what technology will be implemented in these products. It was only one slide, but some interesting information can be inferred from what we have seen and what was said in the event and afterwards during interviews.
Polaris is the next generation of GCN products from AMD that have been shown off for the past few months. Previously in December and at CES we saw the Polaris 11 GPU on display. Very little is known about this product except that it is small and extremely power efficient. Last night we saw the Polaris 10 being run and we only know that it is competitive with current mainstream performance and is larger than the Polaris 11. These products are purportedly based on Samsung/GLOBALFOUNDRIES 14nm LPP.
The source of near endless speculation online.
In the slide AMD showed it listed Polaris as having 2.5X the performance per watt over the previous 28 nm products in AMD’s lineup. This is impressive, but not terribly surprising. AMD and NVIDIA both skipped the 20 nm planar node because it just did not offer up the type of performance and scaling to make sense economically. Simply put, the expense was not worth the results in terms of die size improvements and more importantly power scaling. 20 nm planar just could not offer the type of performance overall that GPU manufacturers could achieve with 2nd and 3rd generation 28nm processes.
What was missing from the slide is any mention that Polaris will integrate either HBM1 or HBM2. Vega, the architecture after Polaris, does in fact list HBM2 as the memory technology it will be packaged with. It promises another tick up in terms of performance per watt, but that is going to come more from aggressive design optimizations and likely improvements on FinFET process technologies. Vega will be a 2017 product.
Beyond that we see Navi. It again boasts an improvement in perf per watt as well as the inclusion of a new memory technology beyond HBM. Current conjecture is that this could be HMC (hybrid memory cube). I am not entirely certain of that particular conjecture as it does not necessarily improve upon the advantages of current generation HBM and upcoming HBM2 implementations. Navi will not show up until 2018 at the earliest. This *could* be a 10 nm part, but considering the struggle that the industry has had getting to 14/16nm FinFET I am not holding my breath.
AMD provided few details about these products other than what we see here. From here on out is conjecture based upon industry trends, analysis of known roadmaps, and the limitations of the process and memory technologies that are already well known.
Shedding a little light on Monday's announcement
Most of our readers should have some familiarity with GameWorks, which is a series of libraries and utilities that help game developers (and others) create software. While many hardware and platform vendors provide samples and frameworks, taking the brunt of the work required to solve complex problems, this is NVIDIA's branding for their suite of technologies. Their hope is that it pushes the industry forward, which in turn drives GPU sales as users see the benefits of upgrading.
This release, GameWorks SDK 3.1, contains three complete features and two “beta” ones. We will start with the first three, each of which target a portion of the lighting and shadowing problem. The last two, which we will discuss at the end, are the experimental ones and fall under the blanket of physics and visual effects.
The first technology is Volumetric Lighting, which simulates the way light scatters off dust in the atmosphere. Game developers have been approximating this effect for a long time. In fact, I remember a particular section of Resident Evil 4 where you walk down a dim hallway that has light rays spilling in from the windows. Gamecube-era graphics could only do so much, though, and certain camera positions show that the effect was just a translucent, one-sided, decorative plane. It was a cheat that was hand-placed by a clever artist.
GameWorks' Volumetric Lighting goes after the same effect, but with a much different implementation. It looks at the generated shadow maps and, using hardware tessellation, extrudes geometry from the unshadowed portions toward the light. These little bits of geometry sum, depending on how deep the volume is, which translates into the required highlight. Also, since it's hardware tessellated, it probably has a smaller impact on performance because the GPU only needs to store enough information to generate the geometry, not store (and update) the geometry data for all possible light shafts themselves -- and it needs to store those shadow maps anyway.
Even though it seemed like this effect was independent of render method, since it basically just adds geometry to the scene, I asked whether it was locked to deferred rendering methods. NVIDIA said that it should be unrelated, as I suspected, which is good for VR. Forward rendering is easier to anti-alias, which makes the uneven pixel distribution (after lens distortion) appear more smooth.
A start to proper testing
During all the commotion last week surrounding the release of a new Ashes of the Singularity DX12 benchmark, Microsoft's launching of the Gears of War Ultimate Edition on the Windows Store and the company's supposed desire to merge Xbox and PC gaming, a constant source of insight for me was one Andrew Lauritzen. Andrew is a graphics guru at Intel and has extensive knowledge of DirectX, rendering, engines, etc. and has always been willing to teach and educate me on areas that crop up. The entire DirectX 12 and Universal Windows Platform discussion was definitely one such instance.
Yesterday morning Andrew pointed me to a GitHub release for a tool called PresentMon, a small sample of code written by a colleague of Andrew's that might be the beginnings of being able to properly monitor performance of DX12 games and even UWP games.
The idea is simple and its implementation even simpler: PresentMon monitors the Windows event tracing stack for present commands and records data about them to a CSV file. Anyone familiar with the kind of ETW data you can gather will appreciate that PresentMon culls out nearly all of the headache of data gathering by simplifying the results into application name/ID, Present call deltas and a bit more.
Gears of War Ultimate Edition - the debated UWP version
The "Present" method in Windows is what produces a frame and shows it to the user. PresentMon looks at the Windows events running through the system, takes note of when those present commands are received by the OS for any given application, and records the time between them. Because this tool runs at the OS level, it can capture Present data from all kinds of APIs including DX12, DX11, OpenGL, Vulkan and more. It does have limitations though - it is read only so producing an overlay on the game/application being tested isn't possible today. (Or maybe ever in the case of UWP games.)
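To give a sense of what you can do with that CSV once it exists, here is a small sketch that turns Present-to-Present deltas into an average frame rate and a worst-case frame time. The column names (`Application`, `ProcessID`, `TimeInSeconds`, `MsBetweenPresents`) and the sample rows are assumptions made for illustration; check the header row of your own capture, since the exact layout depends on the PresentMon build.

```python
import csv
import io

# Synthetic sample data, not a real capture. Column names are assumed.
SAMPLE_CSV = """Application,ProcessID,TimeInSeconds,MsBetweenPresents
game.exe,4242,0.0167,16.7
game.exe,4242,0.0334,16.7
game.exe,4242,0.0501,16.7
game.exe,4242,0.0835,33.4
game.exe,4242,0.1002,16.7
"""


def summarize(csv_text, delta_column="MsBetweenPresents"):
    """Summarize the time between Present calls recorded in a CSV log."""
    rows = csv.DictReader(io.StringIO(csv_text))
    deltas_ms = [float(row[delta_column]) for row in rows]
    total_ms = sum(deltas_ms)
    return {
        "frames": len(deltas_ms),
        # Average FPS over the run: frames divided by elapsed seconds
        "average_fps": 1000.0 * len(deltas_ms) / total_ms,
        # A single long delta is the kind of hitch an average hides
        "worst_frame_ms": max(deltas_ms),
    }


summary = summarize(SAMPLE_CSV)
print(summary)
```

In the sample, one 33.4 ms delta among 16.7 ms frames barely moves the average, which is exactly why looking at the individual Present deltas matters more than a single FPS number.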
What PresentMon offers us at this stage is an early look at a Fraps-like performance monitoring tool. In the same way that Fraps was looking for Present commands from Windows and recording them, PresentMon does the same thing, at a very similar point in the rendering pipeline as well. What is important and unique about PresentMon is that it is API independent and useful for all types of games and programs.
PresentMon at work
The first and obvious question for our readers is how this performance monitoring tool compares with Frame Rating, our FCAT-based capture benchmarking platform we have used on GPUs and CPUs for years now. To be honest, it's not the same and should not be considered an analog to it. Frame Rating and capture-based testing look for smoothness, dropped frames and performance at the display, while Fraps and PresentMon look at performance closer to the OS level, before the graphics driver really gets the final say in things. I am still aiming for universal DX12 Frame Rating testing with exclusive full screen capable applications and expect that to be ready sooner rather than later. However, what PresentMon does give us is at least an early universal look at DX12 performance, including games that are locked behind the Windows Store rules.
Things are about to get...complicated
Earlier this week, the team behind Ashes of the Singularity released an updated version of its early access game, which updated its features and capabilities. With support for DirectX 11 and DirectX 12, and adding in multiple graphics card support, the game featured a benchmark mode that got quite a lot of attention. We saw stories based on that software posted by Anandtech, Guru3D and ExtremeTech, all of which had varying views on the advantages of one GPU or another.
That isn’t the focus of my editorial here today, though.
Shortly after the initial release, a discussion began around results from the Guru3D story that measured frame time consistency and smoothness with FCAT, a capture based testing methodology much like the Frame Rating process we have here at PC Perspective. In that post on ExtremeTech, Joel Hruska claims that the results and conclusion from Guru3D are wrong because the FCAT capture methods make assumptions on the output matching what the user experience feels like. Maybe everyone is wrong?
First a bit of background: I have been working with Oxide and the Ashes of the Singularity benchmark for a couple of weeks, hoping to get a story that I was happy with and felt was complete, before having to head out the door to Barcelona for the Mobile World Congress. That didn’t happen – such is life with an 8-month old. But, in my time with the benchmark, I found a couple of things that were very interesting, even concerning, that I was working through with the developers.
FCAT overlay as part of the Ashes benchmark
First, the initial implementation of the FCAT overlay was incorrect, with duplication of color swatches that made the results from capture-based testing inaccurate. (Oxide should be PRAISED for including the overlay at all, since we don’t have, and likely won’t have, a universal DX12 variant of it.) I don’t know if Guru3D used that version to do its FCAT testing, but I was able to get some updated EXEs of the game through the developer in order to get the overlay working correctly. Once that was corrected, I found yet another problem: an issue of frame presentation order on NVIDIA GPUs that likely has to do with asynchronous shaders. Whether that issue is on the NVIDIA driver side or the game engine side is still being investigated by Oxide, but it’s interesting to note that this problem couldn’t have been found without a proper FCAT implementation.
With all of that under the bridge, I set out to benchmark this latest version of Ashes and DX12 to measure performance across a range of AMD and NVIDIA hardware. The data showed some abnormalities, though. Some results just didn’t make sense in the context of what I was seeing in the game and what the overlay results were indicating. It appeared that Vsync (vertical sync) was working differently than I had seen with any other game on the PC.
For the NVIDIA platform, tested using a GTX 980 Ti, the game seemingly randomly starts up with Vsync on or off, with no clear indicator of what was causing it, despite the in-game settings being set how I wanted them. But the Frame Rating capture data was still working as I expected; just because Vsync is enabled doesn’t mean you can’t look at the results in capture formats. I have written stories on what Vsync enabled captured data looks like and what it means as far back as April 2013. Obviously, to get the best and most relevant data from Frame Rating, setting vertical sync off is ideal. Running into more frustration than answers, I moved over to an AMD platform.
Caught Up to DirectX 12 in a Single Day
I'm not just talking about the specification. Members of the Khronos Group have also released compatible drivers, SDKs and tools to support them, conformance tests, and a proof-of-concept patch for Croteam's The Talos Principle. To reiterate, this is not a soft launch. The API, and its entire ecosystem, is out and ready for the public on Windows (at least 7+ at launch but a surprise Vista or XP announcement is technically possible) and several distributions of Linux. Google will provide an Android SDK in the near future.
I'm going to editorialize for the next two paragraphs. There was a concern that Vulkan would be too late. The thing is, as of today, Vulkan is now just as mature as DirectX 12. Of course, that could change at a moment's notice; we still don't know how the two APIs are being adopted behind the scenes. A few DirectX 12 titles are planned to launch in a few months, but no full, non-experimental, non-early access game currently exists. Each time I say this, someone links the Wikipedia list of DirectX 12 games. If you look at each entry, though, you'll see that all of them are either: early access, awaiting an unreleased DirectX 12 patch, or using a third-party engine (like Unreal Engine 4) that only lists DirectX 12 as an experimental preview. No full, released, non-experimental DirectX 12 game exists today. Besides, if the latter counts, then you'll need to accept The Talos Principle's proof-of-concept patch, too.
But again, that could change. While today's launch speaks well to the Khronos Group and the API itself, it still needs to be adopted by third party engines, middleware, and software. These partners could, like the Khronos Group before today, be privately supporting Vulkan with the intent to flood out announcements; we won't know until they do... or don't. With the support of popular engines and frameworks, dependent software really just needs to enable it. This has not happened for DirectX 12 yet, and, now, there doesn't seem to be anything keeping it from happening for Vulkan at any moment. With the Game Developers Conference just a month away, we should soon find out.
But back to the announcement.
Vulkan-compatible drivers are launching today across multiple vendors and platforms, but I do not have a complete list. On Windows, I was told to expect drivers from NVIDIA for Windows 7, 8.x, 10 on Kepler and Maxwell GPUs. The standard is compatible with Fermi GPUs, but NVIDIA does not plan on supporting the API for those users due to its low market share. That said, they are paying attention to user feedback and they are not ruling it out, which probably means that they are keeping an open mind in case some piece of software gets popular and depends upon Vulkan. I have not heard from AMD or Intel about Vulkan drivers as of this writing, one way or the other. They could even arrive day one.
On Linux, NVIDIA, Intel, and Imagination Technologies have submitted conformant drivers.
Drivers alone do not make a hard launch, though. SDKs and tools have also arrived, including the LunarG SDK for Windows and Linux. LunarG is a company co-founded by Jens Owen, who had a previous graphics software company that was purchased by VMware. LunarG is backed by Valve, who also backed Vulkan in several other ways. The LunarG SDK helps developers validate their code, inspect what the API is doing, and otherwise debug. Even better, it is also open source, which means that the community can rapidly enhance it, even though it's in a releasable state as it is. RenderDoc, the open-source graphics debugger by Crytek, will also add Vulkan support. ((Update (Feb 16 @ 12:39pm EST): Baldur Karlsson has just emailed me to let me know that it was a personal project at Crytek, not a Crytek project in general, and their GitHub page is much more up-to-date than the linked site.))
The major downside is that Vulkan (like Mantle and DX12) isn't simple.
These APIs are verbose and very different from previous ones, which requires more effort.
Image Credit: NVIDIA
There really isn't much to say about the Vulkan launch beyond this. What graphics APIs really try to accomplish is standardizing signals that enter and leave video cards, such that the GPUs know what to do with them. For the last two decades, we've settled on an arbitrary, single, global object that you attach buffers of data to, in specific formats, and call one of a half-dozen functions to send it.
Compute APIs, like CUDA and OpenCL, decided it was more efficient to handle queues, allowing the application to write commands and send them wherever they need to go. Multiple threads can write commands, and multiple accelerators (GPUs in our case) can be targeted individually. Vulkan, like Mantle and DirectX 12, takes this metaphor and adds graphics-specific instructions to it. Moreover, GPUs can schedule memory, compute, and graphics instructions at the same time, as long as the graphics task has leftover compute and memory resources, and / or the compute task has leftover memory resources.
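The queue metaphor is easier to see in code than in prose. What follows is only an analogy in plain Python, not the Vulkan API: several CPU threads each record their own list of commands, then hand the finished batch to whichever device queue they are targeting. Every name here is hypothetical.

```python
import queue
import threading


def record_commands(thread_id, object_count):
    """Each CPU thread builds its own command list without shared state."""
    return [f"draw object {thread_id}-{i}" for i in range(object_count)]


def worker(thread_id, device_queue):
    commands = record_commands(thread_id, object_count=3)
    # Submit the whole recorded batch at once; queue.Queue is thread-safe
    device_queue.put(commands)


# Two accelerators, each with its own queue that can be targeted individually
gpu_queues = [queue.Queue(), queue.Queue()]

threads = [
    threading.Thread(target=worker, args=(t, gpu_queues[t % len(gpu_queues)]))
    for t in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

batches_per_gpu = [gpu_queue.qsize() for gpu_queue in gpu_queues]

# Drain the queues the way a device would consume its submitted work
total_commands = 0
for gpu_queue in gpu_queues:
    while not gpu_queue.empty():
        total_commands += len(gpu_queue.get())

print(f"batches per GPU: {batches_per_gpu}, total commands: {total_commands}")
```

The point of the sketch is that no thread waits on a single global object while recording; contention only happens at the brief moment of submission.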
This is not necessarily a “better” way to do graphics programming... it's different. That said, it has the potential to be much more efficient when dealing with lots of simple tasks that are sent from multiple CPU threads, especially to multiple GPUs (which currently require the driver to figure out how to convert draw calls into separate workloads -- leading to simplifications like mirrored memory and splitting workload by neighboring frames). Lots of tasks aligns well with video games, especially ones with lots of simple objects, like strategy games, shooters with lots of debris, or any game with large crowds of people. As it becomes ubiquitous, we'll see this bottleneck disappear and games will not need to be designed around these limitations. It might even be used for drawing with cross-platform 2D APIs, like Qt or even webpages, although those two examples (especially the Web) each have other, higher-priority bottlenecks. There are also other benefits to Vulkan.
The WebGL comparison is probably not as common knowledge as Khronos Group believes.
Still, Khronos Group was criticized when WebGL launched as "it was too tough for Web developers".
It didn't need to be easy. Frameworks arrived and simplified everything. It's now ubiquitous.
In fact, Adobe Animate CC (the successor to Flash Pro) is now a WebGL editor (experimentally).
Open platforms are required for this to become commonplace. Engines will probably target several APIs from their internal management APIs, but you can't target users who don't fit in any bucket. Vulkan brings this capability to basically any platform, as long as it has a compute-capable GPU and a driver developer who cares.
Thankfully, it arrived before any competitor established market share.
Early testing for higher end GPUs
UPDATE 2/5/16: Nixxes released a new version of Rise of the Tomb Raider today with some significant changes. I have added another page at the end of this story that looks at results with the new version of the game, a new AMD driver and I've also included some SLI and CrossFire results.
I will fully admit to being jaded by the industry on many occasions. I love my PC games and I love hardware but it takes a lot for me to get genuinely excited about anything. After hearing game reviewers talk up the newest installment of the Tomb Raider franchise, Rise of the Tomb Raider, since its release on the Xbox One last year, I've been waiting for its PC release to give it a shot with real hardware. As you'll see in the screenshots and video in this story, the game doesn't appear to disappoint.
Rise of the Tomb Raider takes the exploration and "tomb raiding" aspects that made the first games in the series successful and applies them to the visual quality and character design brought in with the reboot of the series a couple years back. The result is a PC game that looks stunning at any resolution, but even more so in 4K, that pushes your hardware to its limits. For single GPU performance, even the GTX 980 Ti and Fury X struggle to keep their heads above water.
In this short article we'll look at the performance of Rise of the Tomb Raider with a handful of GPUs, leaning towards the high end of the product stack, and offer up my view on whether each hardware vendor is living up to expectations.
Are Computers Still Getting Faster?
It looks like CES is starting to wind down, which makes sense because it ended three days ago. Now that we're mostly caught up, I found a new video from The 8-Bit Guy. He doesn't really explain any old technologies in this one. Instead, he poses an open question about computer speed. He was able to have a functional computing experience on a ten-year-old Apple laptop, which made him wonder if the rate of computer advancement is slowing down.
I believe that he (and his guest hosts) made great points, but also missed a few important ones.
One of his main arguments is that software seems to have slowed down relative to hardware. I don't believe that is true, but I believe it's looking in the right area. PCs these days are more than capable of doing just about anything in terms of 2D user interface that we would want to, and do so with a lot of overhead for inefficient platforms and sub-optimal programming (relative to the 80's and 90's at the very least). The areas that require extra horsepower are usually doing large batches of many related tasks. GPUs are key in this area, and they are keeping up as fast as they can, despite some stagnation with fabrication processes and a difficulty (at least before HBM takes hold) in keeping up with memory bandwidth.
For the last five to ten years or so, CPUs have been evolving toward efficiency as GPUs are being adopted for the tasks that need to scale up. I'm guessing that AMD, when they designed the Bulldozer architecture, hoped that GPUs would have been adopted much more aggressively, but even as graphics devices, they now have a huge effect on Web, UI, and media applications.
These are also tasks that can scale well between devices by lowering resolution (and so forth). The primary thing that a main CPU thread needs to do is figure out the system's state and keep the graphics card fed before the frame-train leaves the station. In my experience, that doesn't scale well (although you can sometimes reduce the amount of tracked objects for games and so forth). Moreover, it is easier to add GPU performance, compared to single-threaded CPU, because increasing frequency and single-threaded IPC should be more complicated than planning out more, duplicated blocks of shaders. These factors combine to give lower-end hardware a similar experience in the most noticeable areas.
So, up to this point, we discussed:
- Software is often scaling in ways that are GPU (and RAM) limited.
- CPUs are scaling down in power more than up in performance.
- GPU-limited tasks can often be approximated with smaller workloads.
- Software gets heavier, but it doesn't need to be "all the way up" (ex: resolution).
- Some latencies are hard to notice anyway.
Back to the Original Question
This is where “Are computers still getting faster?” can be open to interpretation.
Tasks are diverging from one class of processor into two, and both have separate industries, each with their own, multiple goals. As stated, CPUs are mostly progressing in power efficiency, which extends a (presumably) sufficient amount of performance downward to multiple types of devices. GPUs are definitely getting faster, but they can't do everything. At the same time, RAM is plentiful but its contribution to performance can be approximated with paging unused chunks to the hard disk or, more recently on Windows, compressing them in-place. Newer computers with extra RAM won't help as long as any single task only uses a manageable amount of it -- unless it's seen from a viewpoint that cares about multi-tasking.
In short, computers are still progressing, but the paths are now forked and winding.
AMD Polaris Architecture Coming Mid-2016
In early December, I was able to spend some time with members of the newly formed Radeon Technologies Group (RTG), which is a revitalized and compartmentalized section of AMD that is taking over all graphics work. During those meetings, I was able to learn quite a bit about the plans for RTG going forward, including changes for AMD FreeSync and implementation of HDR display technology, and their plans for the GPUOpen open-sourced game development platform. Perhaps most intriguing of all: we received some information about the next-generation GPU architecture, targeted for 2016.
Codenamed Polaris, this new architecture will be the 4th generation of GCN (Graphics Core Next), and it will be the first AMD GPU that is built on FinFET process technology. These two changes combined promise to offer the biggest improvement in performance per watt, generation to generation, in AMD’s history.
Though the amount of information provided about the Polaris architecture is light, RTG does promise some changes to the 4th iteration of its GCN design. Those include primitive discard acceleration, an improved hardware scheduler, better pre-fetch, increased shader efficiency, and stronger memory compression. We have already discussed in a previous story that the new GPUs will include HDMI 2.0a and DisplayPort 1.3 display interfaces, which offer some impressive new features and bandwidth. From a multimedia perspective, Polaris will be the first GPU to include support for h.265 4K decode and encode acceleration.
This slide shows us quite a few changes coming to Polaris, most of which were never discussed in enough detail for us to report. Geometry processing and the memory controller stand out as potentially interesting to me – AMD’s Fiji design continues to lag behind NVIDIA’s Maxwell in terms of tessellation performance and we would love to see that shift. I am also very curious to see how the memory controller is configured on the entire Polaris lineup of GPUs – we saw the introduction of HBM (high bandwidth memory) with the Fury line of cards.
May the Radeon be with You
In celebration of the release of The Force Awakens as well as the new Star Wars Battlefront game from DICE and EA, AMD sent over some hardware for us to use in a system build, targeted at getting users up and running in Battlefront with impressive quality and performance, but still on a reasonable budget. Pairing up an AMD processor, MSI motherboard, Sapphire GPU with a low cost chassis, SSD and more, the combined system includes a FreeSync monitor for around $1,200.
Holiday breaks are MADE for Star Wars Battlefront
Though the holiday is already here and you'd be hard pressed to build this system in time for it, I have a feeling that quite a few of our readers and viewers will find themselves with some cash and gift certificates in hand, just ITCHING for a place to invest in a new gaming PC.
The video above includes a list of components, the build process (in brief) and shows us getting our gaming on with Star Wars Battlefront. Interested in building a system similar to the one above on your own? Here's the hardware breakdown.
|AMD Powered Star Wars Battlefront System|
|Processor||AMD FX-8370 - $197; Cooler Master Hyper 212 EVO - $29|
|Motherboard||MSI 990FXA Gaming - $137|
|Memory||AMD Radeon Memory DDR3-2400 - $79|
|Graphics Card||Sapphire NITRO Radeon R9 380X - $266|
|Storage||SanDisk Ultra II 240GB SSD - $79|
|Case||Corsair Carbide 300R - $68|
|Power Supply||Seasonic 600 watt 80 Plus - $69|
|Monitor||AOC G2460PF 1920x1080 144Hz FreeSync - $259|
|Total Price||Full System (without monitor) - Amazon.com - $924|
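The total in the table can be checked with a few lines of Python. The prices are copied from the rows above; the dictionary keys are just labels chosen for this sketch.

```python
# Prices in US dollars, as listed in the hardware breakdown.
parts = {
    "AMD FX-8370": 197,
    "Cooler Master Hyper 212 EVO": 29,
    "MSI 990FXA Gaming": 137,
    "AMD Radeon Memory": 79,
    "Sapphire NITRO Radeon R9 380X": 266,
    "SanDisk Ultra II 240GB SSD": 79,
    "Corsair Carbide 300R": 68,
    "Seasonic 600 watt 80 Plus": 69,
}
monitor = 259  # AOC G2460PF

system_total = sum(parts.values())
print(f"System without monitor: ${system_total}")        # $924
print(f"System with monitor: ${system_total + monitor}")  # $1183
```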
For under $1,000, plus another $250 or so for the AOC FreeSync capable 1080p monitor, you can have a complete gaming rig for your winter break. Let's detail some of the specific components.
AMD sent over the FX-8370 processor for our build, a 4-module / 8-core CPU that runs at 4.0 GHz, more than capable of handling any gaming workload you can toss at it. And if you need to do some transcoding, video work or, heaven forbid, school or productivity work, the FX-8370 has you covered there too.
For the motherboard AMD sent over the MSI 990FXA Gaming board, one of the newer AMD platforms that includes support for USB 3.1 so you'll have a good length of usability for future expansion. The Cooler Master Hyper 212 EVO cooler was our selection to keep the FX-8370 running smoothly and 8GB of AMD Radeon DDR3-2133 memory is enough for the system to keep applications and the Windows 10 operating system happy.
Open Source your GPU!
As part of AMD’s recent RTG (Radeon Technologies Group) Summit in Sonoma, the company released information about a new initiative to help drive development and evolution in the world of gaming called GPUOpen. As the name implies, the idea is to apply an open source mentality to drivers, libraries, SDKs and more to improve the relationship between AMD’s hardware and the gaming development ecosystem.
When the current generation of consoles was first announced, AMD was riding a wave of positive PR that it hadn’t felt in many years. Because AMD Radeon hardware was at the root of the PlayStation 4 and the Xbox One, game developers would become much more adept at programming for AMD’s GCN architecture and that would waterfall down to PC gamers. At least, that was the plan. In practice though I think you’d be hard pressed to find any analyst to put their name on a statement claiming that proclamation from AMD actually transpired. It just hasn’t happened – but that does not mean that it still can’t if all the pieces fall into place.
The issue that AMD, NVIDIA, and game developers have to work around is a divided development ecosystem. While on the console side programmers tend to have very close to the metal access on CPU and GPU hardware, that hasn’t been the case with PCs until very recently. AMD was the first to make moves in this area with the Mantle API but now we have DirectX 12, a competing low level API, that will have much wider reach than Mantle or Vulkan (what Mantle has become).
AMD also believes, as do many developers, that a “black box” development environment for tools and effects packages is having a negative effect on the PC gaming ecosystem. The black box mentality means that developers don’t have access to the source code of some packages and thus cannot tweak performance and features to their liking.
What RTG has planned for 2016
Last week the Radeon Technologies Group invited a handful of press and analysts to a secluded location in Sonoma, CA to discuss the future of graphics, GPUs and of course Radeon. For those of you that seem a bit confused, the RTG (Radeon Technologies Group) was spun up inside AMD to encompass all of the graphics products and IP inside the company. Though today’s story is not going to focus on the fundamental changes that RTG brings to the future of AMD, I will note, without commentary, that we saw not a single AMD logo in our presentations or in the signage present throughout the week.
Much of what I learned during the RTG Summit in Sonoma is under NDA and will likely be so for some time. We learned about the future architectures, direction and product theories that will find their way into a range of solutions available in 2016 and 2017.
What I can discuss today is a pair of features that are being updated and improved for current generation graphics cards and for Radeon GPUs coming in 2016: FreeSync and HDR displays. The former is one that readers of PC Perspective should be very familiar with while the latter will offer a new window into content coming in late 2016.
High Dynamic Range Displays: Better Pixels
In just the last couple of years we have seen a spike in resolution for mobile, desktop and notebook displays. We now regularly have 4K monitors on sale for around $500 and very good quality 4K panels going for something in the $1000 range. Couple that with the increase in market share of 21:9 panels with 3440x1440 resolutions and clearly there is a demand from consumers for a better visual experience on their PCs.
But what if the answer isn’t just more pixels, but better pixels? We already see this discussed weekly when comparing games rendered at 4K with lower image quality settings versus 2560x1440 at maximum IQ settings (for example), but the truth is that panel technology has the ability to make a dramatic change to how we view all content – games, movies, productivity – with the introduction of HDR, high dynamic range.
As the slide above demonstrates there is a wide range of luminance in the real world that our eyes can see. Sunlight crosses the 1.6 billion nits mark while basic fluorescent lighting in our homes and offices exceeds 10,000 nits. Compare to the most modern PC displays that range from 0.1 nits to 250 nits and you can already tell where the discussion is heading. Even the best LCD TVs on the market today have a range of 0.1 to 400 nits.
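To put those ranges in perspective, the sketch below converts the quoted display figures into a contrast ratio and a number of photographic "stops" (doublings of luminance). The stops framing is an assumption added here for illustration, not something taken from the slide.

```python
import math


def dynamic_range(min_nits, max_nits):
    """Return (contrast ratio, stops) for a luminance range given in nits."""
    ratio = max_nits / min_nits
    stops = math.log2(ratio)  # one stop = one doubling of luminance
    return ratio, stops


displays = [
    ("modern PC display", 0.1, 250),
    ("best LCD TVs", 0.1, 400),
]

for label, low, high in displays:
    ratio, stops = dynamic_range(low, high)
    print(f"{label}: {ratio:.0f}:1, about {stops:.1f} stops")
```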
FreeSync and Frame Pacing Get a Boost
Make sure you catch today's live stream we are hosting with AMD to discuss much more about the new Radeon Software Crimson driver. We are giving away four Radeon graphics cards as well!! Find all the information right here.
Earlier this month AMD announced plans to end the life of the Catalyst Control Center application for control of your Radeon GPU, introducing a new brand simply called Radeon Software. The first iteration of this software, Crimson, is being released today and includes some impressive user experience changes that are really worth seeing and, well, experiencing.
Users will no doubt lament the age of the previous Catalyst Control Center; it was slow, clunky and difficult to navigate around. Radeon Software Crimson changes all of this with a new UI, a new backend that allows it to start up almost instantly, as well as a handful of new features that might be a surprise to some of our readers. Here's a quick rundown of what stands out to me:
- Opens in less than a second in my testing
- Completely redesigned and modern user interface
- Faster display initialization
- New clean install utility (separate download)
- Per-game Overdrive (overclocking) settings
- LiquidVR integration
- FreeSync improvements at low frame rates
- FreeSync planned for HDMI (though not implemented yet)
- Frame pacing support in DX9 titles
- New custom resolution support
- Desktop-based Virtual Super Resolution
- Directional scaling for 2K to 4K upscaling (Fiji GPUs only)
- Shader cache (precompiled) to reduce compiling-induced frame time variance
- Non-specific DX12 improvements
- Flip queue size optimizations (frame buffer length) for specific games
- Wider target range for Frame Rate Target Control
That's quite a list of new features, some of which will be more popular than others, but it looks like there should be something for everyone to love about the new Crimson software package from AMD.
For this story today I wanted to focus on two of the above features that have long been a sticking point for me, and see how well AMD has fixed them with the first release of Radeon Software.
FreeSync: Low Frame Rate Compensation
I might be slightly biased, but I don't think anyone has done a more thorough job of explaining and diving into the differences between AMD FreeSync and NVIDIA G-Sync than the team at PC Perspective. Since day one of the G-Sync variable refresh release we have been following the changes and capabilities of these competing features and writing about what really separates them from a technological point of view, not just pricing and perceived experiences.
Four High Powered Mini ITX Systems
Thanks to Sebastian for helping me out with some of the editorial for this piece and to Ken for doing the installation and testing on the system builds! -Ryan
Update (1/23/16): Now that the AMD Radeon R9 Nano is priced at just $499, it becomes an even better solution for these builds, dropping prices by $150 each.
While some might wonder where the new Radeon R9 Nano fits in a market that offers the AMD Fury X for the same price, the Nano is a product that defines a new category in the PC enthusiast community. It is a full-scale GPU on an impossibly small 6-inch PCB, containing the same core as the larger liquid-cooled Fury X, but requiring 100 watts less power than Fury X and cooled by a single-fan dual-slot air cooler.
The R9 Nano design screams compatibility. It has the ability to fit into virtually any enclosure (including many of the smallest mini-ITX designs), as long as the case supports a dual-slot (full height) GPU. The total board length of 6 inches is shorter than a mini-ITX motherboard, which is 6.7 inches square! Truly, the Nano has the potential to change everything when it comes to selecting a small form-factor (SFF) enclosure.
Typically, a gaming-friendly enclosure would need at minimum a ~270 mm GPU clearance, as a standard 10.5-inch reference GPU translates into 266.7 mm in length. Even very small mini-ITX enclosures have had to position components specifically to allow for these longer cards – if they wanted to be marketed as compatible with a full-size GPU solution, of course. Now with the R9 Nano, smaller and more powerful than any previous ITX-specific graphics card to date, one of the first questions we had was a pretty basic one: what enclosure should we put this R9 Nano into?
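The length arithmetic above is easy to reproduce. In this sketch the card lengths come from the article, while the 200 mm clearance is a hypothetical value standing in for a compact case that cannot take a full-length card.

```python
MM_PER_INCH = 25.4


def card_length_mm(length_in):
    """Convert a card length from inches to millimetres."""
    return length_in * MM_PER_INCH


def fits(length_in, clearance_mm):
    """True if a card of the given length fits within the case's GPU clearance."""
    return card_length_mm(length_in) <= clearance_mm


print(round(card_length_mm(10.5), 1))  # reference-length card
print(round(card_length_mm(6.0), 1))   # R9 Nano

compact_case_mm = 200  # hypothetical clearance, not a real case spec
print(fits(10.5, 270), fits(10.5, compact_case_mm), fits(6.0, compact_case_mm))
```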
With no shortage of enclosures at our disposal to try out a build with this new card, we quickly discovered that many of them shared a design choice: room for a full-length GPU. So, what’s the advantage of the Nano’s incredibly compact size? It must be pointed out that the larger (and faster) Fury X has the same MSRP, and at 7.5 inches the Fury X will fit comfortably in cases that have spacing for the necessary radiator.
Finding a Case for Nano
While even some of the tiniest mini-ITX enclosures (EVGA Hadron, NCASE M1, etc.) offer support for a 10.5-in GPU, there are several compact mini-ITX cases that don’t support a full-length graphics card due to their small footprint. While by no means a complete list, here are some of the options out there (note: there are many more mini-ITX cases that don’t support a full-height or dual-slot expansion card at all, such as slim HTPC enclosures):
|Manufacturer||Model||Price, Retailer|
|Cooler Master||Elite 110||$47.99, Amazon.com|
|Lian Li||PC-O5||$377, Amazon.com|
|Lian Li||PC-Q01||$59.99, Newegg.com|
|Lian Li||PC-Q03||$74.99, Newegg.com|
|Lian Li||PC-Q07||$71.98, Amazon.com|
|Lian Li||PC-Q30||$139.99, Newegg.com|
|Lian Li||PC-Q33||$134.99, Newegg.com|
|Rosewill||Legacy V3 Plus-B||$59.99, Newegg.com|
The list is dominated by Lian Li, who offers a number of cube-like mini-ITX enclosures that would ordinarily be out of the question for a gaming rig, unless one of the few ITX-specific cards were chosen for the build. Many other fine enclosure makers (Antec, BitFenix, Corsair, Fractal Design, SilverStone, etc.) offer mini-ITX enclosures that support full-length GPUs, as this has pretty much become a requirement for an enthusiast PC case.
Last month NVIDIA introduced the world to the GTX 980 in a new form factor for gaming notebooks. Using the same Maxwell GPU, the same performance levels but with slightly tweaked power delivery and TDPs, notebooks powered by the GTX 980 promise to be a noticeable step faster than anything before them.
Late last week I got my hands on the updated MSI GT72S Dominator Pro G, the first retail ready gaming notebook to not only integrate the new GTX 980 GPU but also an unlocked Skylake mobile processor.
This machine is something to behold - though it looks very similar to previous GT72 versions, this machine hides hardware unlike anything we have been able to carry in a backpack before. And the sexy red exterior with the MSI Dragon Army logo emblazoned across the back definitely helps it to stand out in a crowd. If you happen to be in a crowd of notebooks.
A quick spin around the GT72S reveals a sizeable collection of hardware and connections. On the left you'll find a set of four USB 3.0 ports as well as four audio inputs and outputs and an SD card reader.
On the opposite side there are two more USB 3.0 ports (totalling six) and the optical / Blu-ray burner. With that many USB 3.0 ports you should never struggle with accessories availability - headset, mouse, keyboard, hard drive and portable fan? Check.
GPU Enthusiasts Are Throwing a FET
NVIDIA is rumored to launch Pascal in early (~April-ish) 2016, although some are skeptical that it will even appear before the summer. The design was finalized months ago, and unconfirmed shipping information claims that chips are being stockpiled, which is typical when preparing to launch a product. It is expected to compete against AMD's rumored Arctic Islands architecture, which will, according to its also rumored numbers, be very similar to Pascal.
This architecture is a big one for several reasons.
Image Credit: WCCFTech
First, it will jump two full process nodes. Current desktop GPUs are manufactured at 28nm, which NVIDIA first used with the GeForce GTX 680 all the way back in early 2012, but Pascal will be manufactured on TSMC's 16nm FinFET+ technology. Smaller features have several advantages, but a huge one for GPUs is the ability to fit more complex circuitry in the same die area. This means that you can include more copies of elements, such as shader cores, and do more in fixed-function hardware, like video encode and decode.
That said, we got a lot more life out of 28nm than we really should have. Chips like GM200 and Fiji are huge, relatively power-hungry, and complex, which is a terrible idea to produce when yields are low. I asked Josh Walrath, who is our go-to for analysis of fab processes, and he believes that FinFET+ is probably even more complicated today than 28nm was in the 2012 timeframe, which was when it launched for GPUs.
It's two full steps forward from where we started, but we've been tiptoeing since then.
Image Credit: WCCFTech
Second, Pascal will introduce HBM 2.0 to NVIDIA hardware. HBM 1.0 was introduced with AMD's Radeon Fury X, and it helped in numerous ways -- from smaller card size to a triple-digit percentage increase in memory bandwidth. The 980 Ti can talk to its memory at about 300GB/s, while Pascal is rumored to push that to 1TB/s. Capacity won't be sacrificed, either. The top-end card is expected to contain 16GB of global memory, which is twice what any console has. This means less streaming, higher resolution textures, and probably even left-over scratch space for the GPU to generate content in with compute shaders. Also, according to AMD, HBM is an easier architecture to communicate with than GDDR, which should mean a savings in die space that could be used for other things.
Third, the architecture includes native support for three levels of floating point precision. Maxwell, due to how limited 28nm was, saved on complexity by reducing 64-bit IEEE 754 floating-point performance to 1/32nd of the 32-bit rate, because FP64 values are rarely used in video games. This saved transistors, but was a huge, order-of-magnitude step back from the 1/3rd ratio found on the Kepler-based GK110. While it probably won't be back to the 1/2 ratio that was found in Fermi, Pascal should be much better suited for GPU compute.
Image Credit: WCCFTech
Mixed precision could help video games too, though. Remember how I said it supports three levels? The third one is 16-bit, which is half of the format that is commonly used in video games. Sometimes, that is sufficient. If so, Pascal is said to do these calculations at twice the rate of 32-bit. We'll need to see whether enough games (and other applications) are willing to drop down in precision to justify the die space that these dedicated circuits require, but it should double the performance of anything that does.
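To get a feel for what dropping down in precision costs, the following sketch round-trips a value through the 16-bit, 32-bit, and 64-bit IEEE 754 formats using Python's standard `struct` module. It runs on the CPU and says nothing about how Pascal implements these formats; the helper name `round_trip` is made up for this example.

```python
import struct


def round_trip(value, fmt):
    """Store a value in the given IEEE 754 format and read it back as a Python float."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# struct format codes: e = 16-bit half, f = 32-bit single, d = 64-bit double
FORMATS = {"16-bit": "<e", "32-bit": "<f", "64-bit": "<d"}

for name, fmt in FORMATS.items():
    stored = round_trip(0.1, fmt)
    print(f"{name}: {stored!r} (error {abs(stored - 0.1):.2e})")

# Half precision has an 11-bit significand, so not every integer above 2048
# can be represented exactly; 2049 gets rounded to a neighbouring value.
print(round_trip(2048.0, "<e"), round_trip(2049.0, "<e"))
```

For colors or normalized vectors an error of this size is often invisible, which is the case where 16-bit math is "sufficient"; for large world coordinates it would not be.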
So basically, this generation should provide the massive jump in performance that enthusiasts have been waiting for. GPU memory bandwidth and the number of features that can be printed into the die are two major bottlenecks for most modern games and GPU-accelerated software, and both are set to increase. We'll need to wait for benchmarks to see how the theoretical maps to practical, but it's a good sign.
When approached a couple of weeks ago by Microsoft with the opportunity to take an early look at an upcoming performance benchmark built on a DX12 game pending release later this year, I of course was excited for the opportunity. Our adventure into the world of DirectX 12 and performance evaluation started with the 3DMark API Overhead Feature Test back in March and was followed by the release of the Ashes of the Singularity performance test in mid-August. Both of these tests were pinpointing one particular aspect of the DX12 API - the ability to improve CPU throughput and efficiency with higher draw call counts and thus enabling higher frame rates on existing GPUs.
This game and benchmark are beautiful...
Today we dive into the world of Fable Legends, an upcoming free-to-play game set in the world of Albion. This title will be released on the Xbox One and for Windows 10 PCs and it will require the use of DX12. Though scheduled for release in Q4 of this year, Microsoft and Lionhead Studios allowed us early access to a specific performance test using the UE4 engine and the world of Fable Legends. UPDATE: It turns out that the game will have a fall-back DX11 mode that will be enabled if the game detects a GPU incapable of running DX12.
This benchmark focuses more on the GPU side of DirectX 12 - on improved rendering techniques and visual quality rather than on the CPU scaling aspects that made Ashes of the Singularity stand out from other graphics tests we have utilized. Fable Legends is more representative of what we expect to see with the release of AAA games using DX12. Let's dive into the test and our results!