To the Max?
Much of the PC enthusiast internet, including our comments section, has been abuzz with “Asynchronous Shader” discussion. Normally, I would explain what it is and then outline the issues that surround it, but I would like to swap that order this time. Basically, the Ashes of the Singularity benchmark utilizes Asynchronous Shaders in DirectX 12, but its developer, Oxide Games, disables the feature (by Vendor ID) for NVIDIA hardware. They say that this is because, while the driver reports compatibility, “attempting to use it was an unmitigated disaster in terms of performance and conformance”.
AMD's Robert Hallock claims that NVIDIA GPUs, including Maxwell, cannot support the feature in hardware at all, while all AMD GCN graphics cards do. NVIDIA has yet to respond to our requests for an official statement, although we haven't poked every one of our contacts yet. We will certainly update and/or follow up if we hear from them. For now though, we have no idea whether this is a hardware or software issue. Either way, it seems more than just politics.
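To make the "disable it by Vendor ID" idea concrete, here is a minimal sketch of how an application might gate a feature on the adapter's PCI vendor ID. This is not Oxide's code. The deny-list and the `use_async_compute` function are hypothetical names invented for illustration. Only the two vendor ID values are real (they are the IDs assigned to AMD and NVIDIA by PCI-SIG).

```python
# PCI vendor IDs as assigned by PCI-SIG.
VENDOR_AMD = 0x1002
VENDOR_NVIDIA = 0x10DE

# Hypothetical deny-list: vendors whose driver reports the feature,
# but where the application chooses not to use it anyway.
ASYNC_COMPUTE_DENYLIST = {VENDOR_NVIDIA}


def use_async_compute(vendor_id: int, driver_reports_support: bool) -> bool:
    """Enable the feature only if the driver claims support AND the
    vendor is not on the application's deny-list."""
    return driver_reports_support and vendor_id not in ASYNC_COMPUTE_DENYLIST


if __name__ == "__main__":
    for name, vid in (("AMD", VENDOR_AMD), ("NVIDIA", VENDOR_NVIDIA)):
        enabled = use_async_compute(vid, driver_reports_support=True)
        print(f"{name} ({vid:#06x}): async compute enabled = {enabled}")
```

The point of the sketch is that the driver's own capability report is not the last word: the application can override it per vendor, which is what the benchmark reportedly does.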
So what is it?
Simply put, Asynchronous Shaders allow a graphics driver to cram workloads into portions of the GPU that are idle but not otherwise available. For instance, if a graphics task is hammering the ROPs, the driver would be able to toss an independent physics or post-processing task into the shader units alongside it. Kollock from Oxide Games used the analogy of HyperThreading, which allows two CPU threads to be executed on the same core at the same time, as long as it has the capacity for it.
Kollock also notes that compute is becoming more important in the graphics pipeline, and it is possible to bypass graphics altogether. The fixed-function bits may never go away, but it's possible that at least some engines will completely bypass them -- maybe even their engine, several years down the road.
But, like always, you will not get an infinite amount of performance by reducing your waste. You are always bound by the theoretical limits of your components, and you cannot optimize past that (except for obviously changing the workload itself). The interesting part is: you can measure that. You can absolutely observe how long a GPU is idle, and represent it as a percentage of a time-span (typically a frame).
And, of course, game developers profile GPUs from time to time...
According to Kollock, he has heard of some console developers getting up to 30% increases in performance using Asynchronous Shaders. Again, this is on console hardware, so this amount may increase or decrease on the PC. I also had an informal chat with a developer at Epic Games, so a massive grain of salt is required. His late-night, “totally speculative” ballpark guesstimate is that, on the Xbox One, the GPU could theoretically accept a maximum of ~10-25% more work in Unreal Engine 4, depending on the scene. He also said that memory bandwidth gets in the way, which Asynchronous Shaders would be fighting against. It is something that they are interested in and investigating, though.
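The idle-time measurement described above translates into a hard ceiling fairly directly. The sketch below is a simplification that treats the GPU as a single unit that is either busy or idle and ignores the memory bandwidth contention the Epic developer mentioned. The function names and the 16.0 ms / 12.8 ms figures are made up for illustration and are not profiler data.

```python
def idle_fraction(frame_ms: float, busy_ms: float) -> float:
    """Share of the frame during which the GPU had nothing to do."""
    if frame_ms <= 0 or not 0 <= busy_ms <= frame_ms:
        raise ValueError("need 0 <= busy_ms <= frame_ms and frame_ms > 0")
    return (frame_ms - busy_ms) / frame_ms


def max_extra_work(frame_ms: float, busy_ms: float) -> float:
    """Upper bound on additional work, relative to the work already being
    done, if every idle millisecond could be filled perfectly."""
    if busy_ms <= 0 or busy_ms > frame_ms:
        raise ValueError("need 0 < busy_ms <= frame_ms")
    return (frame_ms - busy_ms) / busy_ms


if __name__ == "__main__":
    # Hypothetical frame: 16.0 ms long, GPU busy for 12.8 ms of it.
    frame_ms, busy_ms = 16.0, 12.8
    print(f"idle: {idle_fraction(frame_ms, busy_ms):.0%}")
    print(f"ceiling on extra work: {max_extra_work(frame_ms, busy_ms):.0%}")
```

Note that the two numbers differ: a GPU that is idle for a fifth of the frame can, at best, take on a quarter more work than it is already doing. No amount of scheduling cleverness gets past that bound without changing the workload itself.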
This is where I speculate on drivers. When Mantle was announced, I looked at its features and said “wow, this is everything that a high-end game developer wants, and a graphics developer absolutely does not”. From the OpenCL-like multiple GPU model taking much of the QA out of SLI and CrossFire, to the memory and resource binding management, this should make graphics drivers so much easier.
It might not be free, though. Graphics drivers might still have a bunch of games to play to make sure that work is stuffed through the GPU as tightly packed as possible. We might continue to see “Game Ready” drivers in the coming years, even though much of that burden has been shifted to the game developers. On the other hand, maybe these APIs will level the whole playing field and let all players focus on chip design and efficient ingestion of shader code. As always, painfully always, time will tell.
It's Basically a Function Call for GPUs
Mantle, Vulkan, and DirectX 12 all claim to reduce overhead and provide a staggering increase in “draw calls”. As mentioned in the previous editorial, loading a graphics card with tasks will change drastically in these new APIs. With DirectX 10 and earlier, applications would assign attributes to (what they are told is) the global state of the graphics card. After everything is configured and bound, one of a few “draw” functions is called, which queues the task in the graphics driver as a “draw call”.
While this suggests that just a single graphics device is to be defined, which we also mentioned in the previous article, it also implies that one thread needs to be the authority. This limitation was known about for a while, and it contributed to the meme that consoles can squeeze all the performance they have, but PCs are “too high level” for that. Microsoft tried to combat this with “Deferred Contexts” in DirectX 11. This feature allows virtual, shadow states to be loaded from secondary threads, which can be appended to the global state, whole. It was a compromise between each thread being able to create its own commands, and the legacy decision to have a single, global state for the GPU.
Some developers experienced gains, while others lost a bit. It didn't live up to expectations.
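The compromise described above can be pictured with a toy model. This is plain Python, not the Direct3D 11 API: worker threads each record commands into a private list (standing in for a deferred context), and a single authority then appends each finished list, whole and in order, to the one global queue. The function names and the `("set_state", ...)` / `("draw", ...)` tuples are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def record_command_list(object_ids):
    """Runs on a worker thread: record state changes and draws into a
    private list that no other thread touches."""
    commands = []
    for oid in object_ids:
        commands.append(("set_state", oid))
        commands.append(("draw", oid))
    return commands


def build_frame(batches, workers=4):
    """Record all batches in parallel, then append each command list,
    whole, to the single queue owned by the calling thread."""
    immediate_queue = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so submission order is
        # deterministic even though recording ran in parallel.
        for command_list in pool.map(record_command_list, batches):
            immediate_queue.extend(command_list)
    return immediate_queue


if __name__ == "__main__":
    for command in build_frame([[1, 2], [3]]):
        print(command)
```

Recording is parallel, but submission still funnels through one thread. That serial step is the part of the legacy design that survived the compromise.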
The paradigm used to load graphics cards is the problem. It doesn't make sense anymore. A developer might not want to draw a primitive with every poke of the GPU. At times, they might want to shove a workload of simple linear algebra through it, while other requests could simply be pushing memory around to set up a later task (or to read the result of a previous one). More importantly, any thread could want to do this to any graphics device.
The new graphics APIs allow developers to submit their tasks quicker and smarter, and they allow the drivers to schedule compatible tasks better, even simultaneously. In fact, the driver's job has been massively simplified altogether. When we tested 3DMark back in March, two interesting things were revealed:
- AMD and NVIDIA are only a two-digit percentage apart in draw call performance
- Both AMD and NVIDIA saw an order of magnitude increase in draw calls
Tick Tock Tick Tock Tick Tock Tock
A few websites have been re-reporting on a leak from BenchLife.info about Kaby Lake, which is supposedly a second 14nm redesign (“Tock”) to be injected between Skylake and Cannonlake.
UPDATE (July 2nd, 3:20pm ET): It has been pointed out that many hoaxes have come out of the same source, and that I should be more clear in my disclaimer. This is an unconfirmed, relatively easy to fake leak that does not have a second, independent source. I reported on it because (apart from being interesting enough) some details were listed on the images, but not highlighted in the leak, such as "GT0" and a lack of Iris Pro on -K. That suggests that the leaker got the images from somewhere, but didn't notice those details, which implies that the original source was hoaxed by an anonymous source, who only seeded the hoax to a single media outlet, or that it was an actual leak.
Either way, enjoy my analysis but realize that this is a single, unconfirmed source who allegedly published hoaxes in the past.
Image Credit: BenchLife.info
If true, this would be a major shift in both Intel's current roadmap as well as how they justify their research strategies. It also includes a rough stack of product categories, from 4.5W up to 91W TDPs, including their planned integrated graphics configurations. This leads to a pair of interesting stories:
- How Kaby Lake could affect Intel's processors going forward. Since 2006, Intel has only budgeted a single CPU architecture redesign for any given fabrication process node. Taking two attempts on the 14nm process buys time for 10nm to become viable, but it could also give them more time to build up a better library of circuit elements, allowing them to assemble better processors in the future.
- What type of user will be given Iris Pro? Also, will graphics-free options be available in the sub-Enthusiast class? When buying a processor from Intel, the high-end mainstream processors tend to have GT2-class graphics, such as the Intel HD 4600. Enthusiast architectures, such as Haswell-E, cannot be used without discrete graphics -- the extra space is used for more cores, I/O lanes, or other features. As we will discuss later, Broadwell took a step into changing the availability of Iris Pro in the high-end mainstream, but it doesn't seem like Kaby Lake will make any more progress. Also, if I am interpreting the table correctly, Kaby Lake might bring iGPU-less CPUs to LGA 1151.
Keeping Your Core Regular
To the first point, Intel has been on a steady tick-tock cycle since the Pentium 4 architecture reached the 65nm process node, which was a “tick”. The “tock” came from the Conroe/Merom architecture that was branded “Core 2”. This new architecture was a severe departure from the high clock, relatively low IPC design that Netburst was built around, which instantaneously changed the processor landscape from a dominant AMD to an Intel runaway lead.
After 65nm and Core 2 started the cycle, every new architecture alternated between shrinking the existing architecture to smaller transistors (tick) and creating a new design on the same fabrication process (tock). Even though Intel has been steadily increasing their R&D budget over time, which is now in the range of $10 to $12 billion USD each year, creating smaller, more intricate designs with new process nodes has been getting harder. For comparison, AMD's total revenue (not just profits) for 2014 was $5.51 billion USD.
Digging in a Little Deeper into the DiRT
Over the past few weeks I have had the chance to play the early access "DiRT Rally" title from Codemasters. This is a much more simulation based title that is currently PC only, which is a big switch for Codemasters and how they usually release their premier racing offerings. I was able to get a hold of Paul Coleman from Codemasters and set up a written interview with him. Paul's answers will be in italics.
Who are you, what do you do at Codemasters, and what do you do in your spare time away from the virtual wheel?
Hi my name is Paul Coleman and I am the Chief Games Designer on DiRT Rally. I’m responsible for making sure that the game is the most authentic representation of the sport it can be, I’m essentially representing the player in the studio. In my spare time I enjoy going on road trips with my family in our 1M Coupe. I’ve been co-driving in real world rally events for the last three years and I’ve used that experience to write and voice the co-driver calls in game.
If there is one area where DiRT has really excelled, it is keeping frame rate consistent throughout multiple environments. Many games, especially those using cutting edge rendering techniques, often have dramatic frame rate drops at times. How do you get around this while still creating a very impressive looking game?
The engine that DiRT Rally has been built on has been constantly iterated on over the years and we have always been looking at ways of improving the look of the game while maintaining decent performance. That together with the fact that we work closely with GPU manufacturers on each project ensures that we stay current. We also have very strict performance monitoring systems that have come from optimising games for console. These systems have proved very useful when building DiRT Rally even though the game is exclusively on PC.
How do you balance out different controller use cases? While many hard core racers use a wheel, I have seen very competitive racing from people using handheld controllers as well as keyboards. Do you handicap/help those particular implementations so as not to make it overly frustrating to those users? I ask due to the difference in degrees of precision that a gamepad has vs. a wheel that can rotate 900 degrees.
Again this comes back to the fact that we have traditionally developed for console where the primary input device is a handheld controller. This is an area that other sims don’t usually have to worry about but for us it was second nature. There are systems that we have that add a layer between the handheld controller or keyboard and the game which help those guys but the wheel is without a doubt the best way to experience DiRT Rally as it is a direct input.
Process Technology Overview
We have been very spoiled throughout the years. We likely did not realize exactly how spoiled we were until it became very obvious that the rate of process technology advances hit a virtual brick wall. Every 18 to 24 months we were treated to a new, faster, more efficient process node that was opened up to fabless semiconductor firms and we were treated to a new generation of products that would blow our hair back. Now we have been in a virtual standstill when it comes to new process nodes from the pure-play foundries.
Few expected the 28 nm node to live nearly as long as it has. Some of the first cracks in the façade actually came from Intel. Their 22 nm Tri-Gate (FinFET) process took a little bit longer to get off the ground than expected. We also noticed some interesting electrical features from the products developed on that process. Intel skewed away from higher clockspeeds and focused on efficiency and architectural improvements rather than staying at generally acceptable TDPs and leapfrogging the competition by clockspeed alone. Overclockers noticed that the newer parts did not reach the same clockspeed heights as previous products such as the 32 nm based Sandy Bridge processors. Whether this decision was intentional from Intel or not is debatable, but my gut feeling here is that they responded to the technical limitations of their 22 nm process. Yields and bins likely dictated the max clockspeeds attained on these new products. So instead of vaulting over AMD’s products, they just slowly started walking away from them.
Samsung is one of the first pure-play foundries to offer a working sub-20 nm FinFET product line. (Photo courtesy of ExtremeTech)
When 28 nm was released, the plans on the books were to transition to 20 nm products based on planar transistors, thereby bypassing the added expense of developing FinFETs. It was widely expected that FinFETs were not necessarily required to address the needs of the market. Sadly, that did not turn out to be the case. There are many other factors as to why 20 nm planar parts are not common, but the limitations of that particular process node have made it a relatively niche node that is appropriate for smaller, low power ASICs (like the latest Apple SOCs). The Apple A8 is rumored to be around 90 mm sq., which is a far cry from the traditional midrange GPU that goes from 250 mm sq. to 400+ mm sq.
The essential difficulty of the 20 nm planar node appears to be a lack of power scaling to match the increased transistor density. TSMC and others have successfully packed more transistors into every square mm as compared to 28 nm, but the electrical characteristics did not scale proportionally well. Yes, there are improvements there per transistor, but when designers pack all those transistors into a large design, TDP and voltage issues start to arise. As TDP increases, it takes more power to drive the processor, which then leads to more heat. The GPU guys probably looked at this and figured out that while they can achieve a higher transistor density and a wider design, they will have to downclock the entire GPU to hit reasonable TDP levels. When adding these concerns to yields and bins for the new process, the advantages of going to 20 nm would be slim to none at the end of the day.
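The downclocking argument above can be put into rough numbers. The sketch below is a deliberately crude model: it assumes power scales linearly with transistor count and with clock speed at a fixed voltage, which follows from the textbook dynamic power relation (power proportional to capacitance times voltage squared times frequency). The 1.9x density gain and 0.75x per-transistor power figures are hypothetical placeholders, not measurements of any real 20 nm process.

```python
def relative_power(density_gain: float, per_transistor_power: float,
                   clock_ratio: float = 1.0) -> float:
    """Power of a same-area design on the new node, relative to the old
    node, under the linear-scaling assumption described above."""
    return density_gain * per_transistor_power * clock_ratio


def clock_ratio_for_same_tdp(density_gain: float,
                             per_transistor_power: float) -> float:
    """Clock speed (relative to the old design) needed to stay at the
    old TDP. Capped at 1.0: this model never predicts a clock increase."""
    return min(1.0, 1.0 / (density_gain * per_transistor_power))


if __name__ == "__main__":
    # HYPOTHETICAL values, chosen only to illustrate the argument.
    density_gain = 1.9          # transistors per square mm, new vs. old
    per_transistor_power = 0.75  # power per transistor, new vs. old
    print(f"power at same clock: "
          f"{relative_power(density_gain, per_transistor_power):.3f}x")
    print(f"clock needed for same TDP: "
          f"{clock_ratio_for_same_tdp(density_gain, per_transistor_power):.3f}x")
```

Whenever the per-transistor savings are smaller than the density gain, the product exceeds 1.0 and the clock has to come down to compensate. That is the trade the GPU designers were presumably looking at.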
Project Lead: Joris-Jan van ‘t Land
Thanks to Ian Comings, guest writer from the PC Perspective Forums who conducted the interview of Bohemia Interactive's Joris-Jan van ‘t Land. If you are interested in learning more about ArmA 3 and hanging out with some PC gamers to play it, check out the PC Perspective Gaming Forum!
I recently got the chance to send some questions to Bohemia Interactive, a computer game development company based out of Prague, Czech Republic, and a member of IDEA Games. Bohemia Interactive was founded in 1999 by CEO Marek Španěl, and it is best known for PC gaming gems like Operation Flashpoint: Cold War Crisis, The ArmA series, Take On Helicopters, and DayZ. The questions are answered by ArmA 3's Project Lead: Joris-Jan van ‘t Land.
PC Perspective: How long have you been at Bohemia Interactive?
VAN ‘T LAND: All in all, about 14 years now.
PC Perspective: What inspired you to become a Project Lead at Bohemia Interactive?
VAN ‘T LAND: During high school, it was pretty clear to me that I wanted to work in game development, and just before graduation, a friend and I saw a first preview for Operation Flashpoint: Cold War Crisis in a magazine. It immediately looked amazing to us; we were drawn to the freedom and diversity it promised and the military theme. After helping run a fan website (Operation Flashpoint Network) for a while, I started to assist with part-time external design work on the game (scripting and scenario editing). From that point, I basically grew naturally into this role at Bohemia Interactive.
PC Perspective: What part of working at Bohemia Interactive do you find most satisfying? What do you find most challenging?
VAN ‘T LAND: The amount of freedom and autonomy is very satisfying. If you can demonstrate skills in some area, you're welcome to come up with random ideas and roll with them. Some of those ideas can result in official releases, such as Arma 3 Zeus. Another rewarding aspect is the near real-time connection to those people who are playing the game. Our daily Dev-Branch release means the work I do on Monday is live on Tuesday. Our own ambitions, on the other hand, can sometimes result in some challenges. We want to do a lot and incorporate every aspect of combat in Arma, but we're still a relatively small team. This can mean we bite off more than we can deliver at an acceptable level of quality.
PC Perspective: What are some of the problems that have plagued your team, and how have they been overcome?
VAN ‘T LAND: One key problem for us was that we had no real experience with developing a game in more than one physical location. For Arma 3, our team was split over two main offices, which caused quite a few headaches in terms of communication and data synchronization. We've since had more key team members travel between the offices more frequently and improved our various virtual communication methods. A lot of work has been done to try to ensure that both offices have the latest version of the game at any given time. That is not always easy when your bandwidth is limited and games are getting bigger and bigger.
We’ve been tracking NVIDIA’s G-Sync for quite a while now. The comments section on Ryan’s initial article erupted with questions, and many of those were answered in a follow-on interview with NVIDIA’s Tom Petersen. The idea was radical – do away with the traditional fixed refresh rate and only send a new frame to the display when it has just completed rendering by the GPU. There are many benefits here, but the short version is that you get the low-latency benefit of V-SYNC OFF gaming combined with the image quality (lack of tearing) that you would see if V-SYNC was ON. Despite the many benefits, there are some potential disadvantages that come from attempting to drive an LCD panel at varying periods of time, as opposed to the fixed intervals that have been the norm for over a decade.
As the first round of samples came to us for review, the current leader appeared to be the ASUS ROG Swift. A G-Sync 144 Hz display at 1440P was sure to appeal to gamers who wanted faster response than the 4K 60 Hz G-Sync alternative was capable of. Due to what seemed to be large consumer demand, it has taken some time to get these panels into the hands of consumers. As our Storage Editor, I decided it was time to upgrade my home system, placed a pre-order, and waited with anticipation of finally being able to shift from my trusty Dell 3007WFP-HC to a large panel that can handle >2x the FPS.
Fast forward to last week. My pair of ROG Swifts arrived, and some other folks I knew had also received theirs. Before I could set mine up and get some quality gaming time in, my bro FifthDread and his wife both noted a very obvious flicker on their Swifts within the first few minutes of hooking them up. They reported the flicker during game loading screens and mid-game during background content loading occurring in some RTS titles. Prior to hearing from them, the most I had seen were some conflicting and contradictory reports on various forums (not limited to the Swift, though that is the earliest panel and would therefore see the majority of early reports), but now we had something more solid to go on. That night I fired up my own Swift and immediately got to doing what I do best – trying to break things. We have reproduced the issue and intend to demonstrate it in a measurable way, mostly to put some actual data out there to go along with those trying to describe something that is borderline perceptible for mere fractions of a second.
First a bit of misnomer correction / foundation laying:
- The ‘Screen refresh rate’ option you see in Windows Display Properties is actually a carryover from the CRT days. In terms of an LCD, it is the maximum rate at which a frame is output to the display. It is not representative of the frequency at which the LCD panel itself is refreshed by the display logic.
- LCD panel pixels are periodically updated by a scan, typically from top to bottom. Newer / higher quality panels repeat this process at a rate higher than 60 Hz in order to reduce the ‘rolling shutter’ effect seen when panning scenes or windows across the screen.
- In order to engineer faster responding pixels, manufacturers must deal with the side effect of faster pixel decay between refreshes. This is balanced by increasing the frequency of scanning out to the panel.
- The effect we are going to cover here has nothing to do with motion blur, LightBoost, backlight PWM, or LightBoost combined with G-Sync (not currently a thing; even though Blur Busters has theorized on how it could work, their method would not work with how G-Sync is actually implemented today).
With all of that out of the way, let’s tackle what folks out there may be seeing on their own variable refresh rate displays. Based on our testing so far, the flicker only presented at times when a game enters a 'stalled' state. These are periods where you would see a split-second freeze in the action, like during a background level load during game play in some titles. It also appears during some game level load screens, but as those are normally static scenes, they would have gone unnoticed on fixed refresh rate panels. Since we were absolutely able to see that something was happening, we wanted to be able to catch it in the act and measure it, so we rooted around the lab and put together some gear to do so. It’s not a perfect solution by any means, but we only needed to observe differences between the smooth gaming and the ‘stalled state’ where the flicker was readily observable. Once the solder dust settled, we fired up a game that we knew could instantaneously swing from a high FPS (144) to a stalled state (0 FPS) and back again. As it turns out, EVE Online does this exact thing while taking an in-game screen shot, so we used that for our initial testing. Here’s what the brightness of a small segment of the ROG Swift does during this very event:
Measured panel section brightness over time during a 'stall' event.
The relatively small ripple to the left and right of center demonstrates the panel output at just under 144 FPS. Panel redraw is in sync with the frames coming from the GPU at this rate. The center section, however, represents what takes place when the input from the GPU suddenly drops to zero. In the above case, the game briefly stalled, then resumed a few frames at 144, then stalled again for a much longer period of time. Completely stopping the panel refresh would result in all TN pixels bleeding towards white, so G-Sync has a built-in failsafe to prevent this by forcing a redraw every ~33 msec. What you are seeing are the pixels intermittently bleeding towards white and periodically being pulled back down to the appropriate brightness by a scan. The low latency panel used in the ROG Swift does this all of the time, but it is less noticeable at 144, as you can see on the left and right edges of the graph. An additional thing that’s happening here is an apparent rise in average brightness during the event. We are still researching the cause of this on our end, but this brightness increase certainly helps to draw attention to the flicker event, making it even more perceptible to those who might have not otherwise noticed it.
Some of you might be wondering why this same effect is not seen when a game drops to 30 FPS (or even lower) during the course of normal game play. While the original G-Sync upgrade kit implementation simply waited until 33 msec had passed before forcing an additional redraw, this introduced judder from 25-30 FPS. Based on our observations and testing, it appears that NVIDIA has corrected this in the retail G-Sync panels with an algorithm that intelligently re-scans at even multiples of the input frame rate in order to keep the redraw rate relatively high, therefore keeping flicker imperceptible, even at very low continuous frame rates.
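NVIDIA has not published how the retail modules pick their re-scan rate, so the following is only a guess at the general shape of such a scheme. It assumes the module draws each frame a whole number of times, choosing the smallest count that keeps the gap between scans at or under the ~33 msec failsafe limit mentioned above. The function names and the "smallest count" rule are assumptions made for illustration. The second helper models the simpler stall case, where no new frame arrives and only the failsafe fires.

```python
import math

MAX_HOLD_MS = 33.0  # from the text: a redraw is forced after ~33 msec


def rescans_per_frame(frame_ms: float) -> int:
    """Smallest whole number of scans per input frame that keeps the
    interval between scans at or below MAX_HOLD_MS (assumed rule)."""
    if frame_ms <= 0:
        raise ValueError("frame_ms must be positive")
    return max(1, math.ceil(frame_ms / MAX_HOLD_MS))


def scan_interval_ms(frame_ms: float) -> float:
    """Time between panel scans when each frame is drawn n times."""
    return frame_ms / rescans_per_frame(frame_ms)


def forced_redraws_during_stall(stall_ms: float) -> int:
    """Failsafe redraws that fire while no new frame arrives at all."""
    return int(stall_ms // MAX_HOLD_MS)


if __name__ == "__main__":
    for fps in (60, 30, 25, 10):
        frame_ms = 1000.0 / fps
        n = rescans_per_frame(frame_ms)
        print(f"{fps:>3} FPS: {n} scan(s) per frame, "
              f"one scan every {scan_interval_ms(frame_ms):.1f} ms")
    print("failsafe redraws in a 500 ms stall:",
          forced_redraws_during_stall(500.0))
```

Under this rule, a steady 25 FPS input would be scanned twice per frame, so the panel never waits for the failsafe. A stall is different: the module cannot know when the next frame will arrive, so it can only fall back to the slow failsafe cadence. That is consistent with the flicker appearing in stalls but not at low continuous frame rates.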
A few final points before we go:
- This is not limited to the ROG Swift. All variable refresh panels we have tested (including 4K) see this effect to a greater or lesser degree than reported here. Again, this only occurs when games instantaneously drop to 0 FPS, and not when those games dip into low frame rates in a continuous fashion.
- The effect is less perceptible (both visually and with recorded data) at lower maximum refresh rate settings.
- The effect is not present at fixed refresh rates (G-Sync disabled or with non G-Sync panels).
This post was primarily meant as a status update and to serve as something for G-Sync users to point to when attempting to explain the flicker they are perceiving. We will continue researching, collecting data, and coordinating with NVIDIA on this issue, and will report back once we have more to discuss.
During the research and drafting of this piece, we reached out to and worked with NVIDIA to discuss this issue. Here is their statement:
"All LCD pixel values relax after refreshing. As a result, the brightness value that is set during the LCD’s scanline update slowly relaxes until the next refresh.
This means all LCDs have some slight variation in brightness. In this case, lower frequency refreshes will appear slightly brighter than high frequency refreshes by 1 – 2%.
When games are running normally (i.e., not waiting at a load screen, nor a screen capture) - users will never see this slight variation in brightness value. In the rare cases where frame rates can plummet to very low levels, there is a very slight brightness variation (barely perceptible to the human eye), which disappears when normal operation resumes."
So there you have it. It's basically down to the physics of how an LCD panel works at varying refresh rates. While I agree that it is a rare occurrence, there are some games that present this scenario more frequently (and noticeably) than others. If you've noticed this effect in some games more than others, let us know in the comments section below.
(Editor's Note: We are continuing to work with NVIDIA on this issue and hope to find a way to alleviate the flickering with either a hardware or software change in the future.)
It has become increasingly apparent that flash memory die shrinks have hit a bit of a brick wall in recent years. The issues faced by the standard 2D Planar NAND process were apparent very early on. This was no real secret - here's a slide seen at the 2009 Flash Memory Summit:
Despite this, most flash manufacturers pushed the envelope as far as they could within the limits of 2D process technology, balancing shrinks with reliability and performance. One of the largest flash manufacturers was Intel, having joined forces with Micron in a joint venture dubbed IMFT (Intel Micron Flash Technologies). Intel remained in lock-step with Micron all the way up to 20nm, but chose to hold back at the 16nm step, presumably in order to shift full focus towards alternative flash technologies. This was essentially confirmed late last week, with Intel's announcement of a shift to 3D NAND production.
Intel's press briefing seemed to focus more on cost efficiency than performance, and after reviewing the very few specs they released about this new flash, I believe we can do some theorizing as to the potential performance of this new flash memory. From the above illustration, you can see that Intel has chosen to go with the same sort of 3D technology used by Samsung - a 32 layer vertical stack of flash cells. This requires the use of an older / larger process technology, as it is too difficult to etch these holes at a 2x nm size. What keeps the die size reasonable is the fact that you get a 32x increase in bit density. Going off of a rough approximation from the above photo, imagine that 50nm die (8 Gbit), but with 32 vertical NAND layers. That would yield a 256 Gbit (32 GB) die within roughly the same footprint.
Representation of Samsung's 3D VNAND in 128Gbit and 86 Gbit variants.
20nm planar (2D) = yellow square, 16nm planar (2D) = blue square.
Image republished with permission from Schiltron Corporation.
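The density estimate above is simple multiplication, but the Gbit-to-GB conversion is easy to slip on, so here it is spelled out. The 8 Gbit planar die and the 32 layers are the article's rough approximations. The function name is a made-up helper for this example.

```python
BITS_PER_BYTE = 8


def stacked_die_capacity_gbit(planar_gbit: int, layers: int) -> int:
    """Capacity of a die that stacks `layers` copies of a planar array
    in roughly the same footprint."""
    return planar_gbit * layers


if __name__ == "__main__":
    capacity_gbit = stacked_die_capacity_gbit(planar_gbit=8, layers=32)
    capacity_gbyte = capacity_gbit // BITS_PER_BYTE
    print(f"{capacity_gbit} Gbit = {capacity_gbyte} GB per die")
```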
It's likely a safe bet that IMFT flash will be going for a cost/GB far cheaper than the competing Samsung VNAND, and going with a relatively large 256 Gbit (vs. VNAND's 86 Gbit) per-die capacity is a smart move there, but let's not forget that there is a catch - write speed. Most NAND is very fast on reads, but limited on writes. Shifting from 2D to 3D NAND netted Samsung a 2x speed boost per die, and another effective 1.5x speed boost due to their choice to reduce per-die capacity from 128 Gbit to 86 Gbit. This effective speed boost came from the fact that a given VNAND SSD has 50% more dies to reach the same capacity as an SSD using 128 Gbit dies.
Now let's examine how Intel's choice of a 256 Gbit die impacts performance:
- Intel SSD 730 240GB = 16x128 Gbit 20nm dies
  - 270 MB/sec writes and ~17 MB/sec/die
- Crucial MX100 128GB = 8x128 Gbit 16nm dies
  - 150 MB/sec writes and ~19 MB/sec/die
- Samsung 850 Pro 128GB = 12x86 Gbit VNAND dies
  - 470 MB/sec writes and ~40 MB/sec/die
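The per-die figures in the list above are just each drive's rated write speed divided by its die count. A small sketch makes the comparison reproducible. The drive data comes from the list, while the `per_die_write_mb_s` helper is a name invented for this example.

```python
# (drive, number of NAND dies, rated sequential write speed in MB/sec)
DRIVES = [
    ("Intel SSD 730 240GB", 16, 270),
    ("Crucial MX100 128GB", 8, 150),
    ("Samsung 850 Pro 128GB", 12, 470),
]


def per_die_write_mb_s(total_mb_s: float, dies: int) -> float:
    """Average write throughput contributed by each die."""
    if dies <= 0:
        raise ValueError("dies must be positive")
    return total_mb_s / dies


if __name__ == "__main__":
    for name, dies, total in DRIVES:
        print(f"{name}: {per_die_write_mb_s(total, dies):.1f} MB/sec/die")
```

This treats write speed as scaling linearly with die count, which is the same simplification the comparison itself relies on. Controller limits and caching tricks are ignored.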
If we do some extrapolation based on the assumption that IMFT's move to 3D will net the same ~2x write speed improvement seen by Samsung, combined with their die capacity choice of 256Gbit, we get this:
- Future IMFT 128GB SSD = 4x256Gbit 3D dies
- 40 MB/sec/die x 4 dies = 160MB/sec
Even rounding up to 40 MB/sec/die, we can see that also doubling the die capacity effectively negates the performance improvement. While the IMFT flash equipped SSD will very likely be a lower cost product, it will (theoretically) see the same write speed limits seen in today's SSDs equipped with IMFT planar NAND. Now let's go one layer deeper on theoretical products and assume that Intel took the 18-channel NVMe controller from their P3700 Series and adapted it to a consumer PCIe SSD using this new 3D NAND. The larger die size limits the minimum capacity you can attain and still fully utilize their 18 channel controller, so with one die per channel, you end up with this product:
- Theoretical 18 channel IMFT PCIE 3D NAND SSD = 18x256Gbit 3D dies
- 40 MB/sec/die x 18 dies = 720 MB/sec
- 18x32GB (die capacity) = 576GB total capacity
Overprovisioning decisions aside, the above would be the lowest capacity product that could fully utilize the Intel PCIe controller. While the write performance is on the low side by PCIe SSD standards, the cost of such a product could easily be in the $0.50/GB range, or even less.
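The two hypothetical products above follow from the same two inputs: a 256 Gbit die and an assumed ~40 MB/sec of write speed per die. The sketch below reproduces that back-of-the-envelope math in Python. The `project` function and constant names are my own, the 40 MB/sec/die figure is the article's extrapolation rather than measured data, and overprovisioning is ignored.

```python
# Projected capacity and write speed for hypothetical IMFT 3D NAND drives.
# Both constants are assumptions taken from the extrapolation in the text.
DIE_GBIT = 256            # per-die capacity in gigabits
PER_DIE_WRITE_MB_S = 40   # assumed write speed per die


def project(dies, die_gbit=DIE_GBIT, per_die_mb_s=PER_DIE_WRITE_MB_S):
    """Return (raw capacity in GB, aggregate write speed in MB/sec)."""
    capacity_gb = dies * die_gbit // 8   # 8 gigabits per gigabyte
    write_mb_s = dies * per_die_mb_s     # assumes every die is written in parallel
    return capacity_gb, write_mb_s


consumer = project(4)    # 128GB-class drive built from four dies
pcie = project(18)       # one die per channel on an 18-channel controller

print(f"4 dies:  {consumer[0]}GB, {consumer[1]} MB/sec")
print(f"18 dies: {pcie[0]}GB, {pcie[1]} MB/sec")
```

Changing `dies` shows the trade-off directly: with large dies, write speed only scales up by adding capacity along with it.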
In summary, while we don't have any solid performance data, it appears that Intel's new 3D NAND is not likely to lead to a performance breakthrough in SSD speeds, but their choice of a more cost-effective per-die capacity for their new 3D NAND is likely to give them significant margins and the wiggle room to offer SSDs at a far lower cost/GB than we've seen in recent years. This may be the step that was needed to push SSD costs into a range that can truly compete with HDD technology.
Investigating the issue
** Edit ** (24 Sep)
We have updated this story with temperature effects on the read speed of old data. Additional info on page 3.
** End edit **
** Edit 2 ** (26 Sep)
New quote from Samsung:
"We acknowledge the recent issue associated with the Samsung 840 EVO SSDs and are qualifying a firmware update to address the issue. While this issue only affects a small subset of all 840 EVO users, we regret any inconvenience experienced by our customers. A firmware update that resolves the issue will be available on the Samsung SSD website soon. We appreciate our customer’s support and patience as we work diligently to resolve this issue."
** End edit 2 **
** Edit 3 **
The firmware update and performance restoration tool has been tested. Results are found here.
** End edit 3 **
Over the past week or two, there have been growing rumblings from owners of Samsung 840 and 840 EVO SSDs. A few reports scattered across internet forums gradually snowballed into lengthy threads as more and more people took a longer look at their own TLC-based Samsung SSD's performance. I've spent the past week following these threads, and the past few days evaluating this issue on the 840 and 840 EVO samples we have here at PC Perspective. This post is meant to inform you of our current 'best guess' as to just what is happening with these drives, and just what you should do about it.
The issue at hand is an apparent slowdown in the reading of 'stale' data on TLC-based Samsung SSDs. Allow me to demonstrate:
You might have seen what looks like similar issues before, but after much research and testing, I can say with some confidence that this is a completely different and unique issue. The old X25-M bug was the result of random writes to the drive over time, but the above result is from a drive that only ever saw a single large file write to a clean drive. The above drive was the very same 500GB 840 EVO sample used in our prior review. It did just fine in that review, and afterwards I needed a quick temporary place to put a HDD image file and just happened to grab that EVO. The file was written to the drive in December of 2013, and if it wasn't already apparent from the above HDTach pass, it was 442GB in size. This brings up some questions:
- If random writes (i.e. flash fragmentation) are not causing the slowdown, then what is?
- How long does it take for this slowdown to manifest after a file is written?
Since the introduction of the Haswell line of CPUs, the Internet has been aflame with how hot the CPUs run. Speculation ran rampant on the cause with theories abounding about the lesser surface area and inferior thermal interface material (TIM) in between the CPU die surface and the underside of the CPU heat spreader. It was later confirmed that Intel had changed the TIM interfacing the CPU die surface to the heat spreader with Haswell, leading to the hotter than expected CPU temperatures. This increase in temperature led to inconsistent core-to-core temperatures as well as vastly inferior overclockability of the Haswell K-series chips over previous generations.
A few of the more adventurous enthusiasts took it upon themselves to use inventive ways to address the heat concerns surrounding the Haswell by delidding the processor. The delidding procedure involves physically removing the heat spreader from the CPU, exposing the CPU die. Some individuals choose to clean the existing TIM from the core die and heat spreader underside, applying superior TIM such as metal or diamond-infused paste or even the Coollaboratory Liquid Ultra metal material and fixing the heat spreader back in place. Others choose a more radical solution, removing the heat spreader from the equation entirely for direct cooling of the naked CPU die. This type of cooling method requires use of a die support plate, such as the MSI Die Guard included with the MSI Z97 XPower motherboard.
Whichever outcome you choose, you must first remove the heat spreader from the CPU's PCB. The heat spreader itself is fixed in place with black RTV-type material ensuring a secure and air-tight seal, protecting the fragile die from outside contaminants and influences. Removal can be done in multiple ways with two of the most popular being the razor blade method and the vise method. With both methods, you are attempting to separate the CPU PCB from the heat spreader without damaging the CPU die or components on the top or bottom sides of the CPU PCB.
When Magma Freezes Over...
Intel confirms that they have approached AMD about access to their Mantle API. The discussion, despite being clearly labeled as "an experiment" by an Intel spokesperson, was initiated by them -- not AMD. According to AMD's Gaming Scientist, Richard Huddy, via PCWorld, AMD's response was, "Give us a month or two" and "we'll go into the 1.0 phase sometime this year" which only has about five months left in it. When the API reaches 1.0, anyone who wants to participate (including hardware vendors) will be granted access.
AMD inside Intel Inside???
I do wonder why Intel would care, though. Intel has the fastest per-thread processors, and their GPUs are not known to be workhorses that are held back by API call bottlenecks, either. Of course, that is not to say that I cannot see any reason, however...
The AMD Argument
Earlier this week, a story was posted in a Forbes.com blog that dove into the idea of NVIDIA GameWorks and how it was doing a disservice not just to the latest Ubisoft title Watch_Dogs but to PC gamers in general. Using quotes from AMD directly, the author claims that NVIDIA is actively engaging in methods to prevent game developers from optimizing games for AMD graphics hardware. This is an incredibly bold statement and one that I hope AMD is not making lightly. Here is a quote from the story:
Gameworks represents a clear and present threat to gamers by deliberately crippling performance on AMD products (40% of the market) to widen the margin in favor of NVIDIA products. . . . Participation in the Gameworks program often precludes the developer from accepting AMD suggestions that would improve performance directly in the game code—the most desirable form of optimization.
The example cited in the Forbes story is the recently released Watch_Dogs title, which appears to show favoritism towards NVIDIA GPUs, with the performance of the GTX 770 ($369) coming close to the performance of a Radeon R9 290X ($549).
It's evident that Watch Dogs is optimized for Nvidia hardware but it's staggering just how un-optimized it is on AMD hardware.
Watch_Dogs is the latest GameWorks title released this week.
I decided to get in touch with AMD directly to see exactly what stance the company was attempting to take with these kinds of claims. No surprise, AMD was just as forward with me as they appeared to be in the Forbes story originally.
The AMD Stance
Central to AMD’s latest annoyance with the competition is the NVIDIA GameWorks program. First unveiled last October during a press event in Montreal, GameWorks combines several NVIDIA built engine functions into libraries that can be utilized and accessed by game developers to build advanced features into games. NVIDIA’s website claims that GameWorks is “easy to integrate into games” while also including tutorials and tools to help quickly generate content with the software set. Included in the GameWorks suite are tools like VisualFX which offers rendering solutions like HBAO+, TXAA, Depth of Field, FaceWorks, HairWorks and more. Physics tools include the obvious like PhysX while also adding clothing, destruction, particles and more.
Taking it all the way to 12!
Microsoft has been developing DirectX for around 20 years now. Back in the 90s, the hardware and software scene for gaming was chaotic, at best. We had wonderful things like “SoundBlaster compatibility” and 3rd party graphics APIs such as Glide, S3G, PowerSGL, RRedline, and ATICIF. OpenGL was aimed more towards professional applications and it took John Carmack and iD, through GLQuake in 1996, to start the ball moving in that particular direction. There was a distinct need to provide standards across audio and 3D graphics that would de-fragment the industry and developers. DirectX was introduced with Windows 95, but the popularity of Direct3D did not really take off until DirectX 3.0 that was released in late 1996.
DirectX has had some notable successes, and some notable let downs, over the years. DX6 provided a much needed boost in 3D graphics, while DX8 introduced the world to programmable shading. DX9 was the most long-lived version, thanks to it being the basis for the Xbox 360 console with its extended lifespan. DX11 added in a bunch of features and made programming much simpler, all the while improving performance over DX10. The low points? DX10 was pretty dismal due to the performance penalty on hardware that supported some of the advanced rendering techniques. DirectX 7 was around a little more than a year before giving way to DX8. DX1 and DX2? Yeah, those were very unpopular and problematic, due to the myriad changes in a modern operating system (Win95) as compared to the DOS based world that game devs were used to.
Some four years ago, going by what NVIDIA has said, talks began on the development of DirectX 12. DX11 was released in 2009 and has been an excellent foundation for PC games. It is not perfect, though. There is still a significant impact on potential performance due to a variety of factors, including a fairly inefficient hardware abstraction layer that relies more upon fast single threaded performance from a CPU rather than leveraging the power of a modern multi-core/multi-thread unit. This limits how many objects can be represented on screen, as well as other operations that would bottleneck even the fastest CPU threads.
Introduction and Background
Back in 2010, Intel threw a bit of a press thing for a short list of analysts and reviewers out at their IMFT flash memory plant at Lehi, Utah. The theme and message of that event was to announce 25nm flash entering mass production. A few years have passed, and 25nm flash is fairly ubiquitous, with 20nm rapidly gaining as IMFT scales production even higher with the smaller process. Last week, Intel threw a similar event, but instead of showing off a die shrink or even announcing a new enthusiast SSD, they chose to take a step back and brief us on the various design, engineering, and validation testing of their flash storage product lines.
At the Lehi event, I did my best to make off with a 25nm wafer.
Many topics were covered at this new event at the Intel campus at Folsom, CA, and over the coming weeks we will be filling you in on many of them as we take the necessary time to digest the fire hose of intel (pun intended) that we received. Today I'm going to lay out one of the more impressive things I saw at the briefings, and that is the process Intel goes through to ensure their products are among the most solid and reliable in the industry.
It wouldn’t be February if we didn’t hear the Q4 FY14 earnings from NVIDIA! NVIDIA does have a slightly odd way of expressing their quarters, but in the end it is all semantics. They are not in fact living in the future, but I bet their product managers wish they could peer into the actual Q4 2014. No, the whole FY14 thing relates back to when they made their IPO and how they started reporting. To us mere mortals, Q4 FY14 actually represents Q4 2013. Clear as mud? Lord love the Securities and Exchange Commission and their rules.
The past quarter was a pretty good one for NVIDIA. They came away with $1.144 billion in gross revenue and had a GAAP net income of $147 million. This beat the Street’s estimate by a pretty large margin. As a response, trading of NVIDIA’s stock has gone up in after hours. This has certainly been a trying year for NVIDIA and the PC market in general, but they seem to have come out on top.
NVIDIA beat estimates primarily on the strength of the PC graphics division. Many were focusing on the apparent decline of the PC market and assumed that NVIDIA would be dragged down by lower shipments. On the contrary, it seems as though the gaming market and add-in sales on the PC helped to solidify NVIDIA’s quarter. We can look at a number of factors that likely contributed to this uptick for NVIDIA.
Why is there a bourbon review on a PC-centric website?
We can’t live, eat, and breathe PC technology all the time. All of us have outside interests that may not intersect with the PC and mobile market. I think we would be pretty boring people if that were the case. Yes, our professional careers are centered in this area, but our personal lives do diverge from the PC world. You certainly can’t drink a GPU, though I’m sure somebody out there has tried.
The bottle is unique to Wyoming Whiskey. The bourbon has a warm, amber glow about it as well. Picture courtesy of Wyoming Whiskey
Many years ago I became a beer enthusiast. I loved to sample different concoctions, I would brew my own, and I settled on some personal favorites throughout the years. Living in Wyoming is not necessarily conducive to sampling many different styles and types of beers, and so I was in a bit of a rut. A few years back a friend of mine bought me a bottle of Tomatin 12 year single malt scotch, and I figured this would be an interesting avenue to move down since I had tapped out my selection of new and interesting beers (Wyoming has terrible beer distribution).
The Really Good Times are Over
We really do not realize how good we had it. Sure, we could apply that to budget surpluses and the time before the rise of global terrorism, but in this case I am talking about the predictable advancement of graphics due to both design expertise and improvements in process technology. Moore’s law has been exceptionally kind to graphics. We can look back and when we plot the course of these graphics companies, they have actually outstripped Moore in terms of transistor density from generation to generation. Most of this is due to better tools and the expertise gained in what is still a fairly new endeavor as compared to CPUs (the first true 3D accelerators were released in the 1993/94 timeframe).
The complexity of a modern 3D chip is truly mind-boggling. To get a good idea of where we came from, we must look back at the first generations of products that we could actually purchase. The original 3Dfx Voodoo Graphics was comprised of a raster chip and a texture chip, each containing approximately 1 million transistors (give or take) and made on a then available .5 micron process (we shall call it 500 nm from here on out to give a sense of perspective with modern process technology). The chips were clocked between 47 and 50 MHz (though often could be clocked up to 57 MHz by going into the init file and putting in “SET SST_GRXCLK=57”… btw, SST stood for Sellers/Smith/Tarolli, the founders of 3Dfx). This graphics card, revolutionary at the time, could push out 47 to 50 megapixels per second, had 4 MB of VRAM, and was released in the beginning of 1996.
My first 3D graphics card was the Orchid Righteous 3D. Voodoo Graphics was really the first successful consumer 3D graphics card. Yes, there were others before it, but Voodoo Graphics had the largest impact of them all.
In 1998 3Dfx released the Voodoo 2, and it was a significant jump in complexity from the original. These chips were fabricated on a 350 nm process. There were three chips to each card, one of which was the raster chip and the other two were texture chips. At the top end of the product stack were the 12 MB cards. The raster chip had 4 MB of VRAM available to it while each texture chip had 4 MB of VRAM for texture storage. Not only did this product double performance from the Voodoo Graphics, it was able to run in single card configurations at 800x600 (as compared to the max 640x480 of the Voodoo Graphics). This was around the same time that NVIDIA started to become a very aggressive competitor with the Riva TnT and ATI was about to ship the Rage 128.
A new generation of Software Rendering Engines.
We have been busy with side projects, here at PC Perspective, over the last year. Ryan has nearly broken his back rating the frames. Ken, along with running the video equipment and "getting an education", developed a hardware switching device for Wirecast and XSplit.
My project, "Perpetual Motion Engine", has been researching and developing a GPU-accelerated software rendering engine. Now, to be clear, this is just in very early development for the moment. The point is not to draw beautiful scenes. Not yet. The point is to show what OpenGL and DirectX do and what limits are removed when you do the math directly.
Errata: BioShock uses a modified Unreal Engine 2.5, not 3.
In the above video:
- I show the problems with graphics APIs such as DirectX and OpenGL.
- I talk about what those APIs attempt to solve, finding color values for your monitor.
- I discuss the advantages of boiling graphics problems down to general mathematics.
- Finally, I prove the advantages of boiling graphics problems down to general mathematics.
I would recommend watching the video, first, before moving forward with the rest of the editorial. A few parts need to be seen for better understanding.
If Microsoft was left to their own devices...
Microsoft's Financial Analyst Meeting 2013 set the stage, literally, for Steve Ballmer's last annual keynote to investors. The speech promoted Microsoft, its potential, and its unique position in the industry. He proclaims, firmly, their desire to be a devices and services company.
The explanation, however, does not befit either industry.
Ballmer noted, early in the keynote, how Bing is the only notable competitor to Google Search. He wanted to make it clear, to investors, that Microsoft needs to remain in the search business to challenge Google. The implication is that Microsoft can fill the cracks where Google does not, or even cannot, and establish a business from that foothold. I agree. Proprietary products (which are not inherently bad by the way), as Google Search is, require one or more rivals to fill the overlooked or under-served niches. A legitimate business can be established from that basis.
It is the following, similar, statement which troubles me.
Ballmer later mentioned, in the same vein, how Microsoft is among the few making fundamental operating system investments. Like search, the implication is that operating systems are proprietary products which must compete against one another. This, albeit subtly, does not match their vision as a devices and services company. The point of a proprietary platform is to own the ecosystem, from end to end, and to derive your value from that control. The product is not a device; the product is not a service; the product is a platform. This makes sense to them because, from birth, they were a company which sold platforms.
A platform as a product is not a device nor is it a service.
Retiring the Workhorses
There is an inevitable shift coming. Honestly, this has been quite obvious for some time, but it has just taken AMD a bit longer to get here than many have expected. Some years back we saw AMD release their new motto, “The Future is Fusion”. While many thought it somewhat interesting and trite, it actually foreshadowed the massive shift from monolithic CPU cores to their APUs. Right now AMD’s APUs are doing “ok” in desktops and are gaining traction in mobile applications. What most people do not realize is that AMD will be going all APU all the time in the very near future.
We can look over the past few years and see that AMD has been headed in this direction for some time, but they simply have not had all the materials in place to make this dramatic shift. To get a better understanding of where AMD is heading, how they plan to address multiple markets, and what kind of pressures they are under, we have to look at the two major non-APU markets that AMD is currently hanging onto by a thread. In some ways, timing has been against AMD, not to mention available process technologies.