It always feels a little odd when covering NVIDIA’s quarterly earnings due to how they present their financial calendar. No, we are not reporting from the future. Yes, it can be confusing when comparing results and getting your dates mixed up. Regardless of the date attached to the earnings, NVIDIA did exceptionally well in a quarter that is typically the second weakest after Q1.
NVIDIA reported revenue of $1.43 billion. This is a jump from an already strong Q1 where they took in $1.30 billion. Compare this to the $1.027 billion of its competitor AMD, which provides CPUs as well as GPUs. NVIDIA sold a lot of GPUs as well as other products. Their primary money makers were the consumer space GPUs and the professional and compute markets, where they have a virtual stranglehold at the moment. The company’s GAAP net income is a very respectable $253 million.
The release of the latest Pascal based GPUs was the primary driver of the gains for this latest quarter. AMD has had a hard time competing with NVIDIA for market share. The older Maxwell based chips performed well against the entire line of AMD offerings and typically did so with better power and heat characteristics. Even though the GTX 970 was somewhat limited in its memory configuration as compared to the AMD products (3.5 GB + 0.5 GB vs. a full 4 GB implementation), it was a top seller in its class. The same could be said for the products up and down the stack.
Pascal was released at the end of May, but the company had been shipping chips to its partners as well as creating the “Founders Edition” models to its exacting specifications. These were strong sellers from the end of May through the end of the quarter. NVIDIA recently unveiled their latest Pascal based Quadro cards, but we do not know how much of an impact those have had on this quarter. NVIDIA has also been shipping, in very limited quantities, the Tesla P100 based units to select customers and outfits.
Subject: General Tech | August 17, 2016 - 12:41 PM | Jeremy Hellstrom
Tagged: nvidia, Intel, HPC, Xeon Phi, maxwell, pascal, dirty pool
There is a spat going on between Intel and NVIDIA over the slide below, as you can read about over at Ars Technica. It seems that Intel have reached into the industry's bag of dirty tricks and dusted off an old standby: testing new hardware and software against older products from their competitors. In this case it was high performance computing products which were tested, Intel's new Xeon Phi against NVIDIA's Maxwell, on an older version of the Caffe AlexNet benchmark.
NVIDIA points out that not only would they have done better than Intel if an up to date version of the benchmarking software had been used, but that the comparison should have been against their current architecture, Pascal. This is not quite as bad as putting undocumented flags into compilers to reduce the performance of competitors' chips, or predatory discount programs, but it shows that the computer industry continues to have only a passing acquaintance with fair play and honest competition.
"At this juncture I should point out that juicing benchmarks is, rather sadly, par for the course. Whenever a chip maker provides its own performance figures, they are almost always tailored to the strength of a specific chip—or alternatively, structured in such a way as to exacerbate the weakness of a competitor's product."
Here is some more Tech News from around the web:
- USB Implementers Forum introduces branding for safe USB-C charging @ The Inquirer
- Some Windows 10 Anniversary Update: SSD freeze @ The Register
- Intel Project Alloy: all-in-one VR headset takes aim at Google's Project Daydream @ The Inquirer
- Wanna build your own drone? Intel emits Linux-powered x86 brains for DIY flying gizmos @ The Register
- Intel's Optane XPoint DIMMs pushed back – source @ The Register
Realworldtech with Compelling Evidence
Yesterday David Kanter of Realworldtech posted a pretty fascinating article and video that explored the two latest NVIDIA architectures and how they have branched away from the traditional immediate mode rasterization units. It revealed through testing that, with Maxwell and Pascal, NVIDIA has gone to a tiling method for rasterization. This is a somewhat significant departure for the company considering they have utilized the same basic immediate mode rasterization model since the 90s.
The Videologic Apocalypse 3Dx based on the PowerVR PCX2.
(photo courtesy of Wikipedia)
Tiling is an interesting subject, and we can harken back to the PowerVR days to see where it was first implemented. There are many advantages to tiling and deferred rendering when it comes to overall efficiency in power and memory bandwidth. These first TBDRs (Tile Based Deferred Renderers) offered great performance per clock and could utilize slower memory as compared to other offerings of the day (namely Voodoo Graphics). There were some significant drawbacks to the technology. Essentially a lot of work had to be done by the CPU and driver in scene setup and geometry sorting. On systems with fast CPUs the PowerVR boards could provide very good performance, but they suffered on lower end CPUs as compared to the competition. This is a very simple explanation of what is going on, but the long and short of it is that TBDR did not take over the world due to limitations in its initial implementations. Traditional immediate mode rasterizers would improve in efficiency and performance with aggressive Z checks and other optimizations that borrow from the TBDR playbook.
Tiling is also present in a lot of mobile parts. Imagination’s PowerVR graphics technologies have been implemented by companies such as Intel, Apple, and Mediatek. Qualcomm (Adreno) and ARM (Mali) both implement tiler technologies to improve power consumption and performance while increasing bandwidth efficiency. Perhaps most interestingly we can remember back to the Gigapixel days with the GP-1 chip that implemented a tiling method that seemed to work very well without the CPU hit and driver overhead that had plagued the PowerVR chips up to that point. 3dfx bought Gigapixel for some $150 million at the time. 3dfx then went on to file for bankruptcy a year later, and its IP was acquired by NVIDIA.
Screenshot of the program used to uncover the tiling behavior of the rasterizer.
It now appears as though NVIDIA has evolved their raster units to embrace tiling. This is not a full TBDR implementation, but rather an immediate mode tiler that still breaks up the scene into tiles but does not implement deferred rendering. This change should improve bandwidth efficiency when it comes to rasterization, but it does not affect the rest of the graphics pipeline by forcing it to be deferred (tessellation, geometry setup, shaders, etc. are not impacted). NVIDIA has not done a deep dive on this change for editors, so we do not know the exact implementation and what advantages we can expect. We can look at the evidence we have and speculate where those advantages exist.
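To make the idea of "breaking up the scene into tiles" concrete, here is a minimal sketch of the binning step that tilers in general perform: each triangle is assigned to every screen tile its bounding box touches. This is a textbook-style illustration only. The 16-pixel tile size, the function name `bin_triangles`, and the conservative bounding-box test are all assumptions made for the example, and nothing here is known to reflect how NVIDIA's hardware actually works.

```python
from collections import defaultdict

TILE = 16  # tile edge in pixels; an arbitrary value chosen for illustration


def bin_triangles(triangles, width, height, tile=TILE):
    """Map each tile (tx, ty) to the indices of the triangles whose
    screen-space bounding box overlaps that tile (a conservative test)."""
    bins = defaultdict(list)
    max_tx = (width - 1) // tile
    max_ty = (height - 1) // tile
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the screen and convert it to tile coordinates.
        tx0 = max(0, int(min(xs)) // tile)
        tx1 = min(max_tx, int(max(xs)) // tile)
        ty0 = max(0, int(min(ys)) // tile)
        ty1 = min(max_ty, int(max(ys)) // tile)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


# Two hypothetical triangles on a 64x32 pixel screen.
triangles = [
    [(2, 2), (10, 3), (5, 12)],      # fits inside a single tile
    [(10, 10), (40, 12), (20, 30)],  # spans several tiles
]
bins = bin_triangles(triangles, width=64, height=32)
for tile_coord in sorted(bins):
    print(tile_coord, bins[tile_coord])
```

Once the triangles are binned, a tiler can work through one tile at a time and keep that tile's pixels in on-chip storage, which is where the bandwidth savings would come from.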
The video where David Kanter explains his findings
Bandwidth and Power
Tilers have typically taken the tiled regions and buffered them on the chip. This is a big improvement in both performance and power efficiency as the raster data does not have to be cached and written out to the frame buffer and then swapped back. This makes quite a bit of sense considering the overall lack of big jumps in memory technologies over the past five years. We have had GDDR-5 since 2007/2008. The speeds have increased over time, but the basic technology is still much the same. We have seen HBM introduced with AMD’s Fury series, but large scale production of HBM 2 is still to come. Samsung has released small amounts of HBM 2 to the market, but not nearly enough to handle the needs of a mass produced card. GDDR-5X is an extension of GDDR-5 that does offer more bandwidth, but it is still not a next generation memory technology like HBM 2.
By utilizing a tiler NVIDIA is able to lower memory bandwidth needs for the rasterization stage. Considering that both the Maxwell and Pascal architectures are based on GDDR-5 and GDDR-5X technologies, it makes sense to save as much bandwidth as possible where they can. This is probably one of the many reasons that we saw a much larger L2 cache in Maxwell vs. Kepler (2048 KB vs. 256 KB respectively). Every little bit helps when we are looking at hard, real world bandwidth limits for a modern GPU.
The area of power efficiency has also come up in discussion when going to a tiler. Tilers have traditionally been more power efficient as well due to how the raster data is tiled and cached, requiring fewer reads and writes to main memory. The first impulse is to say, “Hey, this is the reason why NVIDIA’s Maxwell was so much more power efficient than Kepler and AMD’s latest parts!” Sadly, this is not exactly true. The tiler is more power efficient, but it is a small part of the power savings on a GPU.
The second fastest Pascal based card...
A modern GPU is very complex. There are some 7.2 billion transistors on the latest Pascal GP104 that powers the GTX 1080. The vast majority of those transistors are implemented in the shader units of the chip. While the raster units are very important, they are but a fraction of that transistor budget. The rest is taken up by power regulation, PCI-E controllers, and memory controllers. In the big scheme of things the raster portion is going to be dwarfed in power consumption by the shader units. This does not mean that they are not important though. Going back to the hated car analogy, one does not achieve weight savings by focusing on one aspect alone. It is going over every single part of the car and shaving ounces here and there, and in the end achieving significant savings by addressing every single piece of a complex product.
This does appear to be the long and short of it. This is one piece of a very complex ASIC that improves upon memory bandwidth utilization and power efficiency. It is not the whole story, but it is an important part. I find it interesting that NVIDIA did not disclose this change to editors with the introduction of Maxwell and Pascal, but if it is transparent to users and developers alike then there is no need. There is a lot of “secret sauce” that goes into each architecture, and this is merely one aspect. The one question that I do have is how much of the technology is based upon the Gigapixel IP that 3dfx bought at such a premium. I believe that particular tiler was an immediate mode renderer as well, since it did not have as many driver and overhead issues as PowerVR exhibited back in the day. Obviously it would not be a copy/paste of the technology that was developed back in the 90s, but it would be interesting to see if it was the basis for this current implementation.
NVIDIA Offers Preliminary Settlement To GeForce GTX 970 Buyers In False Advertising Class Action Lawsuit
Subject: Graphics Cards | July 28, 2016 - 07:07 PM | Tim Verry
Tagged: nvidia, maxwell, GTX 970, GM204, 3.5gb memory
A recent post on Top Class Actions suggests that buyers of NVIDIA GTX 970 graphics cards may soon see a payout from a settlement agreement as part of the series of class action lawsuits facing NVIDIA over claims of false advertising. NVIDIA has reportedly offered up a preliminary settlement of $30 to "all consumers who purchased the GTX 970 graphics card" with no cap on the total payout amount along with a whopping $1.3 million in attorney's fees.
This settlement offer is in response to several class action lawsuits that consumers filed against the graphics giant following the controversy over mis-advertised specifications (particularly the number of ROP units and amount of L2 cache) and the method in which NVIDIA's GM204 GPU addressed the four total gigabytes of graphics memory.
Specifically, the graphics card specifications initially indicated that it had 64 ROPs and 2048 KB of L2 cache, but it was later revealed to have only 56 ROPs and 1792 KB of L2. On the memory front, the "3.5 GB memory controversy" spawned many memes and investigations into how the 3.5 GB and 0.5 GB pools of memory worked and how performance, both real world and theoretical, was affected by the memory setup.
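As a quick sanity check on those numbers, the short Python sketch below compares the advertised and actual figures quoted above. The dictionary keys are names made up for this example, and treating the 3.5 GB pool as the "actual" memory figure is a simplification for the sake of the comparison.

```python
# Figures quoted in this post; the key names are hypothetical labels.
advertised = {"rops": 64, "l2_kb": 2048, "memory_gb": 4.0}
actual = {"rops": 56, "l2_kb": 1792, "memory_gb": 3.5}  # 3.5 GB = the larger pool

# Fraction of the advertised figure that was actually present.
ratios = {key: actual[key] / advertised[key] for key in advertised}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.3f} of the advertised value")
```

All three ratios come out to the same fraction, seven eighths, which fits the picture of one slice of the chip's back end being cut down rather than three unrelated discrepancies.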
(My opinions follow)
It was quite the PR disaster, and had NVIDIA been upfront with all the correct details on specifications and the new memory implementation, the controversy could have been avoided. As it is, though, buyers were not able to make informed decisions about the card, and at the end of the day that is what is important and why the lawsuits have merit.
As such, I do expect both sides to reach a settlement rather than see this come to a full trial, but it may not be exactly the $30 per buyer payout as that amount still needs to be approved by the courts to ensure that it is "fair and reasonable."
For more background on the GTX 970 memory issue (it has been a while since this all came about after all, so you may need a refresher):
- NVIDIA Discloses Full Memory Structure and Limitations of GTX 970
- NVIDIA Responds to GTX 970 3.5GB Memory Issue
- Frame Rating: GTX 970 Memory Issues Tested in SLI
- Frame Rating: Looking at GTX 970 Memory Performance
Subject: Graphics Cards, Systems, Mobile | July 27, 2016 - 07:58 PM | Scott Michaud
Tagged: nvidia, Nintendo, nintendo nx, tegra, Tegra X1, tegra x2, pascal, maxwell
Okay, so there are a few rumors going around, mostly from Eurogamer / DigitalFoundry, that claim the Nintendo NX is going to be powered by an NVIDIA Tegra system on a chip (SoC). DigitalFoundry, specifically, cites multiple sources who claim that their Nintendo NX development kits integrate the Tegra X1 design, as seen in the Google Pixel C. That said, the Nintendo NX release date, March 2017, does provide enough time for them to switch to NVIDIA's upcoming Pascal Tegra design, rumored to be called the Tegra X2, which uses NVIDIA's custom-designed Denver CPU cores.
Preamble aside, here's what I think about the whole situation.
First, the Tegra X1 would be quite a small jump in performance over the WiiU. The WiiU's GPU, “Latte”, has 320 shaders clocked at 550 MHz, and it was based on AMD's TeraScale 1 architecture. Because these stream processors have single-cycle multiply-add for floating point values, you can get its FLOP rating by multiplying 320 shaders, 550,000,000 cycles per second, and 2 operations per clock (one multiply and one add). This yields 352 GFLOPs. The Tegra X1 is rated at 512 GFLOPs, which is just 45% more than the previous generation.
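For anyone who wants to check the math, here is the same calculation as a few lines of Python. The function name `peak_gflops` is just a label chosen for this sketch, and the 512 GFLOPs figure for the Tegra X1 is taken from the rating quoted above rather than derived.

```python
def peak_gflops(shaders, clock_hz, ops_per_clock=2):
    """Theoretical peak: shaders x clock x operations per clock, in GFLOPs.
    ops_per_clock=2 counts one multiply and one add per cycle."""
    return shaders * clock_hz * ops_per_clock / 1e9


wii_u = peak_gflops(shaders=320, clock_hz=550_000_000)
tegra_x1 = 512.0  # rated figure quoted in the post

uplift_percent = (tegra_x1 / wii_u - 1) * 100

print(f"Wii U 'Latte': {wii_u:.0f} GFLOPs")
print(f"Tegra X1:      {tegra_x1:.0f} GFLOPs")
print(f"Uplift:        {uplift_percent:.0f}%")
```

Keep in mind that these are theoretical peaks, so they say nothing about how efficiently each architecture turns those FLOPs into frames.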
This is a very tiny jump, unless they indeed use Pascal-based graphics. If this is the case, you will likely see a launch selection of games ported from WiiU and a few games that use whatever new feature Nintendo has. One rumor is that the console will be kind-of like the WiiU controller, with detachable controllers. If this is true, it's a bit unclear how this will affect games in a revolutionary way, but we might be missing a key bit of info that ties it all together.
As for the choice of ARM over x86... well. First, this obviously allows Nintendo to choose from a wider selection of manufacturers than AMD, Intel, and VIA, and certainly more than IBM with their previous, Power-based chips. That said, it also jibes with Nintendo's interest in the mobile market. They joined The Khronos Group and I'm pretty sure they've said they are interested in Vulkan, which is becoming the high-end graphics API for Android, supported by Google and others. That said, I'm not sure how many engineers exist that specialize in ARM optimization, as most mobile platforms try to abstract this as much as possible, but this could be Nintendo's attempt to settle on a standardized instruction set, and they opted for mobile over PC (versus Sony and especially Microsoft, who want consoles to follow high-end gaming on the desktop).
Why? Well that would just be speculating on speculation about speculation. I'll stop here.
Subject: Graphics Cards | July 21, 2016 - 02:04 PM | Jeremy Hellstrom
Tagged: gtx 460, gtx 760, gtx 960, gtx 1060, fermi, kepler, maxwell, pascal
Phoronix took a look at how the Linux performance of NVIDIA's mid range cards has changed over the past four generations of GPU, from Fermi, through Kepler and Maxwell, to Pascal. CS:GO was run at 4K to push the newer GPUs, as was DOTA, much to the dismay of the GTX 460. The scaling is rather interesting: there is a very large delta between Fermi and Kepler, which comes close to being replicated when comparing Maxwell to Pascal. From the looks of the vast majority of the tests, the GTX 1060 will be a noticeable upgrade for Linux users no matter which previous mid range card they are currently using. We will likely see a similar article covering AMD in the near future.
"To complement yesterday's launch-day GeForce GTX 1060 Linux review, here are some more benchmark results with the various NVIDIA x60 graphics cards I have available for testing going back to the GeForce GTX 460 Fermi. If you are curious about the raw OpenGL/OpenCL/CUDA performance and performance-per-Watt for these mid-range x60 graphics cards from Fermi, Kepler, Maxwell, and Pascal, here are these benchmarks from Ubuntu 16.04 Linux."
Here are some more Graphics Card articles from around the web:
- ASUS ROG STRIX-GTX1070-O8G-GAMING: GTX 1070, Strix Style! @ Bjorn3d
- MSI GeForce GTX 1060 Gaming X Review @HiTech Legion
- EVGA GeForce GTX 1070 SC Gaming ACX 3.0 Review - Affordable Enthusiast Gaming @HiTech Legion
- Radeon RX 480 performance revisited with AMD's 16.7.1 driver @ The Tech Report
- AMD Radeon RX 480 8GB CrossFire @ [H]ard|OCP
Subject: Graphics Cards | July 16, 2016 - 06:37 PM | Scott Michaud
Tagged: Volta, pascal, nvidia, maxwell, 16nm
For the past few generations, NVIDIA has roughly been trying to release a new architecture with a new process node, and release a refresh the following year. This ran into a hitch as Maxwell was delayed a year, apart from the GTX 750 Ti, and then pushed back to the same 28nm process that Kepler utilized. Pascal caught up with 16nm, although we know that some hard, physical limitations are right around the corner. The lattice spacing for silicon at room temperature is around 0.5nm, so we're talking about features roughly the low 30s of atoms in width.
This rumor claims that NVIDIA is not trying to go with 10nm for Volta. Instead, it will take place on the same 16nm node that Pascal is currently occupying. This is quite interesting because GPUs scale quite well with added complexity, as they have many features running at a relatively low clock rate. Without a node shrink, the only real ways to increase performance are to make the existing architecture more efficient or to make a larger chip.
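The "low 30s" figure is simple division, sketched below in Python. It follows the post's own simplification of treating the node name as a literal feature width and the rounded 0.5nm lattice spacing as the size of one "atom"; the 28nm and 10nm rows are included only because those nodes come up in this post, and the function name is made up for the example.

```python
LATTICE_SPACING_NM = 0.5  # rounded value used in the post


def lattice_spacings_across(feature_nm, spacing_nm=LATTICE_SPACING_NM):
    """How many lattice spacings fit across a feature of the given width."""
    return feature_nm / spacing_nm


for node_nm in (28, 16, 10):
    count = lattice_spacings_across(node_nm)
    print(f"{node_nm}nm is roughly {count:.0f} lattice spacings wide")
```

Under the same assumptions, every further shrink takes a sizable bite out of an already small count, which is the "right around the corner" point.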
That said, GP100 leaves a lot of room on the table for an FP32-optimized, ~600mm2 part to crush its performance at the high end, similar to how GM200 replaced GK110. The rumored GP102, expected in the ~450mm2 range for Titan or GTX 1080 Ti-style parts, has some room to grow. Like GM200, however, it would also be unappealing to GPU compute users who need FP64. If this is what is going on, and we're totally just speculating at the moment, it would signal that enterprise customers should expect a new GPGPU card every second gaming generation.
That is, of course, unless NVIDIA recognized ways to make the Maxwell-based architecture significantly more die-space efficient in Volta. Clocks could get higher, or the circuits themselves could get simpler. You would think that, especially in the latter case, they would have integrated those ideas into Maxwell and Pascal, though; but, like HBM2 memory, there might have been a reason why they couldn't.
We'll need to wait and see. The entire rumor could be crap, who knows?
Subject: Graphics Cards | June 21, 2016 - 05:22 PM | Scott Michaud
Tagged: nvidia, fermi, kepler, maxwell, pascal, gf100, gf110, GK104, gk110, GM204, gm200, GP104
Techspot published an article that compared eight GPUs across six high-end dies in NVIDIA's last four architectures: Fermi to Pascal. Average frame rates were listed across nine games, each measured at three resolutions: 1366x768 (~720p HD), 1920x1080 (1080p FHD), and 2560x1600 (~1440p QHD).
The results are interesting. Comparing GP104 to GF100, mainstream Pascal is typically on the order of four times faster than big Fermi. Over that time, we've had three full generational leaps in fabrication technology, leading to over twice the number of transistors packed into a die that is almost half the size. It does, however, show that prices have remained relatively constant, except that the GTX 1080 is sort-of priced in the x80 Ti category despite the die size placing it in the non-Ti class. (They list the 1080 at $600, but you can't really find anything outside the $650-700 USD range).
It would be interesting to see this data set compared against AMD. It's informative for an NVIDIA-only article, though.
NVIDIA's Ansel Technology
“In-game photography” is an interesting concept. Not too long ago, it was difficult to just capture the user's direct experience with a title. Print screen could only hold a single screenshot at a time, which allowed Steam and FRAPS to provide a better user experience. FRAPS also made video more accessible to the end-user, but it output huge files and, while it wasn't too expensive, it needed to be purchased online, which was a big issue ten-or-so years ago.
Seeing that their audience would enjoy video captures, NVIDIA introduced ShadowPlay a couple of years ago. The feature allowed users not only to record video, but also to capture the last few minutes. It did this with hardware acceleration, and it did this for free (for compatible GPUs). While I don't use ShadowPlay, preferring the control of OBS, it's a good example of how NVIDIA wants to support their users. They see these features as a value-add, which draws people to their hardware.
Subject: Graphics Cards | May 10, 2016 - 07:50 PM | Scott Michaud
Tagged: nvidia, maxwell, GTX 980 Ti, GTX 970, GTX 1080, geforce
The GTX 1080 announcement is starting to ripple into retailers, leading to price cuts on the previous generation, Maxwell-based SKUs. If you were interested in the GTX 1080, or an AMD graphics card of course, then you probably want to keep waiting. That said, you can take advantage of the discounts to get a VR-ready GPU, or to pick up a cheap SLI buddy if you already have a Maxwell card.
This tip comes from a NeoGAF thread. Microcenter has several cards on sale, but EVGA seems to have the biggest price cuts. This 980 Ti has dropped from $750 USD down to $499.99 (or $474.99 if you'll promise yourself to do that mail-in rebate). That's a whole third of its price slashed, and puts it about a hundred dollars under the GTX 1080. Granted, it will also be slower than the GTX 1080, with 2GB less video RAM, but $100 might be worth that for you.
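If you want to verify the "whole third" claim, or compare other deals the same way, the arithmetic is a one-liner. The helper name `percent_off` is just something chosen for this sketch, and the prices are the ones quoted above.

```python
def percent_off(old_price, new_price):
    """Discount as a percentage of the original price."""
    return (old_price - new_price) / old_price * 100


original = 750.00
sale = 499.99
after_rebate = 474.99

print(f"Sale price:   {percent_off(original, sale):.1f}% off")
print(f"After rebate: {percent_off(original, after_rebate):.1f}% off")
```

The sale price alone works out to almost exactly one third off, and the mail-in rebate pushes it a few points further.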