Manufacturer: NVIDIA

NVIDIA's Rumored GP102

TL;DR:
Based on two rumors, NVIDIA seems to be planning a new GPU, called GP102, that would sit between GP100 and GP104. This would break from how their product stack has flowed since Fermi and Kepler. GP102's performance, in both single and double precision, will likely signal NVIDIA's product plans going forward.
  • GP100's ideal 1 : 2 : 4 FP64 : FP32 : FP16 ratio is inefficient for gaming
  • GP102 either extends GP104's gaming lead or bridges GP104 and GP100
  • If GP102 is a bigger GP104, the future is unclear for smaller GPGPU devs
    • That is, unless GP100 can be significantly up-clocked for gaming.
  • If GP102 matches (or outperforms) GP100 in gaming, and has better than 1 : 32 double-precision performance, then GP100 would be the first time that NVIDIA designed an enterprise-only, high-end GPU.

First, Some Background

When GP100 was announced, Josh and I discussed, internally, how it would make sense in the gaming industry. Recently, an article on WCCFTech cited anonymous sources, which should always be taken with a dash of salt, claiming that NVIDIA was planning a second chip, GP102, between GP104 and GP100. As I was writing this editorial, relating the rumor to our own speculation about the physics of Pascal, VideoCardz claimed to have been contacted by the developers of AIDA64, seemingly on the record, also citing a GP102 design.

I will retell chunks of the rumor, but also add my opinion to it.


In the last few generations, each architecture had a flagship chip that was released in both gaming and professional SKUs. Neither audience had access to a chip that was larger than the other's largest of that generation. Clock rates and disabled portions varied by specific product, with gaming usually getting the more aggressive tuning for slightly better benchmarks. Fermi had GF100/GF110, Kepler had GK110/GK210, and Maxwell had GM200. Each of these was available in Tesla, Quadro, and GeForce cards, especially Titans.

Maxwell was interesting, though. NVIDIA was unable to leave 28nm, the node that Kepler launched on, so they created a second architecture for it. To increase performance without access to more transistor density, you need to make your designs bigger, more optimized, or simpler. GM200 was giant and optimized but, to reach the performance levels it achieved, it also needed to be simpler. Something had to go, and double-precision (FP64) performance was the big omission. NVIDIA was upfront about it at the Titan X launch, and told their GPU compute customers to keep purchasing Kepler if they valued FP64.

Fast-forward to Pascal.


93% of a GP100 at least...

NVIDIA has announced the Tesla P100, the company's newest (and most powerful) accelerator for HPC. Based on the Pascal GP100 GPU, the Tesla P100 is built on 16nm FinFET and uses HBM2.

nvidia-2016-gtc-pascal-banner.png

NVIDIA provided a comparison table, to which we have added what we know about a full GP100:

|                          | Tesla K40      | Tesla M40       | Tesla P100     | Full GP100     |
|--------------------------|----------------|-----------------|----------------|----------------|
| GPU                      | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) |
| SMs                      | 15             | 24              | 56             | 60             |
| TPCs                     | 15             | 24              | 28             | (30?)          |
| FP32 CUDA Cores / SM     | 192            | 128             | 64             | 64             |
| FP32 CUDA Cores / GPU    | 2880           | 3072            | 3584           | 3840           |
| FP64 CUDA Cores / SM     | 64             | 4               | 32             | 32             |
| FP64 CUDA Cores / GPU    | 960            | 96              | 1792           | 1920           |
| Base Clock               | 745 MHz        | 948 MHz         | 1328 MHz       | TBD            |
| GPU Boost Clock          | 810/875 MHz    | 1114 MHz        | 1480 MHz       | TBD            |
| FP64 GFLOPS              | 1680           | 213             | 5304           | TBD            |
| Texture Units            | 240            | 192             | 224            | 240            |
| Memory Interface         | 384-bit GDDR5  | 384-bit GDDR5   | 4096-bit HBM2  | 4096-bit HBM2  |
| Memory Size              | Up to 12 GB    | Up to 24 GB     | 16 GB          | TBD            |
| L2 Cache Size            | 1536 KB        | 3072 KB         | 4096 KB        | TBD            |
| Register File Size / SM  | 256 KB         | 256 KB          | 256 KB         | 256 KB         |
| Register File Size / GPU | 3840 KB        | 6144 KB         | 14336 KB       | 15360 KB       |
| TDP                      | 235 W          | 250 W           | 300 W          | TBD            |
| Transistors              | 7.1 billion    | 8 billion       | 15.3 billion   | 15.3 billion   |
| GPU Die Size             | 551 mm²        | 601 mm²         | 610 mm²        | 610 mm²        |
| Manufacturing Process    | 28 nm          | 28 nm           | 16 nm          | 16 nm          |

This table is designed for developers who are interested in GPU compute, so a few variables (like ROPs) are still unknown, but it still gives us huge insight into the “big Pascal” architecture. The jump to 16nm allows for about twice the number of transistors, 15.3 billion, up from 8 billion with GM200, in roughly the same die area: 610 mm², up from 601 mm².
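
To put numbers on “about twice,” here is the back-of-the-envelope density math as a quick Python sketch, using only the figures from the table above:

```python
# Transistor density, straight from the comparison table above.
chips = {
    "GM200 (28 nm)": (8.0e9, 601),   # transistors, die area in mm^2
    "GP100 (16 nm)": (15.3e9, 610),
}

for name, (transistors, area_mm2) in chips.items():
    print(f"{name}: {transistors / area_mm2 / 1e6:.1f} million transistors per mm^2")

ratio = (15.3e9 / 610) / (8.0e9 / 601)
print(f"Density improvement: {ratio:.2f}x")  # ~1.88x -- "about twice"
```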


A full GP100 processor will have 60 streaming multiprocessors (SMs), compared to GM200's 24, although each Pascal SM contains half as many FP32 CUDA cores. The GP100 part that is listed in the table above is actually partially disabled, cutting four of the sixty. This leads to 3584 single-precision (32-bit) CUDA cores, up from 3072 in GM200. (The full GP100 chip will have 3840 of these FP32 CUDA cores -- but we don't know when or where we'll see that.) The base clock is also significantly higher than Maxwell's, 1328 MHz versus ~1000 MHz for the Titan X and 980 Ti, although Ryan has overclocked those GPUs to ~1390 MHz with relative ease. This is interesting because, even though 10.6 TeraFLOPs is amazing, it's only about 20% more than what GM200 could pull off with an overclock.
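
Those throughput figures fall straight out of cores × clock. A minimal sketch of the arithmetic, assuming the usual two operations per core per clock (one fused multiply-add):

```python
# Peak throughput = CUDA cores x 2 ops per clock (FMA) x clock rate.
def peak_tflops(cores, clock_mhz):
    return cores * 2 * clock_mhz * 1e6 / 1e12

print(f"Tesla P100 FP32: {peak_tflops(3584, 1480):.1f} TFLOPS")  # ~10.6
print(f"Tesla P100 FP64: {peak_tflops(1792, 1480):.1f} TFLOPS")  # ~5.3 (5304 GFLOPS)
print(f"GM200 FP32 @ ~1390 MHz OC: {peak_tflops(3072, 1390):.1f} TFLOPS")  # ~8.5
```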

Continue reading our preview of the NVIDIA Pascal architecture!!

Tesla Motors Hires Peter Bannon of Apple

Subject: Graphics Cards, Processors | February 29, 2016 - 11:48 PM |
Tagged: tesla motors, tesla, SoC, Peter Bannon, Jim Keller

When we found out that Jim Keller had joined Tesla, we were a bit confused. He is highly skilled in processor design, and he moved to a company that does not design processors. Kind of weird, right? Two possibilities leap to mind: either he wanted to try something new in life, and Elon Musk hired him for his general management skills, or Tesla wants to get more involved in the production of their SoCs, possibly even designing their own.


Now Peter Bannon, who was a colleague of Jim Keller at Apple, has been hired by Tesla Motors. Chances are, the two of them were not independently struck by an abrupt career change that led them to the same company; that seems highly unlikely, to say the least. So it appears that Tesla Motors wants experienced chip designers in house. What for? We don't know. This is a lot of talent just to look over the shoulders of NVIDIA and other SoC partners to make sure Tesla has the upper hand in negotiations. Jim Keller is at Tesla as their “Vice-President of Autopilot Hardware Engineering.” We don't know what Peter Bannon's title will be.

And then, if Tesla Motors does get into creating their own hardware, we wonder what they will do with it. The company has a history of open development and of releasing patents into the public. That said, SoC design is a highly patent-encumbered field, although how much that matters depends on what, specifically, they are doing, and we have no idea.

Source: Electrek

Podcast #385 - Rise of the Tomb Raider Performance, 3x NVMe M.2 RAID-0, AMD Q1 Offerings

Subject: General Tech | February 4, 2016 - 04:53 PM |
Tagged: video, Trion 150, tesla, steam os, Samsung, rise of the tomb raider, podcast, ocz, NVMe, Jim Keller, amd, 950 PRO

PC Perspective Podcast #385 - 02/04/2016

Join us this week as we discuss Rise of the Tomb Raider performance, a triple RAID-0 NVMe array and more!

You can subscribe to us through iTunes and you can still access it directly through the RSS page HERE.

The URL for the podcast is: http://pcper.com/podcast - Share with your friends!

Hosts: Ryan Shrout, Jeremy Hellstrom, Josh Walrath, and Allyn Malventano

Subscribe to the PC Perspective YouTube Channel for more videos, reviews and podcasts!!

So That's Where Jim Keller Went To... Tesla Motors...

Subject: General Tech, Processors, Mobile | January 29, 2016 - 10:28 PM |
Tagged: tesla, tesla motors, amd, Jim Keller, apple

Jim Keller, a huge name in the semiconductor industry for his work at AMD and Apple, recently left AMD before the launch of the Zen architecture. This made us nervous, because when a big name leaves a company before a product launch, it could either be that their work is complete... or they're evacuating before a stink-bomb detonates and the whole room smells like rotten eggs.


It turns out a third option is possible: Elon Musk offers you a job making autonomous vehicles. Jim Keller's job title at Tesla will be Vice President of Autopilot Hardware Engineering. I could see this position being enticing, to say the least, even if you are confident in your previous employer's upcoming product stack. It doesn't tell us whether AMD's Zen architecture will be good or bad, but it nullifies the predictions made when Jim Keller left AMD, at least until further notice.

We don't know who approached whom, or when.

Another point of note: Tesla Motors currently uses NVIDIA Tegra SoCs in their cars, and NVIDIA is (obviously) a competitor of Jim Keller's former employer, AMD. It sounds like Jim Keller is moving into a somewhat different role than he had at AMD and Apple, but it could be interesting if Tesla starts taking chip design in-house, customizing silicon to their specific needs and taking responsibilities away from NVIDIA.

The first time he was at AMD, he was the lead architect of the Athlon 64 processor, and he co-authored x86-64. When he worked at Apple, he helped design the Apple A4 and A5 processors, which were the first two that Apple created in-house; the first three iPhone processors were Samsung SoCs.

Phoronix Tests Almost a Decade of GPUs

Subject: Graphics Cards | January 20, 2016 - 08:26 PM |
Tagged: nvidia, linux, tesla, fermi, kepler, maxwell

It's nice to see long-term roundups every once in a while. They do not really provide useful information for someone looking to make a purchase, but they show how our industry is changing (or not). In this case, Phoronix tested twenty-seven NVIDIA GeForce cards across four architectures: Tesla, Fermi, Kepler, and Maxwell. In other words, from the GeForce 8 series all the way up to the GTX 980 Ti.


Image Credit: Phoronix

Nine years of advancements in ASIC design, with a doubling period of 18 months, should yield a 64-fold improvement. The transistor count falls short, showing about a 12-fold increase between the Titan X and the largest first-wave Tesla, although that is largely out of the hands of a fabless semiconductor designer. The main reason I include this figure is to show the actual Moore's Law trend over this time span, but it also highlights the slowdown in process technology.

Performance per watt does depend on NVIDIA, though, and the ratio between the GTX 980 Ti and the 8500 GT is about 72:1. While this is slightly better than the target 64:1 ratio, these parts are from very different locations in their respective product stacks. Swap the 8500 GT for the following year's 9800 GTX, which makes it a comparison between top-of-the-line GPUs of their respective eras, and you see just a 6.2x improvement in performance per watt versus the GTX 980 Ti. On the other hand, that part was outstanding for its era.
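
For the curious, the doubling arithmetic behind those figures looks like this (a quick sketch; the release-date spans are approximate):

```python
import math

# An 18-month doubling period over nine years predicts 2^6 = 64x.
print(f"Expected: {2 ** (9 / 1.5):.0f}x")

# Implied doubling period from the measured performance-per-watt ratios:
for label, years, ratio in (
    ("8500 GT  -> GTX 980 Ti", 9, 72.0),  # span per the article
    ("9800 GTX -> GTX 980 Ti", 8, 6.2),   # roughly a year shorter
):
    print(f"{label}: {ratio:g}x implies doubling every "
          f"{years / math.log2(ratio):.1f} years")
```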

I should note that each of these tests takes place on Linux. It might not perfectly reflect the landscape on Windows but, again, it's interesting in its own right.

Source: Phoronix

Tokyo Tech Goes Green with KFC (NVIDIA and Efficiency)

Subject: General Tech, Graphics Cards, Systems | November 22, 2013 - 02:47 AM |
Tagged: nvidia, tesla, supercomputing

GPUs are very efficient in terms of operations per watt. Their architecture is best suited to gigantic bundles of similar calculations, such as a set of operations applied to every entry of a large blob of data. Those are also the tasks that take up the most computation time in, not surprisingly, 3D graphics (where you need to do something to every pixel, fragment, vertex, etc.). The same is true of scientific calculations, financial and other "big data" services, weather prediction, and so forth.


Tokyo Tech's KFC achieves over 4 GFLOPS per watt of power draw from the 160 Tesla K20X GPUs in its cluster. That is about 25% more calculations per watt than the current leader of the Green500 (the CINECA Eurora System in Italy, at 3.208 GFLOPS/W).
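
The 25% figure is simple division, and the same numbers hint at the cluster's overall scale. A rough sketch, assuming the K20X's 1.31 TFLOPS FP64 peak (the Green500 ranks by measured Linpack, so treat these as ballpark values):

```python
# Efficiency gap, from the figures quoted above.
kfc, eurora = 4.0, 3.208  # GFLOPS per watt
print(f"KFC vs. Eurora: {(kfc / eurora - 1) * 100:.0f}% more work per watt")  # ~25%

# Ballpark cluster scale, assuming 1.31 TFLOPS FP64 peak per Tesla K20X.
total_gflops = 160 * 1310
print(f"Peak FP64: ~{total_gflops / 1000:.0f} TFLOPS")              # ~210
print(f"Power at 4 GFLOPS/W: ~{total_gflops / kfc / 1000:.0f} kW")  # ~52
```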

One interesting trait: this supercomputer will be cooled by oil immersion. NVIDIA offers passively cooled Tesla cards which, according to my understanding of how this works, are very well suited to this fluid system. I am fairly certain that they remove all of the fans before dunking the servers (I had figured they would be left on).

By the way, was it intentional to name computers dunked in giant vats of heat-conducting oil, "KFC"?

Intel has done a similar test, which we reported on last September, submerging numerous servers for over a year. Another benefit of being green is that you are not nearly as concerned about air conditioning.

NVIDIA is actually taking it to the practical market with another nice supercomputer win.


Source: NVIDIA

NVIDIA Tesla K40: GK110b Gets a Career (and more vRAM)

Subject: General Tech, Graphics Cards | November 18, 2013 - 08:33 PM |
Tagged: tesla, nvidia, K40, GK110b

The Tesla K20X has ruled NVIDIA's headless GPU portfolio for quite some time now. The part is based on the GK110 chip with 192 shader cores disabled, like the GeForce Titan, and achieves 3.9 TeraFLOPs of single-precision compute performance (1.31 TeraFLOPs in double precision). Also like the Titan, the K20X offers 6GB of memory.


The Tesla K40

So the layout was basically the following: GK104 ruled the gamer market, except for the (in hindsight) oddly-positioned GeForce Titan, which was basically a Tesla K20X without a few features, like error correction (ECC). The Quadro K6000 was the only card to utilize all 2880 CUDA cores.

Then, at the recent G-Sync event, NVIDIA CEO Jen-Hsun Huang announced the GeForce GTX 780 Ti. This card uses the GK110b processor and incorporates all 2880 CUDA cores, albeit with reduced double-precision performance (for the 780 Ti, not for GK110b in general). So now we have Quadro and GeForce with full-power Kepler; your move, Tesla.

And they did: the Tesla K40 launched this morning, and it brought more than just cores.


A brief overview

The Kepler GeForce launch was famous for its inclusion of GPU Boost, a feature absent from the Tesla line. It turns out that NVIDIA was paying attention to the feature but wanted to include it in a way that suited data centers. GeForce cards boost based on the status of the card, its temperature or its power draw. This is apparently unsuitable for data centers, because they would like every unit operating at very similar performance. The Tesla K40 has a base clock of 745 MHz but gives the data center two boost clocks that can be manually set: 810 MHz and 875 MHz.
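
For reference, application clocks are exposed through NVML (and the `nvidia-smi -ac` command). Below is a minimal sketch using the pynvml Python bindings; the 3004 MHz memory / 875 MHz graphics pair is, as I understand it, the K40's top application-clock state, and setting clocks requires administrative privileges:

```python
# A sketch of reading and setting Tesla application clocks via NVML (pynvml).
from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetSupportedMemoryClocks, nvmlDeviceGetSupportedGraphicsClocks,
    nvmlDeviceSetApplicationsClocks,
)

nvmlInit()
try:
    gpu = nvmlDeviceGetHandleByIndex(0)  # first GPU in the system
    # List the supported memory/graphics clock pairs.
    for mem in nvmlDeviceGetSupportedMemoryClocks(gpu):
        print(mem, "MHz ->", nvmlDeviceGetSupportedGraphicsClocks(gpu, mem))
    # Pin the K40 to its highest boost state: 3004 MHz memory, 875 MHz graphics.
    nvmlDeviceSetApplicationsClocks(gpu, 3004, 875)
finally:
    nvmlShutdown()
```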


Relative performance benchmarks

The Tesla K40 also doubles the amount of RAM, to 12GB. Of course, this allows the GPU to work on larger data sets without streaming data in from system memory, or worse.
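
As a toy example of what the extra capacity means, consider the largest square double-precision matrix that fits entirely on the card (a hypothetical upper bound that ignores workspace, buffers, and ECC overhead):

```python
import math

# Largest n x n FP64 matrix that fits in card memory (8 bytes per element).
for vram_gb in (6, 12):  # Tesla K20X vs. Tesla K40
    n = int(math.sqrt(vram_gb * 1024**3 / 8))
    print(f"{vram_gb} GB: up to ~{n:,} x {n:,}")
```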

There is currently no public information on pricing for the Tesla K40 but it is available starting today. What we do know are the launch OEM partners: ASUS, Bull, Cray, Dell, Eurotech, HP, IBM, Inspur, SGI, Sugon, Supermicro, and Tyan.

If you are interested in testing out a K40, NVIDIA has remotely hosted clusters that your company can sign up for at the GPU Test Drive website.

Press blast after the break!

Source: NVIDIA

NVIDIA's plans for Tegra and Tesla

Subject: General Tech | April 24, 2013 - 05:38 PM |
Tagged: Steve Scott, nvidia, HPC, tesla, logan, tegra

The Register had a chance to sit down with Steve Scott, once CTO of Cray and now CTO of NVIDIA's Tesla projects, to discuss the future of their add-in cards, as well as that of x86 in the server room. They discussed Tegra, and why it is not receiving the same amount of attention at NVIDIA as Tesla is, as well as some of the fundamental differences between the chips, both currently and going forward. NVIDIA plans to unite GPU and CPU in both families of chips, likely with a custom interface as opposed to placing them on the same die, though the two will continue to be designed for very different functions. A lot of the article focuses on Tegra: its memory bandwidth and, most importantly, its networking capabilities, as it seems NVIDIA is focused on the server room and on providing hundreds or thousands of interconnected Tegra processors to compete directly with x86 offerings. Read on for the full interview.


"Jen-Hsun Huang, co-founder and CEO of Nvidia has been perfectly honest about the fact that the graphics chip maker didn't intend to get into the supercomputing business. Rather, it was founded by a bunch of gamers who wanted better graphics cards to play 3D games. Fast forward two decades, though, and the Nvidia Tesla GPU coprocessor and the CUDA programming environment have taken the supercomputer world by storm."


Source: The Register

GTC 2013: Pedraforca Is A Power Efficient ARM + GPU Cluster For Homogeneous (GPU) Workloads

Subject: General Tech, Graphics Cards | March 20, 2013 - 05:47 PM |
Tagged: tesla, tegra 3, supercomputer, pedraforca, nvidia, GTC 2013, GTC, graphics cards, data centers

There is a lot of talk about heterogeneous computing at GTC, in the sense of adding graphics cards to servers. If you have HPC workloads that can benefit from GPU parallelism, adding GPUs gives you computing performance in less physical space, and at lower power, than a CPU-only cluster (for equivalent TFLOPS).

However, there was a session at GTC that actually took things to the opposite extreme. Instead of a CPU-only cluster or a mixed cluster, Alex Ramirez (leader of the Heterogeneous Architectures Group at the Barcelona Supercomputing Center) is proposing a homogeneous GPU cluster called Pedraforca.

Pedraforca V2 combines NVIDIA Tesla GPUs with low power ARM processors. Each node is comprised of the following components:

  • 1 x Mini-ITX carrier board
  • 1 x Q7 module (which hosts the ARM SoC and memory)
    • Current config is one Tegra 3 @ 1.3GHz and 2GB DDR2
  • 1 x NVIDIA Tesla K20 accelerator card (1170 GFLOPS)
  • 1 x InfiniBand 40Gb/s card (via Mellanox ConnectX-3 slot)
  • 1 x 2.5" SSD (SATA 3 MLC, 250GB)

The ARM processor is used solely for booting the system and facilitating GPU communication between nodes. It is not intended to be used for computing. According to Dr. Ramirez, in situations where running code on a CPU would be faster, it would be best to have a small number of Intel Xeon-powered nodes do the CPU-favorable computing, and then offload the parallel workloads to the GPU cluster over the InfiniBand connection (though this is less than ideal; Pedraforca is most efficient with data sets that can be processed solely on the Tesla cards).


While Pedraforca is not necessarily locked to NVIDIA's Tegra hardware, it is currently the only SoC that meets their needs: the system requires an ARM chip with PCI-E support. The Tegra 3 SoC has four PCI-E lanes, so the carrier board uses two PLX chips to allow the Tesla and InfiniBand cards to both be connected.

The researcher stated that he is also looking forward to using NVIDIA's upcoming Logan processor in the Pedraforca cluster. It will reportedly be possible to upgrade existing Pedraforca clusters by replacing the existing (Tegra 3) Q7 module with one that carries the Logan SoC, when it is released.

Pedraforca V2 has an initial cluster size of 64 nodes. While the speaker was reluctant to provide TFLOPS performance numbers, as they would depend on the workload, 64 Tesla K20 cards should provide respectable performance. The intent of the cluster is to save on power costs by using a low-power CPU. If your server kernel and applications can run on GPUs alone, there are noticeable power savings to be had by switching from a ~100W Intel Xeon chip to a lower-power (approximately 2-3W) Tegra 3 processor. If you have a kernel that needs to run on a CPU, it is recommended to run the OS on an Intel server and transfer just the GPU work to the Pedraforca cluster. Each Pedraforca node is reportedly under 300W, with the Tesla card being the majority of that figure. Despite the limitations, and the niche nature of the workloads and software necessary to get the full power-saving benefit, Pedraforca is certainly an interesting take on a homogeneous server cluster!
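
A rough sketch of that trade-off, using only the wattages quoted above (actual draw depends on load):

```python
# Cluster-level power arithmetic from the figures quoted above.
nodes = 64
xeon_w, tegra_w = 100, 3  # ~100 W Xeon host vs. ~2-3 W Tegra 3 host
node_w = 300              # reported per-node ceiling, mostly the Tesla K20

print(f"Host-CPU savings: ~{nodes * (xeon_w - tegra_w) / 1000:.1f} kW")  # ~6.2 kW
print(f"Cluster ceiling:  ~{nodes * node_w / 1000:.1f} kW")              # ~19.2 kW
```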


In another session, on the path to exascale computing, power use in data centers was listed as one of the biggest hurdles to reaching ExaFLOP levels of performance. While Pedraforca is not the answer to exascale, it should at least be a useful learning experience in wringing the most parallelism out of code and pushing GPGPU to its limits. That research will help other clusters use their GPUs more efficiently as researchers explore the future of computing.

The Pedraforca project built upon research conducted on Tibidabo, a multi-core ARM CPU cluster, and on CARMA (a CUDA-on-ARM development kit), which pairs a Tegra SoC with an NVIDIA Quadro card. The slides shown at the session included CARMA benchmarks and a Tibidabo cluster.

Stay tuned to PC Perspective for more GTC 2013 coverage!