NVIDIA rides shotgun in autonomous vehicles

Subject: General Tech | August 9, 2017 - 12:43 PM |
Tagged: nvidia, autonomous vehicles, HPC

NVIDIA has previously shown their interest in providing the brains for autonomous vehicles, their Xavier chip is scheduled for release some time towards the end of the year.  They are continuing their efforts to break into this market by investing in start ups in a program called GPU Ventures.  Today DigiTimes reports that NVIDIA purchased a stake in a Chinese company called Tusimple which is developing autonomous trucks.  The transportation of goods may not be as interesting to the average consumer as self driving cars but the market could be more lucrative; there are a lot of trucks on the roads of the world and they are unlikely to be replaced any time soon.


"Tusimple, a Beijing-based startup focused on developing autonomous trucks, has disclosed that Nvidia will make a strategic investment to take a 3% stake in the company. Nvidia's investment is part of a a Series B financing round, Tusimple indicated."

Here is some more Tech News from around the web:

Tech Talk

Source: DigiTimes

Go west young researcher! AMD's Radeon Vega Frontier Edition is available now

Subject: Graphics Cards | June 27, 2017 - 06:51 PM |
Tagged: Vega FE, Vega, HPC, amd

AMD have released their new HPC card, the Radeon Vega Frontier Edition, which Jim told you about earlier this week.  The air cooled version is available now, with an MSRP of $999USD followed by a water-cooled edition arriving in Q3 with price tag of $1499.


The specs they list for the cards are impressive and compare favourably to NVIDIA's P100 which is the card AMD tested against, offering higher TFLOPS for both FP32 and FP16 operations though the memory bandwidth lags a little behind.

  Radeon Vega
Frontier Edition
Quadro GP100
GPU Vega GP100
Peak/Boost Clock 1600 MHz 1442 MHz
FP32 TFLOPS (SP) 13.1 10.3


Memory Interface 1.89 Gb/s
2048-bit HBM2
1.4 Gbps
4096-bit HBM2
Memory Bandwidth 483 GB/s 716 GB/s
Memory Size 16GB HBC* 16GB
TDP 300 W air, 375 W water 235 W

The memory size for the Vega is interesting, HBC is AMDs High Bandwidth Cache Controller which not only uses the memory cache more effectively but is able to reach out to other high performance system memory for help.  AMD states that the Radeon Vega Frontier Edition has the capability of expanding traditional GPU memory to 256TB; perhaps allowing new texture mods for Skyrim or Fallout!  Expect to see more detail on this feature once we can get our hands on a card to abuse, nicely of course.


AMD used the DeepBench Benchmark to provide comparative results, the AMD Vega FE system used a dual socketed system with Xeon E5 2640v4s @ 2.4Ghz 10C/20T, 32GB DDR4 per socket, on Ubuntu 16.04 LTS with ROCm 1.5, and OpenCL 1.2, the NVIDIA Tesla P100 system used the same hardware with the CuDNN 5.1, Driver 375.39 and Cuda version 8.0.61 drivers.  Those tests showed the AMD system completing the benchmark in 88.7ms, the Tesla P100 completed in 133.1 ms, quite an impressive lead for AMD.  Again, there will be much more information on performance once the Vega FE can be tested.


Read on to hear about the new card in AMD's own words, with links to their sites.

Source: AMD

SoftBank Invests $4 Billion In NVIDIA, Becomes Fourth Largest Shareholder

Subject: General Tech, Graphics Cards | May 27, 2017 - 12:18 AM |
Tagged: vision fund, softbank, nvidia, iot, HPC, ai

SoftBank, the Tokyo, Japan based Japanese telecom and internet technology company has reportedly quietly amassed a 4.9% stake in graphics chip giant NVIDIA. Bloomberg reports that SoftBank has carefully invested $4 billion into NVIDIA avoiding the need to get regulatory approval in the US by keeping its investment under 5% of the company. SoftBank has promised the current administration that it will invest $50 billion into US tech companies and it seems that NVIDIA is the first major part of that plan.


NVIDIA's Tesla V100 GPU.

Led by Chairman and CEO Masayoshi Son, SoftBank is not afraid to invest in technology companies it believes in with major past acquisitions and investments in companies like ARM Holdings, Sprint, Alibaba, and game company Supercell.

The $4 billion-dollar investment makes SoftBank the fourth largest shareholder in NVIDIA, which has seen the company’s stock rally from SoftBank’s purchases and vote of confidence. The (currently $93) $100 billion Vision Fund may also follow SoftBank’s lead in acquiring a stake in NVIDIA which is involved in graphics, HPC, AI, deep learning, and gaming.

Overall, this is good news for NVIDIA and its shareholders. I am curious what other plays SoftBank will make for US tech companies.

What are your thoughts on SoftBank investing heavily in NVIDIA?

Intel Persistent Memory Using 3D XPoint DIMMs Expected Next Year

Subject: General Tech, Memory, Storage | May 26, 2017 - 10:14 PM |
Tagged: XPoint, Intel, HPC, DIMM, 3D XPoint

Intel recently teased a bit of new information on its 3D XPoint DIMMs and launched its first public demonstration of the technology at the SAP Sapphire conference where SAP’s HANA in-memory data analytics software was shown working with the new “Intel persistent memory.” Slated to arrive in 2018, the new Intel DIMMs based on the 3D XPoint technology developed by Intel and Micron will work in systems alongside traditional DRAM to provide a pool of fast, low latency, and high density nonvolatile storage that is a middle ground between expensive DDR4 and cheaper NVMe SSDs and hard drives. When looking at the storage stack, the storage density increases along with latency as it gets further away from the CPU. The opposite is also true, as storage and memory gets closer to the processor, bandwidth increases, latency decreases, and costs increase per unit of storage. Intel is hoping to bridge the gap between system DRAM and PCI-E and SATA storage.

Intel persistent memory DIMM.jpg

According to Intel, system RAM offers up 10 GB/s per channel and approximately 100 nanoseconds of latency. 3D XPoint DIMMs will offer 6 GB/s per channel and about 250 nanoseconds of latency. Below that is the 3D XPoint-based NVMe SSDs (e.g. Optane) on a PCI-E x4 bus where they max out the bandwidth of the bus at ~3.2 GB/s and 10 microseconds of latency. Intel claims that non XPoint NVMe NAND solid state drives have around 100 microsecomds of latency, and of course, it gets worse from there when you go to NAND-based SSDs or even hard drives hanging of the SATA bus.

Intel’s new XPoint DIMMs have persistent storage and will offer more capacity that will be possible and/or cost effective with DDR4 DRAM. In giving up some bandwidth and latency, enterprise users will be able to have a large pool of very fast storage for storing their databases and other latency and bandwidth sensitive workloads. Intel does note that there are security concerns with the XPoint DIMMs being nonvolatile in that an attacker with physical access could easily pull the DIMM and walk away with the data (it is at least theoretically possible to grab some data from RAM as well, but it will be much easier to grab the data from the XPoint sticks. Encryption and other security measures will need to be implemented to secure the data, both in use and at rest.

Intel Slide XPoint Info.jpg

Interestingly, Intel is not positioning the XPoint DIMMs as a replacement for RAM, but instead as a supplement. RAM and XPoint DIMMs will be installed in different slots of the same system and the DDR4 RAM will be used for the OS and system critical applications while the XPoint pool of storage will be used for storing data that applications will work on much like a traditional RAM disk but without needing to load and save the data to a different medium for persistent storage and offering a lot more GBs for the money.

While XPoint is set to arrive next year along with Cascade Lake Xeons, it will likely be a couple of years before the technology takes off. Supporting it is going to require hardware and software support for the workstations and servers as well as developers willing to take advantage of it when writing their specialized applications. Fortunately, Intel started shipping the memory modules to its partners for testing earlier this year. It is an interesting technology and the DIMM solution and direct CPU interface will really let the 3D XPoint memory shine and reach its full potential. It will primarily be useful for the enterprise, scientific, and financial industries where there is a huge need for faster and lower latency storage that can accommodate massive (multiple terabyte+) data sets that continue to get larger and more complex. It is a technology that likely will not trickle down to consumers for a long time, but I will be ready when it does. In the meantime, I am eager to see what kinds of things it will enable the big data companies and researchers to do! Intel claims it will not only be useful at supporting massive in-memory databases and accelerating HPC workloads but for things like virtualization, private clouds, and software defined storage.

What are your thoughts on this new memory tier and the future of XPoint?

Also read:

Source: Intel

AMD Prepares Zen-Based "Naples" Server SoC For Q2 Launch

Subject: Processors | March 7, 2017 - 09:02 AM |
Tagged: SoC, server, ryzen, opteron, Naples, HPC, amd

Over the summer, AMD introduced its Naples platform which is the server-focused implementation of the Zen microarchitecture in a SoC (System On a Chip) package. The company showed off a prototype dual socket Naples system and bits of information leaked onto the Internet, but for the most part news has been quiet on this front (whereas there were quite a few leaks of Ryzen which is AMD's desktop implementation of Zen).

The wait seems to be finally over, and AMD appears ready to talk more about Naples which will reportedly launch in the second quarter of this year (Q2'17) with full availability of processors and motherboards from OEMs and channel partners (e.g. system integrators) happening in the second half of 2017. Per AMD, "Naples" processors are SoCs with 32 cores and 64 threads that support 8 memory channels and a (theoretical) maximum of 2TB DDR4-2667. (Using the 16GB DIMMs available today, Naples support 256GB of DDR4 per socket.) Further, the Naples SoC features 64 PCI-E 3.0 lanes. Rumors also indicated that the SoC included support for sixteen 10GbE interfaces, but AMD has yet to confirm this or the number of SATA/SAS ports offered. AMD did say that Naples has an optimized cache structure for HPC compute and "dedicated security hardware" though it did not go into specifics. (The security hardware may be similar to the ARM TrustZone technology it has used in the past.) 

AMD Naples.jpg

Naples will be offered in single and dual socket designs with dual socket systems offering up 64 cores, 128 threads, 32 DDR4 DIMMs (512 GB using 16 GB modules) on 16 total memory channels with 21.3 GB/s per channel bandwidth (170.7 GB/s per SoC), 128 PCI-E 3.0 lanes, and an AMD Infinity Fabric interconnect between the two processor sockets.

AMD claims that its Naples platform offers up to 45% more cores, 122% more memory bandwidth, and 60% more I/O than its competition. For its internal comparison, AMD chose the Intel Xeon E5-2699A V4 which is the processor with highest core count that is intended for dual socket systems (there are E7s with more cores but those are in 4P systems). The Intel Xeon E5-2699A V4 system is a 14nm 22 core (44 thread) processor clocked at 2.4 GHz base to 3.6 GHz turbo with 55MB cache. It supports four channels of DDR4-2400 for a maximum bandwidth of 76.8 GB/s (19.2 GB/s per channel) as well as 40 PCI-E 3.0 lanes. A dual socket system with two of those Xeons features 44 cores, 88 threads, and a theoretical maximum of 1.54 TB of ECC RAM.

AMD's reference platform with two 32 core Naples SoCs and 512 GB DDR4 2400 MHz was purportedly 2.5x faster at the seismic analysis workload than the dual Xeon E5-2699A V4 OEM system with 1866 MHz DDR4. Curiously, when AMD compared a Naples reference platform with 44 cores enabled and running 1866 MHz memory to a similarly configured Intel system the Naples platform was twice as fast. It seems that the increased number of memory channels and memory bandwidth are really helping the Naples platform pull ahead in this workload.

AMD Naples and Radeon Instinct.png

The company also intends Naples to power machine learning and AI projects with servers that feature Naples processors and Radeon Instinct graphics processors.

AMD further claims that its Naples platform is more balanced and suited to cloud computing and scientific and HPC workloads than the competition. Specifically, Forrest Norrod the Senior Vice president and General Manager of AMD's Enterprise, Embedded, and Semi-Custom Business Unit stated:

“’Naples’ represents a completely new approach to supporting the massive processing requirements of the modern datacenter. This groundbreaking system-on-chip delivers the unique high-performance features required to address highly virtualized environments, massive data sets and new, emerging workloads.”

There is no word on pricing yet, but it should be competitive with Intel's offerings (the E5-2699A V4 is $4,938). AMD will reportedly be talking data center strategy and its upcoming products during the Open Compute Summit later this week, so hopefully there will be more information released at those presentations.

(My opinions follow)

This is one area where AMD needs to come out strong with support from motherboard manufacturers, system integrators, OEM partners, and OS and software validation to succeed. Intel is not likely to take AMD encroaching on its lucrative server market share lightly, and AMD is going to have a long road ahead of it to regain the market share it once had in this area, but it does have a decent architecture on its hands to build off of with Zen and if it can secure partner support Intel is certainly going to have competition here that it has not had to face in a long time. Intel and AMD competing over the data center market is a good thing, and as both companies bring new technology to market it will trickle down into the consumer level hardware. Naples' success in the data center could mean a profitable AMD with R&D money to push Zen as far as it can – so hopefully they can pull it off.

What are your thoughts on the Naples SoC and AMD's push into the server market?

Also read:

Source: AMD

Optalysys GENESYS Optical Co-Processor Announced

Subject: General Tech | November 13, 2016 - 06:24 PM |
Tagged: optical computing, HPC

We occasionally discuss photonic computers as news is announced, because we're starting to reach “can count the number of atoms with fingers and toes” sizes of features. For instance, we reported on a chip made by University of Colorado Boulder and UC Berkeley that had both electric and photonic integrated circuits on it.

This announcement from Optalysys is completely different.

The Optalysys GENESYS is a PCIe add-in board that is designed to accelerate certain tasks. For instance, light is fourier transformed when it passes through a lens, and reverse fourier transformed when it is refocused by a second lens. When I was taking fourth-year optics back in 2009, our professor mentioned that scientists used this trick to solve fourier transforms by flashing light through a 2D pattern, passing through a lens, and being projected upon film. This image was measured pixel by pixel, with each intensity corresponding to the 2D fourier transform's value of the original pattern. Fourier transforms are long processes to solve algebraically, especially without modern computers, so this was a huge win; you're solving a 2D grid of values in a single step.

These are the sort of tricks that the Optalysys GENESYS claims to use. They claim that this will speed up matrix multiplications, convolutions (fourier transforms -- see previous paragraph), and pattern recognition (such as for DNA sequencing). Matrix multiplications is a bit surprising to me, because it's not immediately clear how you can abuse light dynamics to calculate this, but someone who has more experience in this field will probably say “Scott, you dummy, we've been doing this since the 1800s” or something.


Image Credit: Tom Roelandts
The circles of the filter (center) correspond to the frequencies it blocks or permits.
The frequencies correspond to how quick an image changes.
This is often used for noise reduction or edge detection, but it's just a filter in fourier space.
You could place it between two lenses to modify the image in that way.

From a performance standpoint, their “first demonstrator system” operated at 20Hz with 500x500 resolution. However, their video claims they expect to have a “PetaFLOP-equivalent co-processor” by the end of the 2017. For comparison, modern GPUs are just barely in the 10s of TeraFLOPs, but that's about as useful as comparing a CPU core to a digital signal processor (DSP). (I'm not saying this is analogous to a DSP, but performance comparisons are about as useful.)

Optalysys expects to have a 1 PetaFLOP co-processor available by the end of the year.

Source: Optalysys

Eight is enough, looking at how the new Telsa HPC cards from NVIDIA will work

Subject: General Tech | September 14, 2016 - 01:06 PM |
Tagged: pascal, tesla, p40, p4, nvidia, neural net, m40, M4, HPC

The Register have package a nice explanation of the basics of how neural nets work in their quick look at NVIDIA's new Pascal based HPC cards, the P4 and P40.  The tired joke about Zilog or Dick Van Patten stems from the research which has shown that 8-bit precision is most effective when feeding data into a neural net.  Using 16 or 32-bit values slows the processing down significantly while adding little precision to the results produced.  NVIDIA is also perfecting a hybrid mode, where you can opt for a less precise answer produced by your local, presumably limited, hardware or you can upload the data to the cloud for the full treatment.  This is great for those with security concerns or when a quicker answer is more valuable than a more accurate one.

As for the hardware, NVIDIA claims the optimizations on the P40 will make it "40 times more efficient" than an Intel Xeon E5 CPU and it will also provide slightly more throughput than the currently available Titan X.  You can expect to see these arrive in the market sometime over then next two months.


"Nvidia has designed a couple of new Tesla processors for AI applications – the P4 and the P40 – and is talking up their 8-bit math performance. The 16nm FinFET GPUs use Nv's Pascal architecture and follow on from the P100 launched in June. The P4 fits on a half-height, half-length PCIe card for scale-out servers, while the beefier P40 has its eyes set on scale-up boxes."

Here is some more Tech News from around the web:

Tech Talk

Source: The Register

That old chestnut again? Intel compares their current gen hardware against older NVIDIA kit

Subject: General Tech | August 17, 2016 - 12:41 PM |
Tagged: nvidia, Intel, HPC, Xeon Phi, maxwell, pascal, dirty pool

There is a spat going on between Intel and NVIDIA over the slide below, as you can read about over at Ars Technica.  It seems that Intel have reached into the industries bag of dirty tricks and polished off an old standby, testing new hardware and software against older products from their competitors.  In this case it was high performance computing products which were tested, Intel's new Xeon Phi against NVIDIA's Maxwell, tested on an older version of the Caffe AlexNet benchmark.

NVIDIA points out that not only would they have done better than Intel if an up to date version of the benchmarking software was used, but that the comparison should have been against their current architecture, Pascal.  This is not quite as bad as putting undocumented flags into compilers to reduce the performance of competitors chips or predatory discount programs but it shows that the computer industry continues to have only a passing acquaintance with fair play and honest competition.


"At this juncture I should point out that juicing benchmarks is, rather sadly, par for the course. Whenever a chip maker provides its own performance figures, they are almost always tailored to the strength of a specific chip—or alternatively, structured in such a way as to exacerbate the weakness of a competitor's product."

Here is some more Tech News from around the web:

Tech Talk

Source: Ars Technica
Manufacturer: NVIDIA

93% of a GP100 at least...

NVIDIA has announced the Tesla P100, the company's newest (and most powerful) accelerator for HPC. Based on the Pascal GP100 GPU, the Tesla P100 is built on 16nm FinFET and uses HBM2.


NVIDIA provided a comparison table, which we added what we know about a full GP100 to:

  Tesla K40 Tesla M40 Tesla P100 Full GP100
GPU GK110 (Kepler) GM200 (Maxwell) GP100 (Pascal) GP100 (Pascal)
SMs 15 24 56 60
TPCs 15 24 28 (30?)
FP32 CUDA Cores / SM 192 128 64 64
FP32 CUDA Cores / GPU 2880 3072 3584 3840
FP64 CUDA Cores / SM 64 4 32 32
FP64 CUDA Cores / GPU 960 96 1792 1920
Base Clock 745 MHz 948 MHz 1328 MHz TBD
GPU Boost Clock 810/875 MHz 1114 MHz 1480 MHz TBD
FP64 GFLOPS 1680 213 5304 TBD
Texture Units 240 192 224 240
Memory Interface 384-bit GDDR5 384-bit GDDR5 4096-bit HBM2 4096-bit HBM2
Memory Size Up to 12 GB Up to 24 GB 16 GB TBD
L2 Cache Size 1536 KB 3072 KB 4096 KB TBD
Register File Size / SM 256 KB 256 KB 256 KB 256 KB
Register File Size / GPU 3840 KB 6144 KB 14336 KB 15360 KB
TDP 235 W 250 W 300 W TBD
Transistors 7.1 billion 8 billion 15.3 billion 15.3 billion
GPU Die Size 551 mm2 601 mm2 610 mm2 610mm2
Manufacturing Process 28 nm 28 nm 16 nm 16nm

This table is designed for developers that are interested in GPU compute, so a few variables (like ROPs) are still unknown, but it still gives us a huge insight into the “big Pascal” architecture. The jump to 16nm allows for about twice the number of transistors, 15.3 billion, up from 8 billion with GM200, with roughly the same die area, 610 mm2, up from 601 mm2.


A full GP100 processor will have 60 shader modules, compared to GM200's 24, although Pascal stores half of the shaders per SM. The GP100 part that is listed in the table above is actually partially disabled, cutting off four of the sixty total. This leads to 3584 single-precision (32-bit) CUDA cores, which is up from 3072 in GM200. (The full GP100 architecture will have 3840 of these FP32 CUDA cores -- but we don't know when or where we'll see that.) The base clock is also significantly higher than Maxwell, 1328 MHz versus ~1000 MHz for the Titan X and 980 Ti, although Ryan has overclocked those GPUs to ~1390 MHz with relative ease. This is interesting, because even though 10.6 TeraFLOPs is amazing, it's only about 20% more than what GM200 could pull off with an overclock.

Continue reading our preview of the NVIDIA Pascal architecture!!

AMD Brings Dual Fiji and HBM Memory To Server Room With FirePro S9300 x2

Subject: Graphics Cards | April 5, 2016 - 02:13 AM |
Tagged: HPC, hbm, gpgpu, firepro s9300x2, firepro, dual fiji, deep learning, big data, amd

Earlier this month AMD launched a dual Fiji powerhouse for VR gamers it is calling the Radeon Pro Duo. Now, AMD is bringing its latest GCN architecture and HBM memory to servers with the dual GPU FirePro S9300 x2.

AMD Firepro S9300x2 Server HPC Card.jpg

The new server-bound professional graphics card packs an impressive amount of computing hardware into a dual-slot card with passive cooling. The FirePro S9300 x2 combines two full Fiji GPUs clocked at 850 MHz for a total of 8,192 cores, 512 TUs, and 128 ROPs. Each GPU is paired with 4GB of non-ECC HBM memory on package with 512GB/s of memory bandwidth which AMD combines to advertise this as the first professional graphics card with 1TB/s of memory bandwidth.

Due to lower clockspeeds the S9300 x2 has less peak single precision compute performance versus the consumer Radeon Pro Duo at 13.9 TFLOPS versus 16 TFLOPs on the desktop card. Businesses will be able to cram more cards into their rack mounted servers though since they do not need to worry about mounting locations for the sealed loop water cooling of the Radeon card.

  FirePro S9300 x2 Radeon Pro Duo R9 Fury X FirePro S9170
GPU Dual Fiji Dual Fiji Fiji Hawaii
GPU Cores 8192 (2 x 4096) 8192 (2 x 4096) 4096 2816
Rated Clock 850 MHz 1050 MHz 1050 MHz 930 MHz
Texture Units 2 x 256 2 x 256 256 176
ROP Units 2 x 64 2 x 64 64 64
Memory 8GB (2 x 4GB) 8GB (2 x 4GB) 4GB 32GB ECC
Memory Clock 500 MHz 500 MHz 500 MHz 5000 MHz
Memory Interface 4096-bit (HBM) per GPU 4096-bit (HBM) per GPU 4096-bit (HBM) 512-bit
Memory Bandwidth 1TB/s (2 x 512GB/s) 1TB/s (2 x 512GB/s) 512 GB/s 320 GB/s
TDP 300 watts ? 275 watts 275 watts
Peak Compute 13.9 TFLOPS 16 TFLOPS 8.60 TFLOPS 5.24 TFLOPS
Transistor Count 17.8B 17.8B 8.9B 8.0B
Process Tech 28nm 28nm 28nm 28nm
Cooling Passive Liquid Liquid Passive
MSRP $6000 $1499 $649 $4000

AMD is aiming this card at datacenter and HPC users working on "big data" tasks that do not require the accuracy of double precision floating point calculations. Deep learning tasks, seismic processing, and data analytics are all examples AMD says the dual GPU card will excel at. These are all tasks that can be greatly accelerated by the massive parallel nature of a GPU but do not need to be as precise as stricter mathematics, modeling, and simulation work that depend on FP64 performance. In that respect, the FirePro S9300 x2 has only 870 GLFOPS of double precision compute performance.

Further, this card supports a GPGPU optimized Linux driver stack called GPUOpen and developers can program for it using either OpenCL (it supports OpenCL 1.2) or C++. AMD PowerTune, and the return of FP16 support are also features. AMD claims that its new dual GPU card is twice as fast as the NVIDIA Tesla M40 (1.6x the K80) and 12 times as fast as the latest Intel Xeon E5 in peak single precision floating point performance. 

The double slot card is powered by two PCI-E power connectors and is rated at 300 watts. This is a bit more palatable than the triple 8-pin needed for the Radeon Pro Duo!

The FirePro S9300 x2 comes with a 3 year warranty and will be available in the second half of this year for $6000 USD. You are definitely paying a premium for the professional certifications and support. Here's hoping developers come up with some cool uses for the dual 8.9 Billion transistor GPUs and their included HBM memory!

Source: AMD