Subject: General Tech | November 12, 2015 - 02:46 AM | Tim Verry
Tagged: Tegra X1, nvidia, maxwell, machine learning, jetson, deep neural network, CUDA, computer vision
Nearly two years ago, NVIDIA unleashed the Jetson TK1, a tiny module for embedded systems based around the company's Tegra K1 "super chip." That chip was the company's first foray into CUDA-powered embedded systems capable of machine learning tasks such as object recognition and 3D scene processing, enabling things like accident avoidance and self-parking cars.
Now, NVIDIA is releasing an even more powerful kit called the Jetson TX1. This new development platform covers two pieces of hardware: the credit card sized Jetson TX1 module and a larger Jetson TX1 Development Kit that the module plugs into, which provides plenty of I/O options and pin outs. The dev kit can be used by software developers or for prototyping, while the module alone can be used in finalized embedded products.
NVIDIA foresees the Jetson TX1 being used in drones, autonomous vehicles, security systems, medical devices, and IoT devices coupled with deep neural networks, machine learning, and computer vision software. Devices would be able to learn from the environment in order to navigate safely, identify and classify objects of interest, and perform 3D mapping and scene modeling. NVIDIA partnered with several companies for proof-of-concepts including Kespry and Stereolabs.
Using the TX1, Kespry was able to use drones to classify and track, in real time, construction equipment moving around a construction site. Because sites and weather conditions vary, the drone could not be explicitly programmed for every scenario; instead, machine learning and computer vision allowed the drone to navigate the construction site while a deep neural network identified and classified the types of equipment it saw through its cameras. Meanwhile, Stereolabs used high resolution cameras and depth sensors to capture photos of buildings and then used software to reconstruct the 3D scene virtually for editing and modeling. You can find other proof-of-concept videos, including upgrading existing drones to be more autonomous, posted here.
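To make the classification step concrete, here is a toy sketch of what the final layer of such a deep neural network does: map a feature vector extracted from a camera frame to class scores and pick the most likely equipment type via a softmax. This is not Kespry's actual system; the class names, layer sizes, and random "features" are all made up for illustration.

```python
import numpy as np

# Hypothetical equipment classes for the construction-site example.
CLASSES = ["excavator", "bulldozer", "dump truck", "crane"]

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def classify(features, weights, bias):
    # One fully-connected layer followed by softmax over class scores.
    probs = softmax(features @ weights + bias)
    return CLASSES[int(np.argmax(probs))], float(np.max(probs))

rng = np.random.default_rng(0)
features = rng.standard_normal(64)             # stand-in for learned image features
weights = rng.standard_normal((64, len(CLASSES)))
bias = np.zeros(len(CLASSES))

label, confidence = classify(features, weights, bias)
print(label, round(confidence, 3))
```

In a real deployment the features would come from convolutional layers running on the TX1's GPU, with the heavy lifting done by cuDNN rather than NumPy.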
From the press release:
"Jetson TX1 will enable a new generation of incredibly capable autonomous devices," said Deepu Talla, vice president and general manager of the Tegra business at NVIDIA. "They will navigate on their own, recognize objects and faces, and become increasingly intelligent through machine learning. It will enable developers to create industry-changing products."
But what about the hardware side of things? Well, the TX1 is a respectable leap in hardware and compute performance. Sitting at 1 Teraflops of rated (FP16) compute performance, the TX1 pairs four ARM Cortex A57 and four ARM Cortex A53 64-bit CPU cores with a 256-core Maxwell-based GPU. Definitely respectable for its size and low power consumption, especially considering NVIDIA claims the SoC can best the Intel Skylake Core i7-6700K in certain workloads (thanks to the GPU portion). The module further contains 4GB of LPDDR4 memory and 16GB of eMMC flash storage.
In short, while on-module storage has not increased, RAM has doubled, FP16 compute performance has roughly tripled, and FP32 performance has jumped by roughly 57% versus the Jetson TK1's 2GB of DDR3 and 192-core Kepler GPU. The TX1 also uses a smaller 20nm process node (versus 28nm), and the chip is said to use "very little power." Networking support includes 802.11ac and Gigabit Ethernet. The chart below outlines the major differences between the two platforms.
| | Jetson TX1 | Jetson TK1 |
|---|---|---|
| GPU (Architecture) | 256-core (Maxwell) | 192-core (Kepler) |
| CPU | 4 x ARM Cortex A57 + 4 x A53 | "4+1" ARM Cortex A15 "r3" |
| RAM | 4 GB LPDDR4 | 2 GB LPDDR3 |
| eMMC | 16 GB | 16 GB |
| Compute Performance (FP16) | 1 TFLOP | 326 GFLOPS |
| Compute Performance (FP32) - via AnandTech | 512 GFLOPS (AT's estimation) | 326 GFLOPS (NVIDIA's number) |
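A quick back-of-the-envelope check shows where these compute numbers come from: peak GFLOPS is CUDA cores times two operations per cycle (a fused multiply-add) times clock speed, and Maxwell's FP16 path doubles that. The GPU clocks below (~1.0 GHz for the TX1, ~0.85 GHz for the TK1) are assumptions, since NVIDIA does not quote them here.

```python
# Peak GFLOPS = cores x 2 ops/cycle (FMA) x clock in GHz.
def peak_gflops(cores, ghz, ops_per_cycle=2):
    return cores * ops_per_cycle * ghz

tx1_fp32 = peak_gflops(256, 1.0)    # 512 GFLOPS, matching AnandTech's estimate
tx1_fp16 = tx1_fp32 * 2             # 1024 GFLOPS, i.e. the quoted "1 TFLOP"
tk1_fp32 = peak_gflops(192, 0.85)   # ~326 GFLOPS, NVIDIA's number for the TK1

print(tx1_fp32, tx1_fp16, round(tk1_fp32))
```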
The TX1 will run the Linux For Tegra operating system and supports the usual suspects of CUDA 7.0, cuDNN, and VisionWorks development software as well as the latest OpenGL drivers (OpenGL 4.5, OpenGL ES 3.1, and Vulkan).
NVIDIA is continuing to push for CUDA Everywhere, and the Jetson TX1 looks to be a more mature product that builds on the TK1. The huge leap in compute performance should enable even more interesting projects and bring more sophisticated automation and machine learning to smaller and more intelligent devices.
For those interested, the Jetson TX1 Development Kit (the full I/O development board with bundled module) will be available for pre-order today at $599 while the TX1 module itself will be available soon for approximately $299 each in orders of 1,000 or more (like Intel's tray pricing).
With CUDA 7, it is apparently possible to use the GPU for general purpose processing as well, which may open up some doors that were not possible before in such a small device. I am interested to see what happens with NVIDIA's embedded device play and what kinds of automated hardware end up powered by the tiny SoC and its beefy graphics.
Subject: General Tech, Mobile | March 25, 2014 - 09:34 PM | Tim Verry
Tagged: GTC 2014, tegra k1, nvidia, CUDA, kepler, jetson tk1, development
NVIDIA recently unified its desktop and mobile GPU lineups by moving to a Kepler-based GPU in its latest Tegra K1 mobile SoC. The move to the Kepler architecture has simplified development and enabled the CUDA programming model to run on mobile devices. One of the main points of the opening keynote earlier today was ‘CUDA everywhere,’ and NVIDIA has officially accomplished that goal by having CUDA compatible hardware from servers to desktops to tablets and embedded devices.
Speaking of embedded devices, NVIDIA showed off a new development board called the Jetson TK1. This tiny new board features an NVIDIA Tegra K1 SoC at its heart along with 2GB RAM and 16GB eMMC storage. The Jetson TK1 supports a plethora of I/O options including an internal expansion port (GPIO compatible), SATA, one half-mini PCI-e slot, serial, USB 3.0, micro USB, Gigabit Ethernet, analog audio, and HDMI video outputs.
The Tegra K1 part, of course, pairs a quad core (4+1) ARM CPU with a Kepler-based GPU containing 192 CUDA cores. The SoC is rated at 326 GFLOPS, which enables some interesting compute workloads including machine vision.
In fact, Audi has been utilizing the Jetson TK1 development board to power its self-driving prototype car (more on that soon). Other intended uses for the new development board include robotics, medical devices, security systems, and perhaps low power compute clusters (such as an improved Pedraforca system). It can also be used as a simple desktop platform for testing and developing mobile applications for other Tegra K1 powered devices, of course.
Beyond the hardware, the Jetson TK1 comes with the CUDA toolkit, OpenGL 4.4 driver, and NVIDIA VisionWorks SDK which includes programming libraries and sample code for getting machine vision applications running on the Tegra K1 SoC.
The Jetson TK1 is available for pre-order now at $192 and is slated to begin shipping in April. Interested developers can find more information on the NVIDIA developer website.
Subject: General Tech | March 25, 2014 - 05:46 PM | Tim Verry
Tagged: gtx titan z, gtx titan, GTC 2014, CUDA
During the opening keynote, NVIDIA showed off several pieces of hardware that will be available soon. On the desktop and workstation side of things, researchers (and consumers chasing the ultra high end) have the new GTX Titan Z to look forward to. This new graphics card is a dual GK110 GPU monster that offers up 8 TeraFLOPS of number crunching performance for an equally impressive $2,999 price tag.
Specifically, the GTX TITAN Z is a triple slot graphics card that marries two full GK110 (big Kepler) GPUs for a total of 5,760 CUDA cores, 448 TMUs, and 96 ROPs with 12GB of GDDR5 memory on a 384-bit bus (6GB on a 384-bit bus per GPU). NVIDIA has yet to release clockspeeds, but the two GPUs will run at the same clocks with a dynamic power balancing feature. For the truly adventurous, it appears possible to SLI two GTX Titan Z cards using the single SLI connector. Display outputs include two DVI, one HDMI, and one DisplayPort connector.
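Although NVIDIA has not announced clock speeds, the 8 TFLOPS single-precision figure implies one, since peak FLOPS is cores times two operations per cycle (a fused multiply-add) times clock. Treat this as an inference, not an official spec:

```python
# Inferring the implied GPU clock from the quoted 8 TFLOPS and 5,760 cores.
cores = 5760
peak_gflops = 8000
implied_ghz = peak_gflops / (cores * 2)   # FLOPS = cores x 2 x clock
print(round(implied_ghz, 3))              # ~0.694 GHz, i.e. roughly 700 MHz
```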
NVIDIA is cooling the card using a single fan and two vapor chambers. Air is drawn inwards and exhausted out of the front exhaust vents.
In short, the GTX Titan Z is NVIDIA's new number crunching king and should find its way into servers and workstations running big data analytics and simulations. Personally, I'm looking forward to seeing someone slap two of them into a gaming PC and watching the screen catch on fire (not really).
What do you think about the newest dual GPU flagship?
Stay tuned to PC Perspective for further GTC 2014 coverage!
NVIDIA Finally Gets Serious with Tegra
Tegra has had an interesting run of things. The original Tegra was utilized only by Microsoft with the Zune. Tegra 2 saw better adoption, but did not produce the design wins to propel NVIDIA to a leadership position in cell phones and tablets. Tegra 3 found a spot in Microsoft’s Surface, but that has turned out to be a far more bitter experience than expected. Tegra 4 has so far been integrated into a handful of products and is being featured in NVIDIA’s upcoming Shield product. It also hit some production snags that made it later to market than expected.
I think the primary issue with the first three generations of products is pretty simple. There was a distinct lack of differentiation from the other ARM based products around. Yes, NVIDIA brought their graphics prowess to the market, but never in a form that distanced itself adequately from the competition. Tegra 2 boasted GeForce based graphics, but we did not find out until later that it consisted of basically four pixel shaders and four vertex shaders that had more in common with the GeForce 7800/7900 series than with any of the modern unified architectures of the time. Tegra 3 boasted a big graphical boost, but it was in the form of doubling the pixel shader units and leaving the vertex units alone.
While NVIDIA had very strong developer relations and a leg up on the competition in terms of software support, it was never enough to propel Tegra beyond a handful of devices. NVIDIA is trying to rectify that with Tegra 4 and the 72 shader units that it contains (still divided between pixel and vertex units). Tegra 4 is not perfect in that it is late to market and the GPU is not OpenGL ES 3.0 compliant. ARM, Imagination Technologies, and Qualcomm are offering new graphics processing units that are not only OpenGL ES 3.0 compliant, but also offer OpenCL 1.1 support. Tegra 4 does not support OpenCL. In fact, it does not support NVIDIA’s in-house CUDA. Ouch.
Jumping into a new market is not an easy thing, and invariably mistakes will be made. NVIDIA worked hard to make a solid foundation with their products, and certainly they had to learn to walk before they could run. Unfortunately, running effectively entails having design wins due to outstanding features, performance, and power consumption. NVIDIA was really only average in all of those areas. NVIDIA is hoping to change that. Their first salvo into offering a product that offers features and support that is a step above the competition is what we are talking about today.
Subject: General Tech | April 12, 2013 - 02:08 AM | Tim Verry
Tagged: SECO, nvidia, mini ITX, kepler, kayla, GTC 13, GTC, CUDA, arm
Last month, NVIDIA revealed its Kayla development platform, which combines a quad core Tegra System on a Chip (SoC) with an NVIDIA Kepler GPU. Kayla will be out later this year, but that has not stopped other board makers from putting together their own solutions. One such solution that began shipping earlier this week is the mITX GPU Devkit from SECO.
The new mITX GPU Devkit is a hardware platform for developers to program CUDA applications for mobile devices, desktops, workstations, and HPC servers. It combines a NVIDIA Tegra 3 processor, 2GB of RAM, and 4GB of internal storage (eMMC) on a Qseven module with a Mini-ITX form factor motherboard. Developers can then plug their own CUDA-capable graphics card into the single PCI-E 2.0 x16 slot (which actually runs at x4 speeds). Additional storage can be added via an internal SATA connection, and cameras can be hooked up using the CIC headers.
Rear IO on the mITX GPU Devkit includes:
- 1 x Gigabit Ethernet
- 3 x USB
- 1 x OTG port
- 1 x HDMI
- 1 x Display Port
- 3 x Analog audio
- 2 x Serial
- 1 x SD card slot
The SECO platform is proving to be popular for GPGPU in the server space, especially with systems like Pedraforca. The intention of using these types of platforms in servers is to save power by using a low power ARM chip for inter-node communication and basic tasks while the real computing is done solely on the graphics cards. With Intel’s upcoming Haswell-based Xeon chips getting down to 13W TDPs though, systems like this are going to be more difficult to justify. SECO is mostly positioning this platform as a development board, however. One use in that respect is to begin optimizing GPU-accelerated code for mobile devices. With future Tegra chips set to get CUDA-compatible graphics, new software development and optimization of existing GPGPU code for smartphones and tablets will be increasingly important.
Either way, the SECO mITX GPU Devkit is available now for 349 EUR or approximately $360 (in both cases, before any taxes).
Subject: General Tech | November 12, 2012 - 06:29 AM | Tim Verry
Tagged: tesla, supercomputer, nvidia, k20x, HPC, CUDA, computing
Graphics card manufacturer NVIDIA launched a new Tesla K20X accelerator card today that supplants the existing K20 as the top of the line model. The new card cranks up the double and single precision floating point performance, beefs up the memory capacity and bandwidth, and brings some efficiency improvements to the supercomputer space.
While it is not yet clear how many CUDA cores the K20X has, NVIDIA has stated that it is using the GK110 GPU, and is running with 6GB of memory with 250 GB/s of bandwidth – a nice improvement over the K20’s 5GB at 208 GB/s. Both the new K20X and K20 accelerator cards are based on the company’s Kepler architecture, but NVIDIA has managed to wring out more performance from the K20X. The K20 is rated at 1.17 TFlops peak double precision and 3.52 TFlops peak single precision while the K20X is rated at 1.31 TFlops and 3.95 TFlops.
The K20X manages to score 1.22 TFlops in DGEMM, which makes it almost three times as fast as the previous generation Tesla M2090 accelerator based on the Fermi architecture.
Aside from pure performance, NVIDIA is also touting efficiency gains with the new K20X accelerator card. When two K20X cards are paired with a 2P Sandy Bridge server, NVIDIA claims to achieve 76% efficiency versus 61% efficiency with a 2P Sandy Bridge server equipped with two previous generation M2090 accelerator cards. Additionally, NVIDIA claims to have enabled the Titan supercomputer to reach the #1 spot on the top 500 green supercomputers thanks to its new cards with a rating of 2,120.16 MFLOPS/W (million floating point operations per second per watt).
NVIDIA claims to have already shipped 30 PFLOPS worth of GPU accelerated computing power. Interestingly, most of that computing power is housed in the recently unveiled Titan supercomputer. This supercomputer contains 18,688 Tesla K20X (Kepler GK110) GPUs and 18,688 16-core AMD Opteron 6274 processors (299,008 CPU cores). It will consume 9 megawatts of power and is rated at a peak of 27 Petaflops and 17.59 Petaflops during a sustained Linpack benchmark. Further, when compared to Sandy Bridge processors, the K20 series offers up between 8.2 and 18.1 times more performance in several scientific applications.
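Those Titan figures hang together under a quick sanity check: the GPUs alone account for the bulk of the 27 PF peak, sustained Linpack runs at about 65% of peak, and the Green500 rating implies the power draw during the Linpack run. A rough sketch of the arithmetic:

```python
# Sanity-checking the Titan numbers quoted above.
gpus = 18688
k20x_peak_tflops = 1.31

gpu_peak_pflops = gpus * k20x_peak_tflops / 1000   # ~24.5 PF from GPUs alone
linpack_efficiency = 17.59 / 27                    # sustained vs. peak, ~65%

# Green500 rating implies Linpack power: 17.59 PFLOPS = 1.759e10 MFLOPS,
# divided by 2,120.16 MFLOPS/W gives watts, here converted to megawatts.
linpack_mw = 17.59e9 / 2120.16 / 1e6               # ~8.3 MW during Linpack

print(round(gpu_peak_pflops, 1), round(linpack_efficiency, 2), round(linpack_mw, 1))
```

The ~8.3 MW implied Linpack draw sits just under the 9 MW figure quoted for the system, which is consistent rather than contradictory: Linpack does not necessarily load every component at once.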
While the Tesla cards undoubtedly use more power than CPUs, you need far fewer accelerator cards than processors to hit the same performance numbers. That is where NVIDIA is getting its power efficiency numbers from.
NVIDIA is aiming the accelerator cards at researchers and businesses doing 3D graphics, visual effects, high performance computing, climate modeling, molecular dynamics, earth science, simulations, fluid dynamics, and other such computationally intensive tasks. Using CUDA and the parallel nature of the GPU, the Tesla cards can achieve performance much higher than a CPU-only system can. NVIDIA has also engineered two software features, Hyper-Q and Dynamic Parallelism, to better parallelize workloads and keep the GPU accelerators fed with data.
It is interesting to see NVIDIA bring out a new flagship, especially another GK110 card. Systems using the K20 and the new K20X are available now with cards shipping this week and general availability later this month.
You can find the full press release below and a look at the GK110 GPU in our preview.
AnandTech also managed to get a look inside the Titan supercomputer at Oak Ridge National Laboratory, where you can see the Tesla K20X cards in action.
Subject: General Tech | May 30, 2012 - 12:11 PM | Jeremy Hellstrom
Tagged: CUDA, open source, opengl
Hack a Day linked to a program that could be of great use for anyone who manipulates and processes images, or anyone who wants to be able to make fractals very quickly. Utilizing the OpenGL Shading Language, Reuben Carter developed a command line tool that processes images using NVIDIA GPUs. As we have talked about in the past on PC Perspective, GPUs are much better at this sort of parallel processing than a traditional CPU or the CPU portion of modern processors. Below is one obvious use of this program, the quick creation of complex fractals, but it can also process pre-existing images. Edge detection, colour transforms, and perhaps even image recognition tasks can be completed with his software at a much faster speed than CPU bound image manipulation programs. If you are in that field, or looking to decorate your dorm room, you should grab his software via the GitHub link in the article.
"If you ever need to manipulate images really fast, or just want to make some pretty fractals, [Reuben] has just what you need. He developed a neat command line tool to send code to a graphics card and generate images using pixel shaders. Opposed to making these images with a CPU, a GPU processes every pixel in parallel, making image processing much faster."
Here is some more Tech News from around the web:
- Hard disk drive prices quick to rise, slow to fall @ The Register
- Microsoft's New User Agreement Bans Class Action Lawsuits @ NGOHQ
- AIDA64 v2.50 is released @ FinalWire
Subject: Shows and Expos | May 15, 2012 - 03:43 PM | Jeremy Hellstrom
Tagged: tesla, nvidia, GTC 2012, kepler, CUDA
SAN JOSE, Calif.—GPU Technology Conference—May 15, 2012—NVIDIA today unveiled a new family of Tesla GPUs based on the revolutionary NVIDIA Kepler GPU computing architecture, which makes GPU-accelerated computing easier and more accessible for a broader range of high performance computing (HPC) scientific and technical applications.
The new NVIDIA Tesla K10 and K20 GPUs are computing accelerators built to handle the most complex HPC problems in the world. Designed with an intense focus on high performance and extreme power efficiency, Kepler is three times as efficient as its predecessor, the NVIDIA Fermi architecture, which itself established a new standard for parallel computing when introduced two years ago.
“Fermi was a major step forward in computing,” said Bill Dally, chief scientist and senior vice president of research at NVIDIA. “It established GPU-accelerated computing in the top tier of high performance computing and attracted hundreds of thousands of developers to the GPU computing platform. Kepler will be equally disruptive, establishing GPUs broadly into technical computing, due to their ease of use, broad applicability and efficiency.”
The Tesla K10 and K20 GPUs were introduced at the GPU Technology Conference (GTC), as part of a series of announcements from NVIDIA, all of which can be accessed in the GTC online press room.
NVIDIA developed a set of innovative architectural technologies that make the Kepler GPUs high performing and highly energy efficient, as well as more applicable to a wider set of developers and applications. Among the major innovations are:
- SMX Streaming Multiprocessor – The basic building block of every GPU, the SMX streaming multiprocessor was redesigned from the ground up for high performance and energy efficiency. It delivers up to three times more performance per watt than the Fermi streaming multiprocessor, making it possible to build a supercomputer that delivers one petaflop of computing performance in just 10 server racks. SMX’s energy efficiency was achieved by increasing its number of CUDA architecture cores by four times, while reducing the clock speed of each core, power-gating parts of the GPU when idle and maximizing the GPU area devoted to parallel-processing cores instead of control logic.
- Dynamic Parallelism – This capability enables GPU threads to dynamically spawn new threads, allowing the GPU to adapt dynamically to the data. It greatly simplifies parallel programming, enabling GPU acceleration of a broader set of popular algorithms, such as adaptive mesh refinement, fast multipole methods and multigrid methods.
- Hyper-Q – This enables multiple CPU cores to simultaneously use the CUDA architecture cores on a single Kepler GPU. This dramatically increases GPU utilization, slashing CPU idle times and advancing programmability. Hyper-Q is ideal for cluster applications that use MPI.
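Dynamic Parallelism is easiest to picture with an adaptive algorithm: work subdivides only where the data demands it, and on Kepler each subdivision can be a child kernel launched from the parent kernel with no round trip to the CPU. The sketch below shows that refinement pattern in plain recursive Python (adaptive midpoint integration); it is a conceptual illustration, not CUDA code.

```python
# Adaptive refinement: recurse only where the local error estimate is large.
# On a Kepler GPU, each recursive call below could instead be a child kernel
# launched by the parent kernel via dynamic parallelism.

def adaptive_integrate(f, a, b, tol=1e-8):
    """Integrate f over [a, b] by recursive midpoint refinement."""
    mid = (a + b) / 2
    whole = (b - a) * f(mid)
    halves = (mid - a) * f((a + mid) / 2) + (b - mid) * f((mid + b) / 2)
    if abs(whole - halves) < tol:
        return halves                  # smooth region: stop refining here
    # "Spawn children" for each half, tightening the tolerance as we descend.
    return (adaptive_integrate(f, a, mid, tol / 2) +
            adaptive_integrate(f, mid, b, tol / 2))

# Integrate x^2 over [0, 3]; the exact answer is 9.
result = adaptive_integrate(lambda x: x * x, 0.0, 3.0)
print(round(result, 6))
```

The point is the control flow: the depth of refinement is decided by the data at runtime, which is precisely what GPU threads could not do before they were able to launch new threads themselves.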
“We designed Kepler with an eye towards three things: performance, efficiency and accessibility,” said Jonah Alben, senior vice president of GPU Engineering and principal architect of Kepler at NVIDIA. “It represents an important milestone in GPU-accelerated computing and should foster the next wave of breakthroughs in computational research.”
NVIDIA Tesla K10 and K20 GPUs
The NVIDIA Tesla K10 GPU delivers the world’s highest throughput for signal, image and seismic processing applications. Optimized for customers in oil and gas exploration and the defense industry, a single Tesla K10 accelerator board features two GK104 Kepler GPUs that deliver an aggregate performance of 4.58 teraflops of peak single-precision floating point and 320 GB per second memory bandwidth.
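The 4.58 teraflops aggregate is consistent with two GK104 GPUs, assuming the usual 1,536 CUDA cores per GK104 and a clock of roughly 745 MHz; the clock is an assumption here, since the release quotes only the aggregate figure.

```python
# Peak FLOPS = cores x 2 ops/cycle (FMA) x clock. Core count and clock
# below are assumptions; NVIDIA quotes only the 4.58 TFLOPS aggregate.
cores_per_gpu = 1536
clock_ghz = 0.745
per_gpu_gflops = cores_per_gpu * 2 * clock_ghz   # ~2289 GFLOPS per GK104
board_tflops = 2 * per_gpu_gflops / 1000
print(round(board_tflops, 2))                    # 4.58, matching the quote
```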
The NVIDIA Tesla K20 GPU is the new flagship of the Tesla GPU product family, designed for the most computationally intensive HPC environments. Expected to be the world’s highest-performance, most energy-efficient GPU, the Tesla K20 is planned to be available in the fourth quarter of 2012.
The Tesla K20 is based on the GK110 Kepler GPU. This GPU delivers three times more double precision compared to Fermi architecture-based Tesla products and it supports the Hyper-Q and dynamic parallelism capabilities. The GK110 GPU is expected to be incorporated into the new Titan supercomputer at the Oak Ridge National Laboratory in Tennessee and the Blue Waters system at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign.
“In the two years since Fermi was launched, hybrid computing has become a widely adopted way to achieve higher performance for a number of critical HPC applications,” said Earl C. Joseph, program vice president of High-Performance Computing at IDC. “Over the next two years, we expect that GPUs will be increasingly used to provide higher performance on many applications.”
Preview of CUDA 5 Parallel Programming Platform
In addition to the Kepler architecture, NVIDIA today released a preview of the CUDA 5 parallel programming platform. Available to more than 20,000 members of NVIDIA’s GPU Computing Registered Developer program, the platform will enable developers to begin exploring ways to take advantage of the new Kepler GPUs, including dynamic parallelism.
The CUDA 5 parallel programming model is planned to be widely available in the third quarter of 2012. Developers can get access to the preview release by signing up for the GPU Computing Registered Developer program on the CUDA website.
Subject: General Tech, Graphics Cards | January 29, 2012 - 02:53 AM | Scott Michaud
Tagged: nvidia, gpgpu, CUDA
NVIDIA has traditionally been very interested in carving out room in the market for high-performance computing for scientific research. For a lot of functions, having a fast and highly parallel processor saves time and money compared to having a traditional computer crunch away or having to book time with one of the world’s relatively few supercomputers. Despite the raw performance of a GPU, adequate development tools are required to bring the simulation or calculation into a functional program to execute on said GPU. NVIDIA has had a strong lead with their CUDA platform for quite some time; that lead will likely continue with releases the size of this one.
What does a tuned up GPU purr like? Cuda cuda cuda cuda cuda.
The most recent release, CUDA 4.1, has three main features:
- A visual profiler to point out common mistakes and optimizations and to provide instructions which detail how to alter your code to increase your performance
- A new compiler which is based on the LLVM infrastructure, making good on their promise to open the CUDA platform to other architectures -- both software and hardware
- New image and signal processing functions for their NVIDIA Performance Primitives (NPP) library, relieving developers of the need to create their own versions or license a proprietary library
The three features, as NVIDIA describes them in their press release, are listed below.
New Visual Profiler - Easiest path to performance optimization
The new Visual Profiler makes it easy for developers at all experience levels to optimize their code for maximum performance. Featuring automated performance analysis and an expert guidance system that delivers step-by-step optimization suggestions, the Visual Profiler identifies application performance bottlenecks and recommends actions, with links to the optimization guides. Using the new Visual Profiler, performance bottlenecks are easily identified and actionable.
LLVM Compiler - Instant 10 percent increase in application performance
LLVM is a widely-used open-source compiler infrastructure featuring a modular design that makes it easy to add support for new programming languages and processor architectures. Using the new LLVM-based CUDA compiler, developers can achieve up to 10 percent additional performance gains on existing GPU-accelerated applications with a simple recompile. In addition, LLVM's modular design allows third-party software tool developers to provide a custom LLVM solution for non-NVIDIA processor architectures, enabling CUDA applications to run across NVIDIA GPUs, as well as those from other vendors.
New Image, Signal Processing Library Functions - "Drop-in" Acceleration with NPP Library
NVIDIA has doubled the size of its NPP library, with the addition of hundreds of new image and signal processing functions. This enables virtually any developer using image or signal processing algorithms to easily gain the benefit of GPU acceleration, with the simple addition of library calls into their application. The updated NPP library can be used for a wide variety of image and signal processing algorithms, ranging from basic filtering to advanced workflows.
Subject: General Tech, Graphics Cards, Processors | December 20, 2011 - 04:34 AM | Scott Michaud
Tagged: nvidia, CUDA, CARMA, capital letters, arm
Okay, so the pun was a little obvious, but NVIDIA has just announced the specifications and name for the development kit used to develop for their ARM-based GPU computing platform. The development kit will provide a method to build and test applications on a platform similar to what will be found in the Barcelona Supercomputing Centre’s upcoming GPU supercomputer until you are ready to deploy the finished application with real data on the real machine. Such is the life of a development unit.
Carma: What goes around, comes around... right Intel?
The development kit is quite modest in its specifications:
- Tegra3 ARM A9 CPU
- Quadro 1000M GPU (96 CUDA Cores)
- 2GB system RAM, 2GB GPU RAM
- 4x PCIe Gen1 CPU to GPU link
- 1000Base-T networking support
- SATA, HDMI, DisplayPort, USB.
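The Gen1 x4 CPU-to-GPU link in that spec list is the kit's most notable constraint: PCIe Gen1 signals at 2.5 GT/s per lane with 8b/10b line coding, so each lane carries 250 MB/s of payload per direction. A quick check of what that means for the Quadro 1000M:

```python
# PCIe Gen1 payload bandwidth: 2.5 GT/s per lane, 8b/10b encoding means
# only 8 of every 10 bits are payload, so 2.0 Gb/s = 250 MB/s per lane.
lanes = 4
gt_per_s = 2.5
payload_fraction = 8 / 10
mb_per_lane = gt_per_s * payload_fraction * 1000 / 8   # 250 MB/s per lane
link_mb = lanes * mb_per_lane
print(link_mb)                                         # 1000 MB/s, ~1 GB/s each way
```

Roughly 1 GB/s each way is a fraction of what the GPU's own memory provides, which reinforces the platform's positioning: fine for developing and testing CUDA code, but CPU-GPU transfers should be minimized, just as they would be on the production supercomputer.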