Subject: General Tech, Graphics Cards, Processors | July 19, 2014 - 03:05 AM | Scott Michaud
Tagged: Xeon Phi, xeon, Intel, avx-512, avx
It is difficult to know what is actually new information in this Intel blog post, but it is interesting nonetheless. Its topic is the AVX-512 extension to x86, designed for Xeon and Xeon Phi processors and co-processors. Basically, last year, Intel announced "Foundation", the minimum support level for AVX-512, as well as Conflict Detection, Exponential and Reciprocal, and Prefetch, which are optional. This earlier blog post was very much focused on Xeon Phi, but it acknowledged that the instructions will make their way to standard, CPU-like Xeons at around the same time.
This year's blog post brings in a bit more information, especially for common Xeons. While all AVX-512-supporting processors (and co-processors) will support "AVX-512 Foundation", the instruction set extensions are a bit more scattered.
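| Instruction Group | Xeon Processors | Xeon Phi Processors | Xeon Phi Co-processors |
|---|---|---|---|
| Conflict Detection Instructions | Yes | Yes | Yes |
| Exponential and Reciprocal Instructions | No | Yes | Yes |
| Byte and Word Instructions | Yes | No | No |
| Doubleword and Quadword Instructions | Yes | No | No |
| Vector Length Extensions | Yes | No | No |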
Source: Intel AVX-512 Blog Post (and my understanding thereof).
So why do we care? Simply put: speed. Vectorization, the purpose of AVX-512, has similar benefits to multiple cores. It is not as flexible as having multiple, unique, independent cores, but it is easier to implement (and works just fine with having multiple cores, too). For example: imagine that you have to multiply two colors together. The direct way to do it is to multiply red with red, green with green, blue with blue, and alpha with alpha. AMD's 3DNow! and, later, Intel's SSE included instructions to multiply two four-component vectors together. This reduces four similar instructions into a single operation between wider registers.
Smart compilers (and programmers, although that is becoming less common as compilers are pretty good, especially when they are not fighting developers) are able to pack seemingly unrelated data together, too, if they undergo similar instructions. AVX-512 allows for sixteen 32-bit pieces of data to be worked on at the same time. If your pixel only has four single-precision RGBA data values, but you are looping through 2 million pixels, do four pixels at a time (16 components).
For the record, I basically just described "SIMD" (single instruction, multiple data) as a whole.
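As a rough illustration of the packing described above, here is a minimal Python sketch that treats a 512-bit register as 16 lanes of 32-bit floats and counts how many wide multiplies a batch of RGBA pixels would need. It only models the idea; Python does not emit AVX-512 instructions, and the names `LANES`, `multiply_colors`, and `vector_op_count` are made up for this example.

```python
# Sketch only: models a 512-bit register as 16 lanes of 32-bit floats.
# LANES, multiply_colors and vector_op_count are hypothetical names.
LANES = 512 // 32  # sixteen single-precision values per register


def multiply_colors(a, b):
    """Multiply two flat RGBA component lists, one 16-lane chunk at a time."""
    if len(a) != len(b):
        raise ValueError("both inputs need the same number of components")
    out = []
    vector_ops = 0
    for start in range(0, len(a), LANES):
        chunk_a = a[start:start + LANES]
        chunk_b = b[start:start + LANES]
        # On SIMD hardware, this whole chunk would be a single multiply.
        out.extend(x * y for x, y in zip(chunk_a, chunk_b))
        vector_ops += 1
    return out, vector_ops


def vector_op_count(pixels, components_per_pixel=4):
    """Number of 16-lane multiplies needed, rounding the last chunk up."""
    total = pixels * components_per_pixel
    return -(-total // LANES)  # ceiling division


# Eight RGBA pixels (32 components) fit in two 16-lane multiplies.
colors_a = [0.5, 0.25, 1.0, 1.0] * 8
colors_b = [2.0, 2.0, 0.5, 1.0] * 8
product, ops = multiply_colors(colors_a, colors_b)
print(ops, vector_op_count(2_000_000))
```

Under this model, the 2 million pixel loop shrinks from 8 million scalar multiplies to 500,000 wide ones, which is where the speed-up comes from.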
This theory is part of how GPUs became so powerful at certain tasks. They are capable of pushing a lot of data because they can exploit similarities. If your task is full of similar problems, they can just churn through tonnes of data. CPUs have been doing these tricks, too, just without compromising what they do well.
Subject: General Tech | July 3, 2014 - 03:17 PM | Ken Addison
Tagged: podcast, video, evga, TORQ X10, Samsung, 850 PRO, ocz, RevoDrive 350, Silverstone, Nightjar, knights landing, Xeon Phi
PC Perspective Podcast #307 - 07/03/2014
Join us this week as we discuss the EVGA Torq X10 Mouse, Samsung 850 Pro, OCZ RevoDrive 350 and more!
The URL for the podcast is: http://pcper.com/podcast - Share with your friends!
- iTunes - Subscribe to the podcast directly through the Store
- RSS - Subscribe through your regular RSS reader
- MP3 - Direct download link to the MP3 file
Hosts: Ryan Shrout, Josh Walrath, Jeremy Hellstrom, and Morry Teitelman
Subject: General Tech, Graphics Cards, Processors | July 2, 2014 - 03:55 AM | Scott Michaud
Tagged: Intel, Xeon Phi, xeon, silvermont, 14nm
AnandTech has just published a large editorial detailing Intel's Knights Landing. Mostly, it is stuff that we already knew from previous announcements and leaks, such as one by VR-Zone from last November (which we reported on). Officially, few details were given back then, except that it would be available as either a PCIe-based add-in board or as a socketed, bootable, x86-compatible processor based on the Silvermont architecture. Its many cores, threads, and 512-bit registers are each pretty weak, compared to Haswell, for instance, but combine to about 3 TFLOPs of double precision performance.
Not enough graphs. Could use another 256...
The best way to imagine it is running a PC with a modern, Silvermont-based Atom processor -- only with up to 288 processors listed in your Task Manager (72 actual cores with quad HyperThreading).
The main limitation of GPUs (and similar coprocessors), however, is memory bandwidth. GDDR5 is often the main bottleneck of compute performance and just about the first thing to be optimized. To compensate, Intel is packaging up to 16GB of memory (stacked DRAM) on the chip itself. This RAM is based on "Hybrid Memory Cube" (HMC), developed by Micron Technology, and supported by the Hybrid Memory Cube Consortium (HMCC). While the actual memory used in Knights Landing is derived from HMC, it uses a proprietary interface that is customized for Knights Landing. Its bandwidth is rated at around 500GB/s. For comparison, the NVIDIA GeForce Titan Black has 336.4GB/s of memory bandwidth.
Intel and Micron have worked together in the past. In 2006, the two companies formed "IM Flash" to produce the NAND flash for Intel and Crucial SSDs. Crucial is Micron's consumer-facing brand.
So the vision for Knights Landing seems to be the bridge between CPU-like architectures and GPU-like ones. For compute tasks, GPUs edge out CPUs by crunching through bundles of similar tasks at the same time, across many (hundreds of, thousands of) computing units. The difference with (at least socketed) Xeon Phi processors is that, unlike most GPUs, Intel does not rely upon APIs, such as OpenCL, and drivers to translate a handful of functions into bundles of GPU-specific machine language. Instead, especially if the Xeon Phi is your system's main processor, it will run standard, x86-based software. The software will just run slowly, unless it is capable of vectorizing itself and splitting across multiple threads. Obviously, OpenCL (and other APIs) would make this parallelization easy, by their host/kernel design, but it is apparently not required.
It is a cool way that Intel arrives at the same goal, based on their background. Especially when you mix-and-match Xeons and Xeon Phis on the same computer, it is a push toward heterogeneous computing -- with a lot of specialized threads backing up a handful of strong ones. I just wonder if providing a more-direct method of programming will really help developers finally adopt massively parallel coding practices.
I mean, without even considering GPU compute, how efficient is most software at splitting into even two threads? Four threads? Eight threads? Can this help drive heterogeneous development? Or will this product simply try to appeal to those who are already considering it?
Subject: General Tech, Graphics Cards, Processors | November 28, 2013 - 03:30 AM | Scott Michaud
Tagged: Intel, Xeon Phi, gpgpu
Intel was testing the waters with their Xeon Phi co-processor. Based on the architecture designed for the original Pentium processors, it was released in six products ranging from 57 to 61 cores and 6 to 16GB of RAM. This led to double precision performance of between 1 and 1.2 TFLOPs. It was fabricated using their 22nm tri-gate technology. All of this was under the Knights Corner initiative.
In 2015, Intel plans to have Knights Landing ready for consumption. A modified Silvermont architecture will replace the many simple (basically 15-year-old) cores of the previous generation; up to 72 Silvermont-based cores (each with 4 threads) in fact. It will introduce the AVX-512 instruction set. AVX-512 allows applications to vectorize 8 64-bit (double-precision float or long integer) or 16 32-bit (single-precision float or standard integer) values.
In other words, packing a bunch of related problems into a single instruction.
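The lane counts follow directly from dividing the register width by the element width. The short Python sketch below shows that arithmetic. The 128-bit and 256-bit widths for SSE and AVX are added for comparison and are not from the article, and `lanes` is a made-up helper name.

```python
# Register widths in bits; the SSE and AVX entries are added for comparison.
REGISTER_BITS = {"SSE": 128, "AVX": 256, "AVX-512": 512}


def lanes(register_bits, element_bits):
    """How many elements of a given width fit in one vector register."""
    return register_bits // element_bits


for name, bits in REGISTER_BITS.items():
    print(f"{name}: {lanes(bits, 64)} x 64-bit or {lanes(bits, 32)} x 32-bit")
```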
The most interesting part? Two versions will be offered: Add-In Boards (AIBs) and a standalone CPU. It will not require a host CPU, because of its x86 heritage, if your application is entirely suited for an MIC architecture; unlike a Tesla, it is bootable with existing and common OSes. It can also be paired with standard Xeon processors if you would like a few strong threads with the 288 (72 x 4) the Xeon Phi provides.
And, while I doubt Intel would want to cut anyone else in, VR-Zone notes that this opens the door for AIB partners to make non-reference cards and manage some level of customer support. I'll believe a non-Intel branded AIB only when I see it.
Subject: General Tech | November 20, 2013 - 12:53 PM | Jeremy Hellstrom
Tagged: Xeon Phi, knights landing, Intel, 14nm
Intel has been talking up the Xeon Phi, first of the Knights Landing chips, which shall arrive in the not too distant future. This new architecture is touted to bring a return of homogeneous systems architecture, performing parallel processing on its many cores (61 is the number currently being tossed around) at a level of performance that will exceed the GPU-accelerated heterogeneous architecture being pushed by AMD and NVIDIA. Whether this is true or not remains to be seen, but many server builders may prefer the familiar CPU-only architecture, and as at least some of the Phis will be available in rack-mounted form and not just as add-in cards, they may choose Intel out of habit. You can also read about Micron's Automata Processor, which The Register reports can outperform a 48-chip cluster of Intel Xeon 5650s in certain scenarios.
"From Intel's point of view, today's hottest trend in high-performance computing – GPU acceleration – is just a phase, one that will be superseded by the advent of many-core CPUs, beginning with Chipzilla's next-generation Xeon Phi, codenamed "Knights Landing"."
Here is some more Tech News from around the web:
- AMD makes more profit than Sony on the PS4 @ The Inquirer
- Decades ago, computing was saved by CMOS. Today, no hero is in sight @ The Register
- Who wants 10TB of FREE cloud storage? Hands down if China is a deal breaker? @ The Register
- Benchmarking Amazon's New EC2 "C3" Instance Types @ Phoronix
- What took you 2 years? LSI finally rolls out next-gen SandForce kit @ The Register
- Beginners Guide: How To Install / Remove an Intel Socket LGA2011 CPU @ PCSTATS
- Hardware.Info UK Awards 2013: vote and win a Zotac GeForce GTX 780 AMP! Edition
Subject: General Tech, Systems | September 9, 2013 - 09:00 AM | Tim Verry
Tagged: Xeon Phi, workstation, quadro, micron, LSI, k6000, Ivy Bridge-EP, firepro, dell
Along with the release of new mobile workstations, Dell announced three new desktop workstations. Specifically, Dell is launching the T3610, T5610, and T7610 PC workstations under its Precision series. The new systems reside in redesigned cases with improved cable management, removable power supplies (tool-less, removable by sliding out from rear panel), and in the case of the T7610 removable hard drives. All of the new Precision workstations have been outfitted with Intel's latest Ivy Bridge-EP based Xeon processors, ECC memory, workstation-class graphics cards from AMD and NVIDIA, Xeon Phi accelerator card options, LSI hardware RAID controllers, and updated software solutions from Intel and Dell.
The new Precision workstations side-by-side. From left to right: T3610, T5610, and T7610.
Dell's Precision T3610 is the mid-tower system of the group, powered by single-socket Xeon E5-2600 v2 hardware that further supports up to 128GB DDR3 ECC memory, two graphics cards, three 3.5” hard drives, and four 2.5” SSDs.
The Precision T3610, a new single socket, mid-range workstation.
The Precision T5610 ups the ante to a dual socket IVB-EP processor system that can be configured with up to 128GB DDR3 ECC memory, two AMD FirePro or NVIDIA Quadro (e.g. Quadro K5000) graphics cards, a Tesla K20C accelerator card, three 3.5” hard drives, and four 2.5” solid state drives.
Finally, the T7610 workstation supports dual Intel Ivy Bridge-EP Xeon E5-2600 v2 series processors (up to 24 cores per system), up to 512GB DDR3 ECC memory, three graphics cards (including two NVIDIA Quadro K6000 cards), four 3.5” hard drives, and eight 2.5” SSDs.
Dell's Precision T5610 dual socket workstation.
The new Precision workstations can also be configured with an Intel Xeon Phi 3120A accelerator card in lieu of a Tesla card. The choice will mainly depend on the applications being used and the development resources and expertise available. Both options are designed to accelerate highly parallel workloads in applications that have been compiled to support them. Further, users can add an LSI hardware RAID card with 1GB of onboard memory to the systems. Dell further offers a Micron P320h PCI-E SSD that, while not bootable, offers up 350GB of high performance storage that excels at high sequential reads and writes.
On the software front, Dell is including the Dell Precision Performance Optimizer and the Intel Cache Acceleration Software. The former automatically configures and optimizes the workstation for specific applications based on profiles that are reportedly regularly updated. The other bit of software works to optimize systems that use both hard drives and SSDs with the SSDs as a cache for the mechanical storage. The Intel Cache Acceleration Software configures the caching algorithms to favor caching very large files on the solid state storage. It is a different approach to consumer caching strategies, but one that works well with businesses that use these workstations to process large data sets.
The Dell Precision T7610 workstation.
The Dell workstations are aimed at businesses doing scientific analysis, professional engineering, and complex 3D modeling. The T7610 in particular is aimed at the oil and gas industry for use in simulations and modeling as companies search for new oil deposits.
All three systems will be available for purchase worldwide beginning September 12th. Some of the options, such as 512GB of ECC memory and the NVIDIA Quadro K6000 on the T7610, will not be available until next month, however. The T3610 has a starting price of $1,099, while the T5610 and T7610 have starting prices of $2,729 and $3,059 respectively.
What are your thoughts on Dell's new mid-tower workstations?
Subject: General Tech, Processors, Systems | June 26, 2013 - 10:27 PM | Scott Michaud
Tagged: supercomputing, supercomputer, titan, Xeon Phi
The National Supercomputer Center in Guangzhou, China, will host the world's fastest supercomputer by the end of the year. The Tianhe-2 (in English, "Milky Way-2") is capable of nearly double the floating-point performance of Titan, albeit with slightly less performance per watt. The Tianhe-2 was developed by China's National University of Defense Technology.
Photo Credit: Top500.org
Comparing the new fastest computer with the former, China's Milky Way-2 is able to achieve 33.8627 PetaFLOPs of calculations from 17.808 MW of electricity. The Titan, on the other hand, is able to crunch 17.590 PetaFLOPs with a draw of just 8.209 MW. As such, the new Milky Way-2 uses 12.7% more power per FLOP than Titan.
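The power-per-FLOP comparison can be reproduced from the four figures quoted above. This is a small Python sketch of that arithmetic; the function and variable names are made up for this example.

```python
def megawatts_per_petaflop(petaflops, megawatts):
    """Power drawn per unit of Linpack performance."""
    return megawatts / petaflops


# Figures as quoted in the post.
tianhe2 = megawatts_per_petaflop(petaflops=33.8627, megawatts=17.808)
titan = megawatts_per_petaflop(petaflops=17.590, megawatts=8.209)

extra_power_percent = (tianhe2 / titan - 1) * 100
speed_ratio = 33.8627 / 17.590

print(f"Tianhe-2: {tianhe2:.4f} MW/PFLOP, Titan: {titan:.4f} MW/PFLOP")
print(f"Extra power per FLOP: {extra_power_percent:.1f}%")
print(f"Performance ratio: {speed_ratio:.2f}x")
```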
Titan is famously based on the Kepler GPU architecture from NVIDIA, coupled with several 16-core AMD Opteron server processors clocked at 2.2 GHz. This concept of using accelerated hardware carried over into the design of Tianhe-2, which is based around Intel's Xeon Phi coprocessor. If you include the simplified co-processor cores of the Xeon Phi, the new champion is the sum of 3.12 million x86 cores and 1024 terabytes of memory.
... but will it run Crysis?
... if someone gets around to emulating DirectX in software, it very well could.
Subject: Systems | June 3, 2013 - 09:27 PM | Tim Verry
Tagged: Xeon Phi, tianhe-2, supercomputer, Ivy Bridge, HPC, China
A powerful new supercomputer constructed by Chinese company Inspur is currently in testing at the National University of Defense Technology. Called the Tianhe-2, the new supercomputer has 16,000 compute nodes and approximately 54 Petaflops of peak theoretical compute performance.
Destined for the National Supercomputer Center in Guangzhou, China, the open HPC platform will be used for education and research projects. The Tianhe-2 is composed of 125 racks with 128 compute nodes in each rack.
The compute nodes are broken down into two types: CPM and APU modules. One of each node type makes up a single compute board. The CPM module hosts four Intel Ivy Bridge processors, 128GB system memory, and a single Intel Xeon Phi accelerator card with 8GB of its own memory. Each APU module adds five Xeon Phi cards to every compute board. The compute boards (a CPM module + an APU module) contain two NICs that connect the various compute boards with Inspur's custom THExpress2 high bandwidth interconnects. Finally, the Tianhe-2 supercomputer will have access to 12.4 Petabytes of storage that is shared across all of the compute boards.
In all, the Tianhe-2 is powered by 32,000 Intel Ivy Bridge processors, 1.024 Petabytes of system memory (not counting Phi dedicated memory, which would make the total 1.408 PB), and 48,000 Intel Xeon Phi MIC (Many Integrated Cores) cards. That is a total of 3,120,000 processor cores (though keep in mind that number is primarily made up of the relatively simple individual Phi cores, as there are 57 cores to each Phi card).
Inspur claims up to 3.432 TFlops of peak compute performance per compute node (which, for simplicity they break down as one node is 2 Ivy Bridge chips, 64GB memory, and 3 Xeon Phi cards although the two compute modules that make up a node are not physically laid out that way) for a total theoretical potential compute power of 54,912 TFlops (or 54.912 Petaflops) across the entire supercomputer. In the latest Linpack benchmark run, researchers saw up to 63% efficiency in attaining peak performance -- 30.65 PFlops out of 49.19 PFlops peak/theoretical performance -- when only using 14,336 nodes with 50GB RAM each. Further testing and optimization should improve that number, and when all nodes are brought online the real world performance will naturally be higher than the current benchmarks. With that said, the Tianhe-2 is already besting Cray's TITAN, which is promising (though I hope Cray comes back next year and takes the crown again, heh).
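The headline totals can be cross-checked from the per-rack and per-node figures quoted in this post. The Python sketch below does so using Inspur's simplified node definition (2 Ivy Bridge chips, 64GB, 3 Xeon Phi cards) and assumes decimal units (1 PB = 1,000,000 GB). The number of cores per Ivy Bridge chip is not stated above, so the sketch derives it from the remaining core count instead of assuming it. All variable names are made up for this example.

```python
RACKS = 125
NODES_PER_RACK = 128
nodes = RACKS * NODES_PER_RACK

# Inspur's simplified per-node breakdown, as quoted above.
cpus = nodes * 2
phi_cards = nodes * 3
system_memory_gb = nodes * 64
phi_memory_gb = phi_cards * 8  # 8GB on each Xeon Phi card
total_memory_gb = system_memory_gb + phi_memory_gb

# Core counts: 57 cores per Phi card; the rest must be Ivy Bridge cores.
TOTAL_CORES = 3_120_000
phi_cores = phi_cards * 57
cpu_cores = TOTAL_CORES - phi_cores
cores_per_cpu = cpu_cores // cpus  # derived, not quoted in the post

# Peak performance and the quoted Linpack run.
peak_tflops = nodes * 3.432
linpack_efficiency = 30.65 / 49.19

print(f"{nodes} nodes, {cpus} CPUs, {phi_cards} Phi cards")
print(f"Memory: {total_memory_gb / 1e6:.3f} PB including Phi memory")
print(f"Implied cores per Ivy Bridge chip: {cores_per_cpu}")
print(f"Peak: {peak_tflops / 1000:.3f} PFlops")
print(f"Linpack efficiency: {linpack_efficiency:.1%}")
```

The quoted Linpack figures work out to a little over 62% of peak, in line with the roughly 63% cited.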
In order to keep all of this hardware cool, Inspur is planning a custom liquid cooling system using chilled water. The Tianhe-2 will draw up to 17.6 MW of power under load. Once the liquid cooling system is implemented the supercomputer will draw 24MW while under load.
This is an impressive system, and an interesting take on a supercomputer architecture considering the rise in popularity of heterogeneous architectures that pair massive numbers of CPUs with graphics processing units (GPUs).
The Tianhe-2 supercomputer will be reconstructed at its permanent home at the National Supercomputer Center in Guangzhou, China once the testing phase is finished. It will be one of the top supercomputers in the world once it is fully online! HPC Wire has a nice article with slides and further details on the upcoming processing powerhouse that is worth a read if you are into this sort of HPC stuff.
Also read: Cray unveils the TITAN supercomputer.
Subject: General Tech, Graphics Cards | November 13, 2012 - 01:17 PM | Jeremy Hellstrom
Tagged: amd, Intel, firepro, firepro s10000, HPC, Xeon Phi, 3120A, 5110P, Knight's Corner
AMD's new Tahiti-based FirePro S10000 sports a little more than just a GPU upgrade; it sports two, as this is a dual-GPU card. According to The Register, it should run about $3,600 and need 375W to perform, numbers which make it a more efficient card than the S9000 even though it needs significantly more cash and power to run. It is a two-slot card, a necessity in the server and workstation world, and while it does not support CrossFire, it does support Eyefinity with its DVI port and four Mini DisplayPorts.
The Register also got some news about Xeon Phi, Intel's answer to the HPC cards on offer from AMD and NVIDIA. Knights Corner is the evolution of Larrabee into an actual product, in this case two 62-core cards, though not all of the cores are active. The passively cooled 5110P has 60 cores running at 1.053GHz, while the 3120A has 57 cores clocked slightly higher at 1.1GHz and sports a fan. Both cards produce just over a teraflop of double precision floating point math, compared to the 1.48 teraflops offered by AMD's S10000 or the 1.3 offered by the Tesla K20x. Check out more on these coprocessors at The Register.
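The "just over a teraflop" figure is consistent with a simple peak estimate of cores times clock times operations per cycle. The Python sketch below assumes each core can retire 16 double-precision operations per cycle (eight 64-bit lanes in a 512-bit vector unit, with a fused multiply-add counted as two operations). That per-cycle figure is an assumption of this example, not something stated in the post, and the names are made up.

```python
# Assumption (not from the post): 8 double-precision lanes per 512-bit
# vector unit, with a fused multiply-add counted as two operations.
DP_FLOPS_PER_CYCLE = 8 * 2


def peak_dp_gflops(active_cores, clock_ghz):
    """Theoretical peak double-precision GFLOPS for one card."""
    return active_cores * clock_ghz * DP_FLOPS_PER_CYCLE


# Active core counts and clocks as quoted above.
cards = {"5110P": (60, 1.053), "3120A": (57, 1.1)}
peaks = {name: peak_dp_gflops(cores, ghz) for name, (cores, ghz) in cards.items()}

for name, gflops in peaks.items():
    print(f"{name}: {gflops / 1000:.3f} TFLOPS peak double precision")
```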
"With the FirePro S10000, not only is the GPU geared down to 825MHz, but the memory is similarly downshifted to 5GHz. The memory interface is 384-bit wide on each GPU, with two blocks of GDDR5 memory yielding a total of 6GB. (This could be a little skinny on the memory for some HPC workloads, given that the S9000 card has 6GB of memory for one Tahiti GPU.) Each GPU can access 240GB/sec of memory bandwidth linking to each 3GB chunk of GDDR5 memory.
Because the card is double-stuffed, it can deliver a very impressive 5.91 teraflops SP and 1.48 teraflops DP in peak floating point oomph."
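The 240GB/sec figure in the quote follows from the bus width and the memory speed. This Python sketch assumes the quoted "5GHz" is the effective per-pin data rate (5 Gbit/s), the usual way GDDR5 speeds are stated. The function name is made up for this example.

```python
def memory_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak bandwidth: bytes moved per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_gbps


# Per-GPU figures from the quote: 384-bit interface, 5GHz effective memory.
per_gpu = memory_bandwidth_gb_s(bus_width_bits=384, data_rate_gbps=5.0)
card_total = per_gpu * 2  # two Tahiti GPUs, each with its own 3GB of GDDR5

print(f"Per GPU: {per_gpu:.0f} GB/s, both GPUs combined: {card_total:.0f} GB/s")
```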
Here is some more Tech News from around the web:
- The TR Podcast 123: Incremental improvements
- Microsoft Makes Direct X 11.1 a Windows 8 Exclusive @ Slashdot
- Random Linux Commands to Make Google Talk, Fix Wifi, Find Duplicate Files, and More @ Linux.com
- Microsoft Surface RT may only achieve 60% of forecasted sales @ DigiTimes
- Windows chief Steven Sinofsky leaves Microsoft @ The Inquirer
- Fedora 'Spherical Cow' delayed by bugs, Secure Boot @ The Register
- Microsoft rolls out always-on Skype for Windows Phone 8 @ The Register
- Gaming in Windows 8 vs Windows 7: what's the difference in performance? @ Hardware.info
- Windows 7 vs Windows 8 – The Definitive Performance Guide @ hardCOREware
- How to Change the Start Screen Background in Windows 8 @ TechSpot
- TP-Link TL-WDR3600 and WDR4300 review: two shades of black @ Hardware.info
- Win 1 El'Druin ARPG Gaming Mouse, 2 Hellion Gaming Mice and 1 Aegis Gaming Pad @ NikKTech
Subject: General Tech | September 5, 2012 - 03:49 PM | Jeremy Hellstrom
Tagged: Xeon Phi, xeon, larrabee, knights corner, Intel, hot chips
The Register is back with more information from Hot Chips about Intel's Xeon Phi coprocessor, which seems to be much more than just a GPU in drag. Inside the shell you will find at least 50 cores and at least 8GB of GDDR5 graphics memory, with the cores being very heavily modified Pentium P54C designs built on the 22-nanometer Tri-Gate process and clocked somewhere between 1.2 and 1.6GHz. There is a brand new Vector Processing Unit which processes 512-bit SIMD instructions and sports an Extended Math Unit to handle calculations in hardware, not software. Read on for more details about the high-speed ring interconnects that allow these chips to communicate among themselves and with the Xeon server it will be a part of.
"Intel has been showing off the performance of the "Knights Corner" x86-based coprocessor for so long that it's easy to forget that it is not yet a product you can actually buy. Back in June, Knights Corner was branded as the "Xeon Phi", making it clear that Phi was a Xeon coprocessor even if it does not bear a lot of resemblance to the Xeon processors at the heart of the vast majority of the world's servers."
Here is some more Tech News from around the web:
- Intel announces two software development suites @ The Inquirer
- Samsung Windows 8 notebook remarkably similar to Asustek Taichi @ DigiTimes
- ZTE plans for 11in 2560x1600 tablet @ The Inquirer
- Acer to launch Windows Phone 8 smartphone in 2013 @ The Register
- Belkin 7-port USB 2.0 Hub F5U307-BRN Review @ PCSTATS
- BIOSTAR Joint Contest @ NikKTech