IBM Prepares Power9 CPUs to Power Servers and Supercomputers In 2018

Subject: Processors | September 2, 2016 - 05:39 AM |
Tagged: IBM, power9, power 3.0, 14nm, global foundries, hot chips

Earlier this month at the Hot Chips symposium, IBM revealed details on its upcoming Power9 processors and architecture. The new chips are aimed squarely at the data center and will be used for massive number crunching in big data and scientific applications in servers and supercomputer nodes.

Power9 is a big play from Big Blue, and will help the company expand its presence in the Intel-ruled datacenter market. Power9 processors are due out in 2018 and will be fabricated at Global Foundries on a 14nm HP FinFET process. The chips feature eight billion transistors and utilize an “execution slice microarchitecture” that lets IBM combine “slices” of fixed point, floating point, and SIMD hardware into cores that support various levels of threading. Specifically, 2 slices make an SMT4 core and 4 slices make an SMT8 core. IBM will have Power9 processors with 24 SMT4 cores or 12 SMT8 cores (more on that later). Further, Power9 is IBM’s first processor to support its Power ISA 3.0 instruction set.

According to IBM, its Power9 processors are between 50% and 125% faster than the previous generation Power8 CPUs depending on the application tested. The performance improvement is thanks to a doubling of the number of cores as well as a number of other smaller improvements including:

  • A 5 cycle shorter pipeline versus Power8
  • A single instruction random number generator (RNG)
  • Hardware assisted garbage collection for interpreted languages (e.g. Java)
  • New interrupt architecture
  • 128-bit quad precision floating point and decimal math support
    • Important for finance and security markets, massive databases and money math.
    • IEEE 754
  • CAPI 2.0 and NVLink support
  • Hardware accelerators for encryption and compression

The Power9 processor features 120 MB of direct attached eDRAM that acts as an L3 cache (256 GB/s). The chips offer up to 7TB/s of aggregate fabric bandwidth, which certainly sounds impressive, but that is a number with everything added together. With that said, there is a lot going on under the hood. Power9 supports 48 lanes of PCI-E 4.0 (2 GB/s per lane per direction) and 48 proprietary 25Gbps accelerator lanes. The latter will be used for NVLink 2.0 to connect to NVIDIA GPUs, as well as to connect to FPGAs, ASICs, and other accelerators or new memory technologies using CAPI 2.0 (Coherent Accelerator Processor Interface). There are also four 16Gbps SMP links (NUMA) used to combine four quad socket Power9 boards into a single 16 socket “cluster.”
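To put the lane counts above into perspective, here is a quick back-of-the-envelope sketch in Python. It only multiplies out the figures quoted in the paragraph; treating 25Gbps as a raw per-lane, per-direction rate and ignoring encoding overhead are my assumptions, not something IBM stated.

```python
# Per-direction bandwidth of the two 48-lane I/O groups described above.
# Lane counts and per-lane rates come from the article; protocol and
# encoding overhead are ignored (an assumption for illustration only).

PCIE4_LANES = 48
PCIE4_GBYTES_PER_LANE = 2.0   # GB/s per lane per direction

ACCEL_LANES = 48
ACCEL_GBITS_PER_LANE = 25.0   # Gbps per lane, assumed raw and per direction


def lanes_to_gbytes(lanes: int, gbytes_per_lane: float) -> float:
    """Aggregate one-direction bandwidth in GB/s."""
    return lanes * gbytes_per_lane


def serial_lanes_to_gbytes(lanes: int, gbits_per_lane: float) -> float:
    """Aggregate raw one-direction bandwidth in GB/s (8 bits per byte)."""
    return lanes * gbits_per_lane / 8.0


pcie_total = lanes_to_gbytes(PCIE4_LANES, PCIE4_GBYTES_PER_LANE)
accel_total = serial_lanes_to_gbytes(ACCEL_LANES, ACCEL_GBITS_PER_LANE)

print(f"PCI-E 4.0 lanes: {pcie_total:.0f} GB/s per direction")
print(f"25Gbps lanes:    {accel_total:.0f} GB/s per direction (raw)")
```

Both groups together are still a small fraction of the 7TB/s headline number, which fits with that figure being an everything-added-together total.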

These are processors that are built to scale and tackle the big data problems. In fact, not only is Google interested in Power9 to power its services, but the US Department of Energy will be building two supercomputers using IBM’s Power9 CPUs and NVIDIA’s Volta GPUs. Summit and Sierra will offer between 100 and 300 Petaflops of computing power and will be installed at Oak Ridge National Laboratory and Lawrence Livermore National Laboratory respectively. There, some of the projects they will tackle are enabling researchers to visualize the internals of a virtual light water reactor, researching methods to improve fuel economy, and delving further into bioinformatics research.

The Power9 processors will be available in four variants that differ in the number of cores and number of threads each core supports. The chips are broken down into Power9 SO (Scale Out) and Power9 SU (Scale Up) and each group has two processors depending on whether you need a greater number of weaker cores or a smaller number of more powerful cores. Power9 SO chips are intended for multi-core systems and will be used in servers with one or two sockets while Power9 SU chips are for multi-processor systems with up to four sockets per board and up to 16 total sockets per cluster when four four-socket boards are linked together. Power9 SO uses DDR4 memory and supports a theoretical maximum of 4TB of memory (1TB with today’s 64GB DIMMs) and 120 GB/s of bandwidth while Power9 SU uses IBM’s buffered “Centaur” memory scheme that allows the systems to address a theoretical maximum of 8TB of memory (2TB with 64GB DIMMs) at 230 GB/s. In other words, the SU series is Big Blue’s “big guns.”
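The "today's 64GB DIMMs" figures above imply a DIMM count, which the short sketch below works out. IBM did not state slot counts in the material covered here, so the results are inferences from the quoted capacities (with 1TB taken as 1024GB), and the function names are purely illustrative.

```python
# Infer DIMM counts from the capacities quoted in the article.
# These are derived numbers, not IBM specifications.

GB_PER_TB = 1024


def implied_dimm_count(capacity_tb: int, dimm_gb: int) -> int:
    """How many DIMMs of a given size are needed to reach a capacity."""
    return capacity_tb * GB_PER_TB // dimm_gb


def dimm_size_for_capacity(capacity_tb: int, dimm_count: int) -> int:
    """DIMM size (GB) needed to hit a capacity with a fixed DIMM count."""
    return capacity_tb * GB_PER_TB // dimm_count


so_dimms = implied_dimm_count(1, 64)   # Power9 SO: 1TB with 64GB DIMMs
su_dimms = implied_dimm_count(2, 64)   # Power9 SU: 2TB with 64GB DIMMs

so_max_dimm = dimm_size_for_capacity(4, so_dimms)  # for the 4TB maximum
su_max_dimm = dimm_size_for_capacity(8, su_dimms)  # for the 8TB maximum

print(f"Power9 SO: {so_dimms} DIMMs, {so_max_dimm}GB each for 4TB")
print(f"Power9 SU: {su_dimms} DIMMs, {su_max_dimm}GB each for 8TB")
```

If that inference holds, both theoretical maximums depend on 256GB DIMMs becoming available, with the SU parts simply addressing twice as many of them.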

A photo of the 24 core SMT4 Power9 SO die.

Here is where it gets a bit muddy. The processors are further broken down by SMT4 or SMT8, and both Power9 SO and Power9 SU have both options. There are Power9 CPUs with 24 SMT4 cores and there are CPUs with 12 SMT8 cores. IBM indicated that SMT4 (four threads per core) was suited to systems running Linux and virtualization with emphasis on high core counts. Meanwhile SMT8 (eight threads per core) is a better option for large logical partitions (one big system versus partitioning out the compute cluster into smaller VMs as above) and running IBM’s Hypervisor. In either case (24 SMT4 or 12 SMT8) there is the same number of total threads, but you are able to choose whether you want fewer “stronger” threads on each core or more (albeit weaker) threads per core depending on which your workloads are optimized for.
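The "same number of total threads" point is easy to check with a few lines of Python. The slice and thread counts per core come straight from IBM's description (2 slices per SMT4 core, 4 slices per SMT8 core); the `CoreConfig` class itself is just a hypothetical helper for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreConfig:
    """One Power9 core layout as described in the article."""
    name: str
    cores: int
    slices_per_core: int
    threads_per_core: int

    @property
    def total_slices(self) -> int:
        return self.cores * self.slices_per_core

    @property
    def total_threads(self) -> int:
        return self.cores * self.threads_per_core


smt4 = CoreConfig("24 x SMT4", cores=24, slices_per_core=2, threads_per_core=4)
smt8 = CoreConfig("12 x SMT8", cores=12, slices_per_core=4, threads_per_core=8)

for cfg in (smt4, smt8):
    print(f"{cfg.name}: {cfg.total_slices} slices, {cfg.total_threads} threads")
```

Either way the die ends up with the same pool of execution slices and hardware threads; the difference is only in how they are grouped into cores.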

Servers supporting Power9 are already under development by Google and Rackspace and blueprints are even available from the OpenPower Foundation. Currently, it appears that Power9 SO will emerge as soon as the second half of next year (2H 2017) with Power9 SU following in 2018 which would line up with the expected date for the Summit and Sierra supercomputer launches.

This is not a chip that will be showing up in your desktop any time soon, but it is an interesting high performance processor! I will be keeping an eye on updates from Oak Ridge lab hehe.

Samsung and SK Hynix Discuss The Future of High Bandwidth Memory (HBM) At Hot Chips 28

Subject: Memory | August 25, 2016 - 06:39 AM |
Tagged: TSV, SK Hynix, Samsung, hot chips, hbm3, hbm

Samsung and SK Hynix were in attendance at the Hot Chips Symposium in Cupertino, California to (among other things) talk about the future of High Bandwidth Memory (HBM). In fact, the companies are working on two new HBM products: HBM3 and an as-yet-unbranded "low cost HBM." HBM3 will replace HBM2 at the high end and is aimed at the HPC and "prosumer" markets while the low cost HBM technology lowers the barrier to entry and is intended to be used in mainstream consumer products.

As currently planned, HBM3 (Samsung refers to its implementation as Extreme HBM) features double the density per layer and at least double the bandwidth of the current HBM2 (which so far is only used in NVIDIA's planned Tesla P100). Specifically, the new memory technology offers up to 16Gb (2GB) per layer and as many as eight (or more) layers can be stacked together using TSVs into a single chip. So far we have seen GPUs use four HBM chips on a single package, and if that holds true with HBM3 and interposer size limits, we may well see future graphics cards with 64GB of memory! Considering the HBM2-based Tesla will have 16GB and AMD's HBM-based Fury X cards had 4GB, HBM3 is a sizable jump!

Capacity is not the only benefit though. HBM3 doubles the bandwidth versus HBM2 with 512GB/s (or more) of peak bandwidth per stack! In the theoretical example of a graphics card with 64GB of HBM3 (four stacks), that would be in the range of 2 TB/s of theoretical maximum peak bandwidth! Real world may be less, but that is still on the order of terabytes per second of bandwidth, which is exciting because it opens a lot of possibilities for gaming, especially as developers push graphics further towards photo realism and resolutions keep increasing. HBM3 should be plenty for a while as far as keeping the GPU fed with data on the consumer and gaming side of things, though I'm sure the HPC market will still crave more bandwidth.
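For anyone who wants to see where the 64GB and 2 TB/s numbers come from, here is the arithmetic as a small Python sketch. The per-layer density, layer count, and per-stack bandwidth are the figures quoted above; the four-stack card is the same hypothetical configuration used in the text, not an announced product.

```python
# Capacity and peak bandwidth of a hypothetical four-stack HBM3 card.

GBIT_PER_LAYER = 16          # density per layer, in gigabits
LAYERS_PER_STACK = 8         # "as many as eight (or more)" layers
STACKS_PER_PACKAGE = 4       # assumed, matching GPUs seen so far
GBYTES_PER_SEC_PER_STACK = 512

gbytes_per_stack = GBIT_PER_LAYER * LAYERS_PER_STACK // 8  # bits to bytes
total_gbytes = gbytes_per_stack * STACKS_PER_PACKAGE
total_bandwidth = GBYTES_PER_SEC_PER_STACK * STACKS_PER_PACKAGE

print(f"Per stack: {gbytes_per_stack}GB")
print(f"Card total: {total_gbytes}GB at {total_bandwidth / 1000:.3f} TB/s peak")
```

Swapping in more layers or a higher per-stack figure (both of which the "or more" qualifiers allow for) scales the totals linearly.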

Samsung further claims that HBM3 will operate at similar (~500MHz) clocks to HBM2, but will use "much less" core voltage (HBM2 is 1.2V).

Stacked HBM memory on an interposer surrounding a processor. Upcoming HBM technologies will allow memory stacks with double the number of layers.

HBM3 is perhaps the most interesting technologically; however, the "low cost HBM" is exciting in that it will enable HBM to be used in the systems and graphics cards most people purchase. There were fewer details available on this new lower cost variant, but Samsung did share a few specifics. The low cost HBM will offer up to 200GB/s per stack of peak bandwidth while being much cheaper to produce than current HBM2. In order to reduce the cost of production, there is no buffer die or ECC support and the number of Through Silicon Via (TSV) connections has been reduced. In order to compensate for the lower number of TSVs, the pin speed has been increased to 3Gbps (versus 2Gbps on HBM2). Interestingly, Samsung would like for low cost HBM to support traditional silicon as well as potentially cheaper organic interposers. According to NVIDIA, TSV formation is the most expensive part of interposer fabrication, so making reductions there (and somewhat making up for it in increased per-connection speeds) makes sense when it comes to a cost-conscious product. It is unclear whether organic interposers will win out here, but it is nice to see them get a mention and they are an alternative worth looking into.
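The trade-off between fewer connections and faster pins can be sketched numerically. The 1024-bit per-stack interface for HBM2 is the standard width; the interface width of low cost HBM was not given, so the 512-bit case below is purely my assumption to show how a halved interface at 3Gbps would land near the quoted 200GB/s.

```python
# Peak per-stack bandwidth as a function of data pins and per-pin speed.

def stack_bandwidth_gbytes(data_pins: int, gbits_per_pin: float) -> float:
    """Peak bandwidth in GB/s for one stack (8 bits per byte)."""
    return data_pins * gbits_per_pin / 8.0


def pins_needed(target_gbytes: float, gbits_per_pin: float) -> float:
    """Data pins required to reach a target bandwidth at a given pin speed."""
    return target_gbytes * 8.0 / gbits_per_pin


hbm2_bw = stack_bandwidth_gbytes(1024, 2.0)        # standard HBM2 stack
low_cost_pins = pins_needed(200.0, 3.0)            # pins for 200GB/s at 3Gbps
half_width_bw = stack_bandwidth_gbytes(512, 3.0)   # assumed 512-bit interface

print(f"HBM2: {hbm2_bw:.0f} GB/s per stack")
print(f"200GB/s at 3Gbps needs about {low_cost_pins:.0f} data pins")
print(f"512 pins at 3Gbps: {half_width_bw:.0f} GB/s per stack")
```

In other words, roughly half the data connections at 1.5x the pin speed gets you about three quarters of HBM2's per-stack bandwidth, which is in the same ballpark as Samsung's "up to 200GB/s" figure.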

Both the HBM3 and low cost HBM memory technologies are still years away and the designs are subject to change, but so far both plans are looking rather promising. I am intrigued by the possibilities and hope to see new products take advantage of the increased performance (and in the latter case lower cost). On the graphics front, HBM3 is way too far out to see a Vega release, but it may come just in time for AMD to incorporate it into its high end Navi GPUs, and by 2020 the battle between GDDR and HBM in the mainstream should be heating up.

What are your thoughts on the proposed HBM technologies?

Source: Ars Technica

What dwells in the heart of HoloLens? Now we all know!

Subject: General Tech | August 23, 2016 - 04:40 PM |
Tagged: hololens, microsoft, Tensilica, Cherry Trail, hot chips

Microsoft revealed information about the internals of the new holographic processor used in their HoloLens at Hot Chips, giving us the first peek we have had.  The new headset is another win for Tensilica as they provide the DSP and instruction extensions; previously we have seen them work with VIA to develop an SSD controller and with AMD for TrueAudio solutions.  Each of the 24 cores has a different task it is hardwired for, offering more efficient processing than software running on flexible hardware.

The processing power for your interface comes from a 14nm Cherry Trail processor with 1GB of DDR and yes, your apps will run on Windows 10.  For now the details are still sparse; there is still a lot to be revealed about Microsoft's answer to VR.  Drop by The Register for more slides and info.


"The secretive HPU is a custom-designed TSMC-fabricated 28nm coprocessor that has 24 Tensilica DSP cores. It has about 65 million logic gates, 8MB of SRAM, and a layer of 1GB of low-power DDR3 RAM on top, all in a 12mm-by-12mm BGA package. We understand it can perform a trillion calculations a second."

Here is some more Tech News from around the web:

Tech Talk

Source: The Register

A hint of what is to come from Hot Chips

Subject: General Tech | August 25, 2015 - 06:57 PM |
Tagged: amd, hot chips, SK Hynix

Thanks to DigiTimes we are getting some information out of Hot Chips about what is coming up from AMD.  As Sebastian just posted we now have a bit more about the R9 Nano and you can bet we will see more in the near future.  They also describe the new HBM developed in partnership with SK Hynix; 4GB of high-bandwidth memory over a 4096-bit interface will offer an impressive 512GB/s of memory bandwidth.  We also know a bit more about the new A-series APUs which will range up to 12 compute cores, four Excavator based CPU cores and eight GCN based GPU cores.  They will also be introducing a new power saving feature called Adaptive Voltage and Frequency Scaling (AVFS) and will support the new H.265 compression standard.  Click on through to DigiTimes or wait for more pictures and documentation to be released from Hot Chips.
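As a sanity check on the HBM numbers, the per-pin data rate falls out of the interface width and total bandwidth. The sketch below takes the bandwidth as 512 gigabytes per second and assumes the standard 1024-bit interface per first generation HBM stack; the variable names are only illustrative.

```python
# Derive per-pin data rate and stack count for the 4GB HBM configuration.

INTERFACE_BITS = 4096          # total memory interface width
BANDWIDTH_GBYTES = 512         # total peak bandwidth in GB/s
CAPACITY_GBYTES = 4            # total HBM capacity
BITS_PER_STACK = 1024          # standard HBM interface width per stack

gbits_per_pin = BANDWIDTH_GBYTES * 8 / INTERFACE_BITS
stacks = INTERFACE_BITS // BITS_PER_STACK
gbytes_per_stack = CAPACITY_GBYTES / stacks
bandwidth_per_stack = BANDWIDTH_GBYTES / stacks

print(f"Per-pin rate: {gbits_per_pin:.1f} Gbps")
print(f"{stacks} stacks of {gbytes_per_stack:.0f}GB at {bandwidth_per_stack:.0f} GB/s each")
```

A very wide interface running at a modest per-pin speed is the whole point of HBM compared with narrower, much faster clocked GDDR5.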


"AMD is showcasing its new high-performance accelerated processing unit (APU), codenamed Carrizo, and the new AMD Radeon R9 Fury family of GPUs, codenamed Fiji, at the annual Hot Chips symposium."

Source: DigiTimes

Report: Leaked Slide From AMD Gives Glimpse of R9 Nano Performance

Subject: Graphics Cards | August 24, 2015 - 06:37 PM |
Tagged: rumor, report, Radeon R9 Nano, R9 290X, leak, hot chips, hbm, amd

A report from German-language tech site Golem contains what appears to be a slide leaked from AMD's GPU presentation at Hot Chips in Cupertino, and the results paint a very efficient picture of the upcoming Radeon R9 Nano GPU.


The spelling of "performance" doesn't mean this is fake, does it?

While only managing 3 FPS better than the Radeon R9 290X in this particular benchmark, this result was achieved with 1.9x the performance per watt of the baseline 290X in the test. The article speculates on the possible clock speed of the R9 Nano based on the relative performance, and estimates 850 MHz (which is of course up for debate as no official specs are known).

The most compelling part of the result has to be the ability of the Nano to match or exceed the R9 290X in performance, while only requiring a single 8-pin PCIe connector and needing an average of only 175 watts. With a mini-ITX friendly 15 cm board (5.9 inches) this could be one of the more compelling options for a mini gaming rig going forward.

We have a lot of questions that have yet to be answered of course, including the actual speed of both core and HBM, and just how quiet this air-cooled card might be under load. We shouldn't have to wait much longer!


It's not just Broadwell today, we also have Seattle news

Subject: General Tech | August 11, 2014 - 05:47 PM |
Tagged: amd, seattle, hot chips

AMD has been showing off a reference Seattle-based server at Hot Chips and The Tech Report had an opportunity to see it.  Eight 64-bit Cortex-A57 cores are set up in pairs, each pair sharing 1MB of L2 cache while the 8MB of L3 cache is accessible by all eight cores as well as the coprocessors, memory controller, and I/O subsystems.  The system can address up to 128GB of DDR3 or DDR4, and you get support for 8 SATA 6Gbps ports and 8 lanes of PCIe 3.0 to apportion between the slots.  There is a secure System Control Processor, a partitioned Cortex-A5 core with its own ROM, RAM, and I/O to handle power, boot and configuration control with support for TrustZone, as well as a Cryptographic Coprocessor which accelerates all encryption processes as you might well expect.  Read on for more information about AMD's unique new take on server technology.


"For some time now, the features of AMD's Seattle server processor have been painted in broad brush strokes. This morning, at the Hot Chips symposium, AMD is filling in most of the missing details. We were treated to an advance briefing last week, where AMD provided previously confidential information about Seattle's cache network, memory controller, I/O features, and coprocessors."

Come on AMD, spill the beans on Steamroller already

Subject: General Tech | September 6, 2012 - 06:58 PM |
Tagged: vishera, trinity, Steamroller, piledriver, hot chips, bulldozer, amd, Abu Dhabi

You've seen the slides everywhere and read through what Josh could observe and predict from those slides, but at the end of Hot Chips we will still know little more about the core everyone is waiting for.  The slides show a core little changed from Bulldozer, which is exactly what we've been expecting as AMD has always described Steamroller as a refined Bulldozer design, improving the existing architecture as opposed to a complete redesign.  SemiAccurate did pull out one little gem from the high density libraries slide which might mean good news for both AMD and consumers.  The 30% decrease in size and power consumption seems to have been implemented by simply using the high density libraries that AMD uses for GPUs.  As this library already exists, AMD didn't need to spend money to develop it; they essentially managed this 30% improvement with a button press, as SemiAccurate put it.  This could well mean that Steamroller will either come out at a comparatively low price or will give AMD higher profit margins ... or a mix of both.


"With that in mind, the HDL slide was rather interesting. AMD is claiming that if you rebuild Bulldozer with an HDL library, the resulting chip has a 30% decrease in size and power use. To AMD at least, this is worth a full shrink, but we only buy that claim if it is 30% smaller and 30% less power hungry, not 30% in aggregate. That said, it is a massive gain with just a button press.

AMD should be applauded, or it would have been, but during the keynote, the one thing that kept going through my mind was, “Why didn’t they do this 5 years ago?”. If you can get 30% from changing out a library to the ones you build your GPUs with, didn’t someone test this out before you decided on layout tools?"

Source: SemiAccurate

Fee PHI fo fum; Intel changes the smell of a Pentium

Subject: General Tech | September 5, 2012 - 07:49 PM |
Tagged: Xeon Phi, xeon, larrabee, knights corner, Intel, hot chips

The Register is back with more information from Hot Chips about Intel's Xeon Phi coprocessor, which seems to be much more than just a GPU in drag.  Inside the shell you will find at least 50 cores and at least 8GB of GDDR5 graphics memory, with the cores being very heavily modified Pentium P54C designs built on the 22-nanometer Tri-Gate process and clocked somewhere between 1.2GHz and 1.6GHz.  There is a brand new Vector Processing Unit which processes 512-bit SIMD instructions and sports an Extended Math Unit to handle calculations in hardware rather than software.  Read on for more details about the high-speed ring interconnects that allow these cores to communicate among themselves and with the Xeon server it will be a part of.


"Intel has been showing off the performance of the "Knights Corner" x86-based coprocessor for so long that it's easy to forget that it is not yet a product you can actually buy. Back in June, Knights Corner was branded as the "Xeon Phi", making it clear that Phi was a Xeon coprocessor even if it does not bear a lot of resemblance to the Xeon processors at the heart of the vast majority of the world's servers."

Source: The Register

A lot of little Phi coprocessors lightens the load

Subject: General Tech | August 31, 2012 - 06:43 PM |
Tagged: Intel, xeon, Xeon Phi, hot chips, larrabee

The Xeon Phi is not Larrabee but it does give a chance to remind people that Intel did at one time swear we would be seeing huge results from a lot of strung together Pentium chips.  Nor is Many Integrated Cores the same as AMD's Magny-cours, although you can be forgiven if that thought popped into your head.  Instead the Xeon Phi is a co-processor that will have 50 or more cores with 512-bit SIMD units, each with 512KB of Level 2 cache.  These cores are comparatively slow on their own but have been designed to spread tasks over dozens of cores for parallel processing to make up for the lack of individual power.  Intel sees Phi as a way to create HPC servers which will be physically smaller and more efficient than ones based solely on traditional Xeons.  There is still a lot more we need to learn about these chips; until then you can check out The Inquirer's article on Intel's answer to NVIDIA and AMD's HPC cards.
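To get a feel for what "lots of slow cores with wide SIMD" adds up to, here is a rough Python sketch. It uses the minimum figures quoted above (50 cores, 512-bit SIMD, 512KB of L2 per core); the assumption of one SIMD unit per core and the choice of 32-bit and 64-bit element sizes are mine, and no clock speed or FLOPS figure is implied.

```python
# Aggregate SIMD width and L2 cache for a 50-core Xeon Phi.

CORES = 50                 # "50 or more" per the article; minimum used here
SIMD_BITS = 512            # SIMD width per core
L2_KB_PER_CORE = 512


def lanes_per_core(simd_bits: int, element_bits: int) -> int:
    """Number of elements one SIMD instruction can operate on."""
    return simd_bits // element_bits


sp_lanes = lanes_per_core(SIMD_BITS, 32)   # single precision floats
dp_lanes = lanes_per_core(SIMD_BITS, 64)   # double precision floats

total_sp_lanes = CORES * sp_lanes
total_dp_lanes = CORES * dp_lanes
total_l2_mb = CORES * L2_KB_PER_CORE / 1024

print(f"Per core: {sp_lanes} SP lanes or {dp_lanes} DP lanes")
print(f"Chip: {total_sp_lanes} SP lanes, {total_dp_lanes} DP lanes, {total_l2_mb:.0f}MB L2")
```

Each individual core may be modest, but hundreds of vector lanes working in parallel is where the throughput is meant to come from.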


"CHIPMAKER Intel revealed some architectural details of its upcoming Xeon Phi accelerator at the Hotchips conference, saying that the chip will feature 512-bit SIMD units."

Source: The Inquirer
Subject: Processors
Manufacturer: AMD

HotChips 2012


Ah, the end of August.  School is about to start.  American college football is about to get underway.  Hot Chips is now in full swing.  I guess the end of August caters to all sorts of people.  For the people who are most interested in Hot Chips, the amount of information on next generation CPU architectures is something to really look forward to.  AMD is taking this opportunity to give us a few tantalizing bits of information about their next generation Steamroller core which will be introduced with the codenamed “Kaveri” APU due out in 2013.


AMD is seemingly on the brink of releasing the latest architectural update with Vishera.  This is a Piledriver+ based CPU that will find its way into AM3+ sockets.  On the server side it is expected that the Abu Dhabi processors will also be released in a late September timeframe.  Trinity was the first example of a Piledriver based product, and it showed markedly improved thermals as compared to previous Bulldozer based products, and featured a nice little bump in IPC in both single and multi-threaded applications.  Vishera and Abu Dhabi look to be Piledriver+, which essentially means that there are a few more tweaks in the design that *should* allow it to go faster per clock than Trinity.  There have been a few performance leaks so far, but nothing that has been concrete (or has shown final production-ready silicon).

Until that time when Vishera and its ilk are released, AMD is teasing us with some Steamroller information.  This presentation is featured at Hot Chips today (August 28).  It is a very general overview of improvements, but very few details are given about how AMD is achieving increased performance with this next gen architecture.  So with that, I will dive into what information we have.
