GDC 2014: Shader-limited Optimization for AMD's GCN

Subject: Editorial, General Tech, Graphics Cards, Processors, Shows and Expos | March 29, 2014 - 10:45 PM |
Tagged: gdc 14, GDC, GCN, amd

While Mantle and DirectX 12 are designed to reduce overhead and keep GPUs fed, the conversation shifts when you are limited by shader throughput. Modern graphics processors are dominated by their compute cores, sometimes numbering in the thousands. Video drivers are complex packages of software, and one of their many tasks is compiling your shader programs into machine code for the hardware. If that machine code is efficient, it can mean drastically higher frame rates, especially at extreme resolutions and intense quality settings.

amd-gcn-unit.jpg

Emil Persson of Avalanche Studios, probably best known for the Just Cause franchise, published the slides and speech from his talk on optimizing shaders. His talk focuses on AMD's GCN architecture, because it is found in both consoles and PCs, while bringing up older GPUs for examples. Yes, he has many snippets of GPU assembly code.

AMD's GCN architecture is actually quite interesting, especially dissected as it was in the presentation. It is simpler than its ancestors and much more CPU-like: resources are mapped to memory (and caches of that memory) rather than to "slots" (although drivers and APIs often pretend those relics still exist), vectors are mostly treated as collections of scalars, and so forth. Tricks that try to pack work into vector instructions, such as using dot products, can simply put irrelevant restrictions on the compiler and optimizer, because the hardware breaks those vector operations back down into the very same component-by-component operations you thought you were avoiding.
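To make that scalar point concrete, here is a minimal sketch in CUDA (standing in as generic GPU code; these functions are our own illustration, not snippets from Persson's slides) showing that a four-component dot product written as one "vector" expression amounts to the same per-component multiply-adds you could have written by hand on scalar-per-lane hardware like GCN.

// Hypothetical illustration, not from the talk: on scalar-per-lane hardware,
// a packed dot product is executed component by component anyway, so packing
// work into one is not a win by itself.
__device__ float dot_packed(float4 a, float4 b)
{
    // Reads like a single vector operation in source code...
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

__device__ float dot_unrolled(float4 a, float4 b)
{
    // ...but compiles to roughly the same thing as this explicit
    // multiply followed by three fused multiply-adds.
    float r = a.x * b.x;
    r = fmaf(a.y, b.y, r);
    r = fmaf(a.z, b.z, r);
    r = fmaf(a.w, b.w, r);
    return r;
}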

Basically, and it makes sense coming from GDC, this talk rarely glosses over points. It goes over the execution speed of individual operations relative to one another, at various precisions, and which ones to avoid (protip: integer divide). Also, fused multiply-add is awesome.
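As a flavour of those per-instruction costs, here is another hypothetical CUDA-style sketch (ours, not from the slides): dividing by an arbitrary runtime integer expands into a long emulated instruction sequence on GPUs, a power-of-two divisor collapses to a single shift, and writing a multiply-add through fmaf makes the fused instruction explicit.

// Hypothetical examples of the instruction-cost trades discussed in the talk.
__device__ int bucket_slow(int index, int bucket_size)
{
    return index / bucket_size;        // emulated integer divide: many instructions
}

__device__ int bucket_pow2(int index, int log2_bucket_size)
{
    return index >> log2_bucket_size;  // power-of-two divisor: a single shift
}

__device__ float scale_and_bias(float x, float scale, float bias)
{
    return fmaf(x, scale, bias);       // one fused multiply-add instead of separate mul + add
}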

I know I learned.

As a final note, this returns to the discussions we had prior to the launch of the next-generation consoles. Developers are learning how to make their shader code much more efficient on GCN, and that could easily translate to leading PC titles. Especially with DirectX 12 and Mantle, which lighten the CPU-based bottlenecks, learning how to do more work per FLOP addresses the other side of the equation. Everyone was looking at Mantle as AMD's play for success through harnessing console mindshare (and in terms of Intel vs. AMD, it might help). But honestly, I believe trends like this presentation will prove more significant, even if they stay behind the scenes. Of course developers were always having these discussions, but now console developers will mostly be talking about a single architecture; that is a lot of people talking about very few things.

This is not really reducing overhead; this is teaching people how to do more work with less, especially in situations (high resolutions with complex shaders) where the GPU is most relevant.

Taking the A10-7850K out for a spin and leaving marks on the bench

Subject: Processors | March 27, 2014 - 12:44 PM |
Tagged: Kaveri, APU, amd, A10-7850K

It is about time we took a look at AMD's new flagship processor, the A10-7850K Kaveri chip running at 3.7GHz, or 4GHz at full boost, with 4 Steamroller CPU cores and 8 GCN (Hawaii-class) GPU cores.  While we are still shy on HSA benchmarks at the moment, HiTech Legion did have a chance to do some Mantle testing with the APU alone and paired with a discrete GPU, which showed off some of the benefits of Mantle.  They also reached a decent overclock, a hair shy of 4.5GHz on air, which is not too shabby for a processor that costs under $200.  Check out the full review here.

tech1.jpg

"AMD has launched their fourth generation of APU, codenamed “Kaveri”. Kaveri boasts increased processor power coupled with advanced Radeon graphics but there are other technologies, such as HSA, that balance memory loads via “compute” to both the CPU and GPU."

Here are some more Processor articles from around the web:


GDC 14: Intel Ready Mode offers low power, always connected desktops

Subject: Processors, Systems | March 19, 2014 - 05:00 PM |
Tagged: ready mode, Intel, gdc 14, GDC

Intel Ready Mode is a new technology that looks to offer some of the features of connected standby for desktop and all-in-one PCs while using new power states of the Haswell architecture to keep power consumption incredibly low.  By combining a 4th Generation Core processor from Intel, a properly implemented motherboard and platform, and new Intel or OEM software, users can access the data on their system or push data to it without "waking up" the machine.

readymode1.jpg

This feature is partially enabled by the C7 state added to the Haswell architecture with the 4th Generation Core processors but could require motherboard and platform providers to update implementations to properly support the incredibly low idle power consumption.  

To be clear, this is not a desktop implementation of Microsoft InstantGo (Connected Standby) but a unique and more flexible approach.  While InstantGo only works on Windows 8 and with Metro applications, Intel Ready Mode works with both Windows 7 and Windows 8 and actually keeps the machine awake and active, just at a very low power level.  This not only lets users keep their software up to date and ready for when they want to use the PC, it also enables access to the machine from a remote location - all while in this low power state.

How low?  A note on Intel's slide mentions that Fujitsu launched a feature called Low Power Active Mode in 2013 that was able to hit 5 watts when following Intel's guidelines. You can essentially consider this an incredibly low power "awake" state for Intel PCs.

readymode2.jpg
 

Intel offers up some suggested usage models for Ready Mode, and I will be interested to see which OEMs integrate support for this technology and whether DIY users will be able to take advantage of it as well. Lenovo, ASUS, Acer, ECS, HP, and Fujitsu are supporting it this year.

Intel Confirms Haswell-E, 8-core Extreme Edition with DDR4 Memory

Subject: Processors | March 19, 2014 - 05:00 PM |
Tagged: X99, Intel, Haswell-E, gdc 14, GDC, ddr4

While talking with press at GDC in San Francisco today, Intel is pulling out all the stops to assure enthusiasts and gamers that they haven't been forgotten!  Since the initial release of the first Extreme Edition processor in 2003 (Pentium 4), Intel has moved from 178 million transistors to over 1.8 BILLION (Ivy Bridge-E). Today Intel officially confirms that Haswell-E is coming!

haswelle.jpg

Details are light, but we know now that this latest incarnation of the Extreme Edition processor will be an 8-core design, running on a new Intel X99 chipset and will be the first to support DDR4 memory technology.  I think most of us are going to be very curious about the changes, both in pricing and performance, that the new memory technology will bring to the table for enthusiast and workstation users.

Timing is only listed as the second half of 2014, so we are going to be (impatiently) waiting along with you for more details.

Based only on leaks that we found last week, the X99 chipset and Haswell-E will continue to have 40 lanes of PCI Express but increase the number of SATA 6G ports from two to ten (!!) and USB 3.0 ports to six.

Intel brings Iris Pro Graphics to Broadwell in LGA Sockets

Subject: Processors | March 19, 2014 - 05:00 PM |
Tagged: LGA, iris pro, Intel, gdc 14, GDC, Broadwell

We have great news for you this evening!  The demise of the LGA processor socket for Intel desktop users has been greatly exaggerated.  During a press session at GDC we learned not only that Intel will be offering LGA-based processors for Broadwell upon its release (which we did not get more details on), but also that there will be an unlocked SKU with Iris Pro graphics.

broadwell.jpg

Iris Pro, in its current version, is a high performance version of Intel's processor graphics that includes 128MB of embedded DRAM (eDRAM).  When we first heard that Iris Pro was not coming to the desktop market with an LGA1150 SKU we were confused and bitter but it seems that Intel was listening to feedback.  Broadwell will bring with it the first socketed version of Iris Pro graphics!

It's also nice to know that the rumors surrounding Intel's removal of the socket option for DIY builders were incorrect, or that the plan was possibly averted by the reaction. The enthusiast lives on!!

UPDATE: Intel has just confirmed that the upcoming socketed Broadwell CPUs will be compatible with 9-series motherboards that will be released later this spring. This should offer a nice upgrade path for users going into 2015.

Intel Devil's Canyon Offers Haswell with Improved TIM, 9-series Chipsets

Subject: Processors | March 19, 2014 - 05:00 PM |
Tagged: tim, Intel, haswell, gdc 14, GDC, 9-series

An update to the existing Haswell 4th Generation Core processors will be hitting retail sometime in mid-2014, according to what Intel has just told us. This new version of the existing processors will include new CPU packaging and the oft-requested improved thermal interface material (TIM).  Overclockers have frequently claimed that the changes Intel made to the TIM were limiting performance; it seems Intel has listened to the community and will be updating some parts accordingly.

haswellplus.jpg

Recent leaks have indicated we'll see modest frequency increases in some of the K-series parts, in the 100 MHz range.  All Intel is saying today, though, is what you see on that slide. Overclocks should improve with the new thermal interface material, but by how much isn't yet known.

These new processors, under the platform code name of Devil's Canyon, will target the upcoming 9-series chipsets.  When I asked about support for 8-series chipset users, Intel would only say that those motherboards "are not targeted" for the refreshed Haswell CPUs.  I would not be surprised though to see some motherboard manufacturers attempt to find ways to integrate board support through BIOS/UEFI changes.

Though these are only slight refreshes, combining the Haswell Devil's Canyon release with the news about X99 and Haswell-E makes it look like 2014 is shaping up to be a pretty interesting year for the enthusiast community!

Intel "Wellsburg" Leaks: Haswell-E's X99 Chipset

Subject: General Tech, Processors, Chipsets | March 13, 2014 - 12:35 AM |
Tagged: Intel, Haswell-E, X99

Though Ivy Bridge-E is not too distant of a memory, Haswell-E is on the horizon. The enthusiast version of Intel's architecture will come with a new motherboard chipset, the X99. (As an aside: what do you think its eventual successor will be called?) WCCFTech got their hands on details, albeit some of which have been kicking around for a few months, outlining the platform.

Intel-X99-Wellsburg-Chipset-635x426.jpg

Image Credit: WCCFTech

First and foremost, Haswell-E (and X99) will support DDR4 memory. Its main benefit is increased bandwidth at a lower operating voltage, which translates to lower power consumption. The platform will support four memory channels.

Haswell-E will continue to have 40 PCIe lanes (the user's choice between five x8 slots or two x16 slots plus a x8 slot). This is the same number of total lanes as seen on Sandy Bridge-E and Ivy Bridge-E. While LGA 2011-3 is not compatible with LGA 2011, it does share that aspect.

X99 does significantly increase the number of SATA ports, to ten SATA 6Gbps (up from two SATA 6Gbps and four SATA 3Gbps). Intel Rapid Storage Technology (RST), Smart Response Technology, and Rapid Recover Technology are also present and accounted for. The chipset also supports six native USB 3.0 ports and an additional eight USB 2.0 ones.

Intel Haswell-E and X99 are expected to launch sometime in Q3 2014.

Source: WCCFTech
Subject: Processors
Manufacturer: AMD

Low Power and Low Price

Back at CES earlier this year, we came across a couple of interesting motherboards that were neither AM3+ nor FM2+.  These small, sparse, and inexpensive boards were based on the then-unannounced AM1 platform.  The AM1 socket is actually the FS1b socket typically reserved for mobile applications that require swappable APUs.  The goal here is to provide a low cost, upgradeable platform for emerging markets where price is absolutely key.

am1_01.jpg

AMD has not exactly been living on easy street for the past several years.  Their CPU technologies, their bread and butter, have not been entirely competitive with Intel's.  Helping to prop the company up, though, is a very robust and competitive graphics unit.  The standalone and integrated graphics technology they offer is not only competitive, but also class leading in some cases.  The integration of AMD's GCN architecture into APUs has been their crowning achievement as of late.

This is not to say that AMD is totally deficient in their CPU designs.  Their low power/low cost designs that started with the Bobcat architecture all those years back have always been very competitive in terms of performance, price, and power consumption.  The latest iteration is the Kabini APU based on the Jaguar core architecture paired with GCN graphics.  Kabini will be the part going into the FS1b socket that powers the AM1 platform.

am1_02.jpg

Kabini is a four core processor (Jaguar) with a 128 shader GCN graphics part (two GCN compute units).  These APUs will be rated at 25 watts up and down the stack; even parts with half the cores will still carry a 25 watt rating.  AMD says that 25 watts is the sweet spot in terms of performance, cooling, and power consumption.  Go lower than that and too much performance is sacrificed; go any higher and it would make more sense to use a Trinity/Richland/Kaveri solution.  That 25 watt figure also encompasses the primary I/O functionality that typically resides on a standalone motherboard chipset.  Kabini features 2 SATA 6G ports, 2 USB 3.0 ports, and 8 USB 2.0 ports.  It also features multiple PCI-E lanes as well as an x4 PCI-E connection for external graphics.  The chip also supports DisplayPort, HDMI, and VGA outputs.  This is a true SoC from AMD that does a whole lot of work for not a whole lot of power.

Click here to read the rest of the article!

Samsung Releases 8-Core and 6-Core 32-Bit Exynos 5 SoCs

Subject: Processors | February 26, 2014 - 08:46 PM |
Tagged: SoC, Samsung, exynos 5, big.little, arm, 28nm

Samsung recently announced two new 32-bit Exynos 5 processors: the eight-core Exynos 5 Octa 5422 and the six-core Exynos 5 Hexa 5260. Both SoCs utilize a combination of ARM Cortex-A7 and Cortex-A15 CPU cores along with ARM's Mali graphics. Unlike previous Exynos 5 chips, the upcoming processors use a big.LITTLE configuration variant called big.LITTLE MP that allows all CPU cores to be used simultaneously. Samsung continues to use a 28nm process node, and the SoCs should be available for use in smartphones and tablets immediately.

The Samsung Exynos 5 Octa 5422 offers up eight CPU cores and an ARM Mali T628 MP6 GPU. The CPU configuration consists of four Cortex-A15 cores clocked at 2.1GHz and four Cortex-A7 cores clocked at 1.5GHz. Devices using this chip will be able to tap all eight cores at the same time for demanding workloads, allowing the device to complete the computations and return to a lower-power or sleep state sooner. Devices using previous generation Exynos chips were faced with an either-or choice between the A15 and A7 groups of cores, but the big.LITTLE MP configuration opens up new possibilities.

Samsung Exynos 5 Hexa 5260.jpg

While the Octa 5422 occupies the new high end for the lineup, the Exynos 5 Hexa 5260 is a new midrange chip that is the first six core Exynos product. This chip uses an as-yet-unnamed ARM Mali GPU along with six ARM cores. The configuration on this SoC is four low power Cortex-A7 cores clocked at 1.3GHz paired with two Cortex-A15 cores clocked at 1.7GHz. Devices can use all six cores at a time or more selectively. The Hexa 5260 offers up two higher powered cores for single threaded performance along with four power sipping cores for running background tasks and parallel workloads.

The new chips offer up access to more cores for more performance at the cost of higher power draw. While the additional cores may seem like overkill for checking email and surfing the web, the additional power can enable things like onboard voice recognition, machine vision, faster photo filtering and editing, and other parallel-friendly tasks. Notably, the GPU should be able to assist with some of this parallel processing, but GPGPU is still relatively new whereas developers have had much more time to familiarize themselves with and optimize applications for multiple CPU threads. Yes, the increasing number of cores lends itself well to marketing, but that does not preclude them from having real world performance benefits and application possibilities. As such, I'm interested to see what these chips can do and what developers are able to wring out of them.

Source: Ars Technica

Video Perspective: Gaming on an Overclocked AMD A10-7850K APU

Subject: Graphics Cards, Processors | February 26, 2014 - 04:18 PM |
Tagged:

Overclocking the memory and GPU clock speeds on an AMD APU can greatly improve gaming performance - it is known.  With the new AMD A10-7850K in hand I decided to do a quick test and see how much we could improve average frame rates for mainstream gamers with only some minor tweaking of the motherboard BIOS.  

Using some high-end G.Skill RipJaws DDR3-2400 memory, we were able to push memory speeds on the Kaveri APU up to 2400 MHz, a 50% increase over the stock 1600 MHz rate.  We also increased the clock speed on the GPU portion of the A10-7850K from 720 MHz to 1028 MHz, a 42% boost.  Interestingly, as you'll see in the video below, the memory speed had a MUCH more dramatic impact on our average frame rates in-game.  

In the three games we tested for this video - GRID 2, Bioshock Infinite, and Battlefield 4 - the total performance gain ranged from 26% to 38%, as shown in the table below.  Clearly that can make the AMD Kaveri APU an even more potent gaming platform if you are willing to shell out for the high speed memory.

Game (1920x1080)               Stock      GPU OC     Memory OC  Total OC   Avg FPS Change
Battlefield 4 (Medium)         22.4 FPS   23.7 FPS   28.2 FPS   29.1 FPS   +29%
GRID 2 (High + 2xAA)           33.5 FPS   36.3 FPS   41.1 FPS   42.3 FPS   +26%
Bioshock Infinite (Low)        30.1 FPS   30.9 FPS   40.2 FPS   41.8 FPS   +38%