Gunning for Broadwell-E
As I walked away from the St. Regis in downtown San Francisco tonight, I found myself wandering through the streets towards my hotel with something unique in tow. It was a smile. I was smiling, thinking about what AMD had just demonstrated and showed at its latest Zen processor reveal. The importance of this product launch can literally not be overstated for a company struggling to find a foothold to hang on to in a market that it once had a definitive lead. It’s been many years since I left a conference call, or a meeting, or a press conference feeling genuinely hopefully and enthusiastic about what AMD has shown me. Tonight I had that.
AMD’s CEO Lisa Su, and CTO Mark Papermaster, took stage down the street from the Intel Developer Forum to roll out a handful of new architectural details about the Zen architecture while also showing the first performance results comparing it to competing parts from Intel. The crowd in attendance, a mix of media and analysts, were impressed. The feeling was palpable in the room.
It’s late as I write this, and while there are some interesting architecture details to discuss, I think it is in everyone’s best interest that we touch on them lightly for now, and instead refocus on the deep-dive once the Hot Chips information comes out early next week. What you really want to know is clear: can Zen make Intel work again? Can Zen make that $1700 price tag on the Broadwell-E 6950X seem even more ludicrous? Yes.
The Zen Architecture
Much of what was discussed from the Zen architecture is a re-release of what has been out in recent months. This is a completely new, from the ground up, microarchitecture and not a revamp of the aging Bulldozer design. It integrated SMT (simultaneous multi-threading), a first for an AMD CPU, to better take efficient advantage of a longer pipeline. Intel has had HyperThreading for a long time now and AMD is finally joining the fold. A high bandwidth and low latency caching system is used to “feed the beast” as Papermaster put it and utilizing 14nm process technology (starting at Global Foundries) gives efficiency, and scaling a significant bump while enabling AMD to scale from notebooks to desktops to servers with the same architecture.
By far the most impressive claim from AMD thus far was that of a 40% increase in IPC over previous AMD designs. That’s a HUGE claim and is key to the success or failure of Zen. AMD proved to me today that the claims are real and that we will see the immediate impact of that architecture bump from day one.
Press was told of a handful of high level changes to the new architecture as well. Branch prediction gets a complete overhaul. This marks the first AMD processor to have a micro-op cache. Wider execution width with broader instruction schedulers are integrated, all of which adds up to much higher instruction level parallelism to improve single threaded performance.
Performance improvements aside, throughput and efficiency go up with Zen as well. AMD has integrated an 8MB L3 cache and improved prefetching for up 5x the cache bandwidth available per core on the CPU. SMT makes sure the pipeline stays full to prevent “bubbles” that introduce latency and lower efficiency while region-specific power gating means that we’ll see Zen in notebooks as well as enterprise servers in 2017. It truly is an impressive design from AMD.
Summit Ridge, the enthusiast platform that will be the first product available with Zen, is based on the AM4 platform and processors will go up to 8-cores and 16-threads. DDR4 memory support is included, PCI Express 3.0 and what AMD calls “next-gen” IO – I would expect a quick leap forward for AMD to catch up on things like NVMe and Thunderbolt.
The Real Deal – Zen Performance
As part of today’s reveal, AMD is showing the first true comparison between Zen and Intel processors. Sure, AMD showed a Zen-powered system running the upcoming Deus Ex running at 4K with a system powered by the Fury X, but the really impressive results where shown when comparing Zen to a Broadwell-E platform.
Using Blender to measure the performance of a rendering workload (a Zen CPU mockup of course), AMD ran an 8-core / 16-thread Zen processor at 3.0 GHz against an 8-core / 16-thread Broadwell-E processor at 3.0 GHz (likely a fixed clocked Core i7-6900K). The point of the demonstration was to showcase the IPC improvements of Zen and it worked: the render completed on the Zen platform a second or two faster than it did on the Intel Broadwell-E system.
Not much to look at, but Zen on the left, Broadwell-E on the right...
Of course there are lots of caveats: we didn’t setup the systems, I don’t know for sure that GPUs weren’t involved, we don’t know the final clocks of the Zen processors releasing in early 2017, etc. But I took two things away from the demonstration that are very important.
- The IPC of Zen is on-par or better than Broadwell.
- Zen will scale higher than 3.0 GHz in 8-core configurations.
AMD obviously didn’t state what specific SKUs were going to launch with the Zen architecture, what clock speeds they would run at, or even what TDPs they were targeting. Instead we were left with a vague but understandable remark of “comparable TDPs to Broadwell-E”.
Pricing? Overclocking? We’ll just have to wait a bit longer for that kind of information.
There is clearly a lot more for AMD to share about Zen but the announcement and showcase made this week with the early prototype products have solidified for me the capability and promise of this new microarchitecture. We have asked for, and needed, as an industry, a competitor to Intel in the enthusiast CPU space – something we haven’t legitimately had since the Athlon X2 days. Zen is what we have been pining over, what gamers and consumers have needed.
AMD’s processor stars might finally be aligning for a product that combines performance, efficiency and scalability at the right time. I’m ready for it –are you?
Subject: Graphics Cards, Processors | August 17, 2016 - 01:38 PM | Scott Michaud
Tagged: Xeon Phi, larrabee, Intel
Tom Forsyth, who is currently at Oculus, was once on the core Larrabee team at Intel. Just prior to Intel's IDF conference in San Francisco, which Ryan is at and covering as I type this, Tom wrote a blog post that outlined the project and its design goals, including why it didn't hit market as a graphics device. He even goes into the details of the graphics architecture, which was almost entirely in software apart from texture units and video out. For instance, Larrabee was running FreeBSD with a program, called DirectXGfx, that gave it the DirectX 11 feature set -- and it worked on hundreds of titles, too.
Also, if you found the discussion interesting, then there is plenty of content from back in the day to browse. A good example is an Intel Developer Zone post from Michael Abrash that discussed software rasterization, doing so with several really interesting stories.
Subject: General Tech, Processors, Displays, Shows and Expos | August 16, 2016 - 01:50 PM | Ryan Shrout
Tagged: VR, virtual reality, project alloy, Intel, augmented reality, AR
At the opening keynote to this summer’s Intel Developer Forum, CEO Brian Krzanich announced a new initiative to enable a completely untether VR platform called Project Alloy. Using Intel processors and sensors the goal of Project Alloy is to move all of the necessary compute into the headset itself, including enough battery to power the device for a typical session, removing the need for a high powered PC and a truly cordless experience.
This is indeed the obvious end-game for VR and AR, though Intel isn’t the first to demonstrate a working prototype. AMD showed the Sulon Q, an AMD FX-based system that was a wireless VR headset. It had real specs too, including a 2560x1440 OLED 90Hz display, 8GB of DDR3 memory, an AMD FX-8800P APU with R7 graphics embedded. Intel’s Project Alloy is currently using unknown hardware and won’t have a true prototype release until the second half of 2017.
There is one key advantage that Intel has implemented with Alloy: RealSense cameras. The idea is simple but the implications are powerful. Intel demonstrated using your hands and even other real-world items to interact with the virtual world. RealSense cameras use depth sensing to tracking hands and fingers very accurately and with a device integrated into the headset and pointed out and down, Project Alloy prototypes will be able to “see” and track your hands, integrating them into the game and VR world in real-time.
The demo that Intel put on during the keynote definitely showed the promise, but the implementation was clunky and less than what I expected from the company. Real hands just showed up in the game, rather than representing the hands with rendered hands that track accurately, and it definitely put a schism in the experience. Obviously it’s up to the application developer to determine how your hands would actually be represented, but it would have been better to show case that capability in the live demo.
Better than just tracking your hands, Project Alloy was able to track a dollar bill (why not a Benjamin Intel??!?) and use it to interact with a spinning lathe in the VR world. It interacted very accurately and with minimal latency – the potential for this kind of AR integration is expansive.
Those same RealSense cameras and data is used to map the space around you, preventing you from running into things or people or cats in the room. This enables the first “multi-room” tracking capability, giving VR/AR users a new range of flexibility and usability.
Though I did not get hands on with the Alloy prototype itself, the unit on-stage looked pretty heavy, pretty bulky. Comfort will obviously be important for any kind of head mounted display, and Intel has plenty of time to iterate on the design for the next year to get it right. Both AMD and NVIDIA have been talking up the importance of GPU compute to provide high quality VR experiences, so Intel has an uphill battle to prove that its solution, without the need for external power or additional processing, can truly provide the untethered experience we all desire.
Subject: Processors | July 28, 2016 - 02:47 PM | Tim Verry
Tagged: kaby lake, Intel, gt3e, coffee lake, 14nm
Intel will allegedly be releasing another 14nm processor following Kaby Lake (which is itself a 14nm successor to Skylake) in 2018. The new processors are code named "Coffee Lake" and will be released alongside low power runs of 10nm Cannon Lake chips.
Not much information is known about Coffee Lake outside of leaked slides and rumors, but the first processors slated to launch in 2018 will be mainstream mobile chips that will come in U and HQ mobile flavors which are 15W to 28W and 35W to 45W TDP chips respectively. Of course, these processors will be built on a very mature 14nm process with the usual small performance and efficiency gains beyond Skylake and Kaby Lake. The chips should have a better graphics unit, but perhaps more interesting is that the slides suggest that Coffee Lake will be the first architecture where Intel will bring "hexacore" (6 core) processors into mainstream consumer chips! The HQ-class Coffee Lake processors will reportedly come in two, four, and six core variants with Intel GT3e class GPUs. Meanwhile the lower power U-class chips top out at dual cores with GT3e class graphics. This is interesting because Intel has previous held back the six core CPUs for its more expensive and higher margin HEDT and Xeon platforms.
Of course 2018 is also the year for Cannon Lake which would have been the "tock" in Intel's old tick-tock schedule (which is no more) as the chips will move to a smaller process node and then Intel would improve on the 10nm process from there in future architectures. Cannon Lake is supposed to be built on the tiny 10nm node, and it appears that the first chips on this node will be ultra low power versions for laptops and tablets. Occupying the ULV platform's U-class (15W) and Y-class (4.5W), Cannon Lake CPUs will be dual cores with GT2 graphics. These chips should sip power while giving comparable performance to Kaby and Coffee Lake perhaps even matching the performance of the Coffee Lake U processors!
Stay tuned to PC Perspective for more information!
Subject: Processors, Mobile | July 18, 2016 - 12:03 AM | Sebastian Peak
Tagged: softbank, SoC, smartphones, mobile cpu, Cortex-A73, ARM Holdings, arm, acquisition
ARM Holdings is to be aquired by SoftBank for $32 billion USD. This report has been confirmed by the Wall Street Journal, who states that an official annoucement of the deal is likely on Monday as "both companies’ boards have agreed to the deal".
(Image credit: director.co.uk)
"Japan’s SoftBank Group Corp. has reached a more than $32 billion deal to buy U.K.-based chip-designer ARM HoldingsPLC, marking a significant push for the Japanese telecommunications giant into the mobile internet, according to a person familiar with the situation." - WSJ
ARM just announced their newest CPU core, the Cortex-A73, at the end of May, with performance and efficiency improvements over the current Cortex-A72 promised with the new architecture.
(Image credit: AnandTech)
We will have to wait and see if this aquisition will have any bearing on future product development, though it seems the acquisition targets the significant intellectual property value of ARM, whose designs can be found in most smartphones.
Subject: Processors, Mobile | July 11, 2016 - 11:44 AM | Sebastian Peak
Tagged: SoC, Snapdragon 821, snapdragon, qualcomm, adreno 530
Announced today, the Snapdragon 821 offers a modest CPU frequency increase over the Snapdragon 820, with clock speeds of up to 2.4 GHz compared to 2.2 GHz with the Snapdragon 820. The new SoC is still implementing Qualcomm's custom quad-core "Kryo" design, which is made up of two pairs of dual-core CPU clusters.
"What isn’t in this announcement is that the power cluster will likely be above 2 GHz and GPU clocks look to be around 650 MHz but without knowing whether there are some changes other than clock relative to Adreno 530 we can’t really estimate the performance of this part."
Specifics on the Adreno GPU were not mentioned in the official announcement. The 650 MHz GPU clock reported by Anandtech would offer a modest improvement over the SD820's 624 MHz Adreno 530 GPU. Additionally, the "power cluster" will reportedly move from 1.6 GHz with the SD820 to 2.0 GHz with the SD821.
No telling when this updated SoC will find its way into consumer devices, with the Snapdragon 820 currently available in the Samsung Galaxy S7/S7 Edge, LG G5, OnePlus 3, and a few others.
Subject: Graphics Cards, Processors | June 29, 2016 - 07:27 AM | Sebastian Peak
Tagged: RX 490, radeon, processors, Polaris, graphics card, Bristol Ridge, APU, amd, A12-9800
AMD's current "We're in the Game" promotion offers a glimpse at upcoming product names, including the Radeon RX 490 graphics card, and the new Bristol Ridge APUs.
Visit AMD's gaming promo page and click the link to "check eligibility" to see the following list of products, which includes the new product names:
It seems safe to assume that the new products listed - including the Radeon RX 490 - are close to release, though details on the high-end Polaris GPU are not mentioned. We do have details on the upcoming Bristol Ridge products, with this in-depth preview from Josh published back in April. The A12-9800 and A12-9800E are said to be the flagship products in this new 7th-gen lineup, so there will be new desktop parts with improved graphics soon.
Subject: Processors | June 27, 2016 - 02:40 PM | Jeremy Hellstrom
Tagged: dx12, 6700k, Intel, i7-6950X
[H]ard|OCP has been conducting tests using a variety of CPUs to see how well DX12 distributes load between cores as compared to DX11. Their final article which covers the 6700K and 6950X was done a little differently and so cannot be directly compared to the previously tested CPUs. That does not lower the value of the testing, scaling is still very obvious and the new tests were designed to highlight more common usage scenarios for gamers. Read on to see how well, or how poorly, Ashes of the Singularity scales when using DX12.
"This is our fourth and last installment of looking at the new DX12 API and how it works with a game such as Ashes of the Singularity. We have looked at how DX12 is better at distributing workloads across multiple CPU cores than DX11 in AotS when not GPU bound. This time we compare the latest Intel processors in GPU bound workloads."
Here are some more Processor articles from around the web:
- Intel Skylake Graphics: Windows 10 vs. Ubuntu 16.04 + Latest Open-Source Driver Code @ Phoronix
- AMD Wraith Cooler Performance on FX-6350 Black Edition CPU @ Neoseeker
- Athlon X4 880K @ Hardware Secrets
- AMD Athlon X4 845 CPU Review @ OCC
Subject: Processors | June 24, 2016 - 11:15 PM | Scott Michaud
Tagged: Intel, kaby lake, iGPU, h.265, hevc, vp8, vp9, codec, codecs
Fudzilla isn't really talking about their sources, so it's difficult to gauge how confident we should be, but they claim to have information about the video codecs supported by Kaby Lake's iGPU. This update is supposed to include hardware support for HDR video, the Rec.2020 color gamut, and HDCP 2.2, because, if videos are pirated prior to their release date, the solution is clearly to punish your paying customers with restrictive, compatibility-breaking technology. Time-traveling pirates are the worst.
According to their report, Kaby Lake-S will support VP8, VP9, HEVC 8b, and HEVC 10b, both encode and decode. However, they then go on to say that 10-bit VP9 and 10-bit HEVC 10b does not include hardware encoding. I'm not too knowledgeable about video codecs, but I don't know of any benefits to encoding 8-bit HEVC Main 10. Perhaps someone in our comments can clarify.
Subject: Processors | June 21, 2016 - 10:00 PM | Scott Michaud
Update (June 22nd @ 12:36 AM): Errrr. Right. Accidentally referred to the CPU in terms of TFLOPs. That's incorrect -- it's not a floating-point decimal processor. Should be trillions of operations per second (teraops). Whoops! Also, it has a die area of 64sq.mm, compared to 520sq.mm of something like GF110.
So this is an interesting news post. Graduate students at UCDavis have designed and produced a thousand-core CPU at IBM's facilities. The processor is manufactured on their 32nm process, which is quite old -- about half-way between NVIDIA's Fermi and Kepler if viewed from a GPU perspective. Its die area is not listed, though, but we've reached out to their press contact for more information. The chip can be clocked up to 1.78 GHz, yielding 1.78 teraops of theoretical performance.
These numbers tell us quite a bit.
The first thing that stands out to me is that the processor is clocked at 1.78 GHz, has 1000 cores, and is rated at 1.78 teraops. This is interesting because modern GPUs (note that this is not a GPU -- more on that later) are rated at twice the clock rate times the number of cores. The factor of two comes in with fused multiply-add (FMA), a*b + c, which can be easily implemented as a single instruction and are widely used in real-world calculations. Two mathematical operations in a single instruction yields a theoretical max of 2 times clock times core count. Since this processor does not count the factor of two, it seems like its instruction set is massively reduced compared to commercial processors.
If they even cut out FMA, what else did they remove from the instruction set? This would at least partially explain why the CPU has such a high theoretical throughput per transistor compared to, say, NVIDIA's GF110, which has a slightly lower TFLOP rating with about five times the transistor count -- and that's ignoring all of the complexity-saving tricks that GPUs play, that this chip does not. Update (June 22nd @ 12:36 AM): Again, none of this makes sense, because it's not a floating-point processor.
"Big Fermi" uses 3 billion transistors to achieve 1.5 TFLOPs when operating on 32 pieces of data simultaneously (see below). This processor does 1.78 teraops with 0.621 billion transistors.
On the other hand, this chip is different from GPUs in that it doesn't use their complexity-saving tricks. GPUs save die space by tying multiple threads together and forcing them to behave in lockstep. On NVIDIA hardware, 32 instructions are bound into a “warp”. On AMD, 64 make up a “wavefront”. On Intel's Xeon Phi, AVX-512 packs 16, 32-bit instructions together into a vector and operates them at once. GPUs use this architecture because, if you have a really big workload, you, chances are, have very related tasks; neighbouring pixels on a screen will be operating on the same material with slightly offset geometry, multiple vertexes of the same object will be deformed by the same process, and so forth.
This processor, on the other hand, has a thousand cores that are independent. Again, this is wasteful for tasks that map easily to single-instruction-multiple-data (SIMD) architectures, but the reverse (not wasteful in highly parallel tasks that SIMD is wasteful on) is also true. SIMD makes an assumption about your data and tries to optimize how it maps to the real-world -- it's either a valid assumption, or it's not. If it isn't? A chip like this would have multi-fold performance benefits, FLOP for