** UPDATE 3/13 5 PM **
AMD has posted a follow-up statement that officially clears up much of the conjecture this article was attempting to clarify. Relevant points from their post that relate to this article as well as many of the requests for additional testing we have seen since its posting (emphasis mine):
"We have investigated reports alleging incorrect thread scheduling on the AMD Ryzen™ processor. Based on our findings, AMD believes that the Windows® 10 thread scheduler is operating properly for “Zen,” and we do not presently believe there is an issue with the scheduler adversely utilizing the logical and physical configurations of the architecture."
"Finally, we have reviewed the limited available evidence concerning performance deltas between Windows® 7 and Windows® 10 on the AMD Ryzen™ CPU. We do not believe there is an issue with scheduling differences between the two versions of Windows. Any differences in performance can be more likely attributed to software architecture differences between these OSes."
So there you have it, straight from the horse's mouth. AMD does not believe the problem lies within the Windows thread scheduler. SMT performance in gaming workloads was also addressed:
"Finally, we have investigated reports of instances where SMT is producing reduced performance in a handful of games. Based on our characterization of game workloads, it is our expectation that gaming applications should generally see a neutral/positive benefit from SMT. We see this neutral/positive behavior in a wide range of titles, including: Arma® 3, Battlefield™ 1, Mafia™ III, Watch Dogs™ 2, Sid Meier’s Civilization® VI, For Honor™, Hitman™, Mirror’s Edge™ Catalyst and The Division™. Independent 3rd-party analyses have corroborated these findings.
For the remaining outliers, AMD again sees multiple opportunities within the codebases of specific applications to improve how this software addresses the “Zen” architecture. We have already identified some simple changes that can improve a game’s understanding of the "Zen" core/cache topology, and we intend to provide a status update to the community when they are ready."
We are still digging into the observed differences of toggling SMT compared with disabling the second CCX, but it is good to see AMD issue a clarifying statement here for all of those out there observing and reporting on SMT-related performance deltas.
** END UPDATE **
Editor's Note: The testing you see here was a response to many days of comments and questions to our team on how and why AMD Ryzen processors are seeing performance gaps in 1080p gaming (and other scenarios) in comparison to Intel Core processors. Several outlets have posted that the culprit is the Windows 10 scheduler and its inability to properly allocate work across the logical vs. physical cores of the Zen architecture. As it turns out, we can prove that isn't the case at all. -Ryan Shrout
Initial reviews of AMD’s Ryzen CPU revealed a few inefficiencies in some situations, particularly in gaming workloads running at more common resolutions like 1080p, where the CPU becomes more of a bottleneck when coupled with modern GPUs. Lots of folks have theorized about what could be causing these issues, and the most recent attention appears to have been directed at the Windows 10 scheduler and its supposed inability to properly place threads on the Ryzen cores for the most efficient processing.
I typically have Task Manager open while running storage tests (they are boring to watch otherwise), and I naturally had it open during Ryzen platform storage testing. I’m accustomed to how the IO workers are distributed across reported threads, and in the case of SMT capable CPUs, distributed across cores. There is a clear difference when viewing our custom storage workloads with SMT on vs. off, and it was dead obvious to me that core loading was working as expected while I was testing Ryzen. I went back and pulled the actual thread/core loading data from my testing results to confirm:
The Windows scheduler has a habit of bouncing processes across available processor threads. This naturally happens as other processes share time with a particular core, with the heavier process not necessarily switching back to the same core. As you can see above, the single IO handler thread was spread across the first four cores during its run, but the Windows scheduler was always hitting just one of the two available SMT threads on any single core at one time.
My testing for Ryan’s Ryzen review consisted of only single-threaded workloads, but we can make things a bit clearer by loading down half of the CPU while toggling SMT off. We do this by increasing the worker count to 4, which is half of the 8 threads available on the Ryzen processor with SMT disabled in the motherboard BIOS.
SMT OFF, 8 cores, 4 workers
With SMT off, the scheduler is clearly not giving priority to any particular core and the work is spread throughout the physical cores in a fairly even fashion.
Now let’s try with SMT turned back on and doubling the number of IO workers to 8 to keep the CPU half loaded:
SMT ON, 16 (logical) cores, 8 workers
With SMT on, we see a very different result. The scheduler is clearly loading only one thread per core. This is only possible if Windows is aware of the 2-way SMT (two threads per core) configuration of the Ryzen processor. Do note that sometimes the workload will toggle around every few seconds, but the total loading on each physical core will still remain at ~50%. I chose a workload that saturated its thread just enough for Windows to not shift it around as it ran, making the above result even clearer.
Synthetic Testing Procedure
While the storage testing methods above provide a real-world example of the Windows 10 scheduler working as expected, we do have another workload that can help demonstrate core balancing with Intel Core and AMD Ryzen processors. A quick and simple custom-built C++ application can be used to generate generic worker threads and monitor for core collisions and resolutions.
This test app has a very straightforward workflow. Every few seconds it generates a new thread, capping at N/2 threads total, where N is equal to the reported number of logical cores. If the OS scheduler is working as expected, on a 16-thread processor it should load 8 threads across 8 physical cores, though which logical core of each physical core gets used will depend on very minute parameters and conditions in the OS background.
By monitoring the APIC_ID through the CPUID instruction, the first application thread monitors all threads and detects and reports on collisions - when a thread from our app is running on the same core as another thread from our app. That thread also reports when those collisions have been cleared. In an ideal and expected environment where Windows 10 knows the boundaries of physical and logical cores, you should never see more than one thread of a core loaded at the same time.
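The collision check described above can be sketched roughly as follows. This is not the source of our test app, only a minimal illustration: the function names `worker_cap`, `physical_core_of` and `find_collisions` are hypothetical, the logical core IDs are taken as already-sampled input instead of being read through CPUID, and the sketch assumes SMT siblings carry adjacent logical IDs (0/1, 2/3, and so on), which matches the LCore labeling used in this article but is not guaranteed on every platform.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper: the app caps its workers at N/2,
// where N is the reported number of logical cores.
std::size_t worker_cap(std::size_t logical_cores) {
    return logical_cores / 2;
}

// Assumption: SMT siblings have adjacent logical IDs (0/1, 2/3, ...),
// so integer division yields the physical core index.
std::size_t physical_core_of(std::size_t logical_core,
                             std::size_t threads_per_core) {
    assert(threads_per_core > 0);
    return logical_core / threads_per_core;
}

// Given the logical core each worker was last observed on, return the
// physical cores currently hosting more than one of our workers.
std::vector<std::size_t> find_collisions(
        const std::vector<std::size_t>& worker_logical_cores,
        std::size_t threads_per_core = 2) {
    std::map<std::size_t, std::size_t> workers_per_core;
    for (std::size_t logical : worker_logical_cores) {
        ++workers_per_core[physical_core_of(logical, threads_per_core)];
    }
    std::vector<std::size_t> collided;  // ascending order, since std::map is sorted
    for (const auto& entry : workers_per_core) {
        if (entry.second > 1) {
            collided.push_back(entry.first);
        }
    }
    return collided;
}
```

With 8 workers observed on logical cores 0, 2, 5, 7, 8, 11, 12 and 14, every worker lands on a different physical core and the returned list is empty. Workers seen on logical cores 0 and 1 at the same moment would be reported as a collision on physical core 0.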
This screenshot shows our app working on the left and the Windows Task Manager on the right with logical cores labeled. While it may look like all logical cores are being utilized at the same time, in fact they are not. At any given point, only one of LCore 0 or LCore 1 is actively processing a thread. Need proof? Check out the modified view of the task manager where I copy the graph of LCore 1/5/9/13 over the graph of LCore 0/4/8/12 with inverted colors to aid viewability.
If you look closely, by overlapping the graphs in this way, you can see that the threads migrate from LCore 0 to LCore 1, LCore 4 to LCore 5, and so on. The graphs intersect and fill in to consume ~100% of the physical core. This pattern is repeated for the other 8 logical cores on the right two columns as well.
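The graph overlay amounts to adding up the utilization of each pair of sibling logical cores. A minimal sketch of that idea, assuming per-logical-core utilization samples in percent and adjacent sibling numbering (the function name `physical_core_load` is hypothetical and not part of any tool used here):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum per-logical-core utilization (percent) over each group of SMT siblings.
// Assumption: siblings are adjacent in the input (LCore 0/1, LCore 2/3, ...).
std::vector<double> physical_core_load(const std::vector<double>& logical_load,
                                       std::size_t threads_per_core = 2) {
    assert(threads_per_core > 0);
    assert(logical_load.size() % threads_per_core == 0);
    std::vector<double> combined(logical_load.size() / threads_per_core, 0.0);
    for (std::size_t i = 0; i < logical_load.size(); ++i) {
        combined[i / threads_per_core] += logical_load[i];
    }
    return combined;
}
```

If a thread spends 70% of a sampling window on LCore 0 and the remaining 30% on LCore 1, the combined figure for that physical core comes out at 100%, which is the filled-in pattern visible in the overlaid graphs.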
Running the same application on a Core i7-5960X Haswell-E 8-core processor shows a very similar behavior.
Each pair of logical cores shares a single thread and when thread transitions occur away from LCore N, they migrate perfectly to LCore N+1. It does appear that in this scenario the Intel system is showing a more stable threaded distribution than the Ryzen system. While that may in fact incur some performance advantage for the 5960X configuration, the penalty for intra-core thread migration is expected to be very minute.
The fact that Windows 10 is balancing the 8 thread load specifically between matching logical core pairs indicates that the operating system is perfectly aware of the processor topology and is selecting distinct cores first to complete the work.
Information from this custom application, along with the storage performance tool example above, clearly shows that Windows 10 is attempting to balance work on Ryzen between cores in the same manner that we have experienced with Intel and its HyperThreaded processors for many years.
Subject: Processors | June 3, 2016 - 04:55 PM | Jeremy Hellstrom
Tagged: X99, video, Intel, i7-6950X, core i7, Core, Broadwell-E, Broadwell
You have seen our take on the impressively powerful and extremely expensive i7-6950X, but of course we were not the only ones to test out Intel's new top of the line processor. Hardware Canucks focused on the difference between the ~$1700 i7-6950X and the ~$1100 i7-6900K. From synthetic benchmarks such as AIDA through gaming at 720p and 1080p, they tested the two processors against each other to see when it would make sense to spend the extra money on the new Broadwell-E chip. Check out what they thought of the chip overall as well as the scenarios where they felt it would be fully utilized.
"10 cores, 20 threads, over $1700; Intel's Broadwell-E i7-6950X delivers obscene performance at an eye-watering price. Then there's the i7-6900K which boasts all the same niceties in a more affordable package."
Here are some more Processor articles from around the web:
- Intel Core I7 6950X Extreme Edition Broadwell-E CPU Review @ OCC
- Intel i7-6900K @ Hardwareheaven
- Intel i7-6950X @ Overclockers.com
- Intel Core i7 6950X @ Kitguru
- AMD Athlon X4 845 CPU Review @ Neoseeker
- AMD A10-7860K 65W APU @ techPowerUp
- AMD A10-7890K APU Review @ Neoseeker
It has been nearly two years since the release of the Haswell-E platform, which began with the launch of the Core i7-5960X processor. Back then, the introduction of an 8-core consumer processor was the primary selling point, along with the new X99 chipset and DDR4 memory support. At the time, I heralded the processor as “easily the fastest consumer processor we have ever had in our hands” and “nearly impossible to beat.” So what has changed over the course of 24 months?
Today Intel is launching Broadwell-E, the follow up to Haswell-E, and things look very much the same as they did before. There are definitely a couple of changes worth noting and discussing, including the move to a 10-core processor option as well as Turbo Boost Max Technology 3.0, which is significantly more interesting than its marketing name implies. Intel is sticking with the X99 platform (good for users that might want to upgrade), though the cost of these new processors is more than slightly disappointing based on trends elsewhere in the market.
This review of the new Core i7-6950X 10-core Broadwell-E processor is going to be quick, and to the point: what changes, what is the performance, how does it overclock, and what will it cost you?
That is a lotta SKUs!
The slow, gradual release of information about Intel's Skylake-based product portfolio continues forward. We have already tested and benchmarked the desktop variant flagship Core i7-6700K processor and also have a better understanding of the microarchitectural changes the new design brings forth. But today Intel's 6th Generation Core processors get a major reveal, with all the mobile and desktop CPU variants from 4.5 watts up to 91 watts, getting detailed specifications. Not only that, but it also marks the first day that vendors can announce and begin selling Skylake-based notebooks and systems!
All indications are that vendors like Dell, Lenovo and ASUS are still some weeks away from having any product available, but expect to see your feeds and favorite tech sites flooded with new product announcements. And of course with a new Apple event coming up soon...there should be Skylake in the new MacBooks this month.
Since I have already talked about the architecture and the performance changes from Haswell/Broadwell to Skylake in our 6700K story, today's release is just a bucket of specifications and information surrounding 46 different 6th Generation Skylake processors.
Intel's 6th Generation Core Processors
At Intel's Developer Forum in August, the media learned quite a bit about the new 6th Generation Core processor family including Intel's stance on how Skylake changes the mobile landscape.
Skylake is being broken up into 4 different lines of Intel processors: S-series for desktop DIY users, H-series for mobile gaming machines, U-series for your everyday Ultrabooks and all-in-ones, and Y-series for tablets and 2-in-1 detachables. (Side note: Intel does not reference an "Ultrabook" anymore. Huh.)
As you would expect, Intel has some impressive gains to claim with the new 6th Generation processor. However, it is important to put them in context. All of the claims above, including 2.5x performance, 30x graphics improvement and 3x longer battery life, are comparing Skylake-based products to CPUs from 5 years ago. Specifically, Intel is comparing the new Core i5-6200U (a 15 watt part) against the Core i5-520UM (an 18 watt part) from mid-2010.
Going Beyond the Reference GTX 970
Zotac has been an interesting company to watch for the past few years. It has made a name for itself in the small form factor community with some really interesting designs and products. The company continues down that path, but it has increasingly focused on high quality graphics cards that address a pretty wide market. It provides unique products from the $40 level up through the latest GTX 980 Ti with hybrid water and air cooling for $770. The company used to focus on reference designs, but some years ago it widened its appeal by applying its own design decisions to the latest NVIDIA products.
Catchy looking boxes for people who mostly order online! Still, nice design.
The beginning of this year saw Zotac introduce their latest “Core” brand products that aim to provide high end features to more modestly priced parts. The Core series makes some compromises to hit price points that are more desirable for a larger swath of consumers. The cards often rely on more reference style PCBs with good quality components and advanced cooling solutions. This equation has been used before, but Zotac is treading some new ground by offering very highly clocked cards right out of the box.
Overall Zotac has a very positive reputation in the industry for quality and support.
Plenty of padding in the box to protect your latest investment.
Zotac GTX 970 AMP! Extreme Core Edition
The product we are looking at today is the somewhat long-named AMP! Extreme Core Edition. This is based on the NVIDIA GTX 970 chip which features 56 ROPs, 1.75 MB of L2 cache, and 1664 CUDA Cores. The GTX 970 has of course been scrutinized heavily due to the unique nature of its memory subsystem. While it does physically have a 256-bit bus, the last 512 MB (out of 4 GB) is addressed by a significantly slower unit due to shared memory controller capacity. In theory the card reference design supports up to 224 GB/sec of memory bandwidth. There are obviously some very unhappy people out there about this situation, but much of this could have been avoided if NVIDIA had disclosed the exact nature of the GTX 970 configuration.
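For reference, the 224 GB/sec theoretical figure follows directly from the bus width and the memory data rate. The quick sketch below shows the arithmetic. The 7 Gbps effective GDDR5 data rate is an assumption taken from the reference GTX 970 specification, not something measured in this review, and the function name `peak_bandwidth_gbs` is purely illustrative.

```cpp
#include <cassert>

// Peak memory bandwidth in GB/s:
// (bus width in bits / 8 bits per byte) * effective per-pin data rate in Gbps.
double peak_bandwidth_gbs(int bus_width_bits, double data_rate_gbps) {
    assert(bus_width_bits > 0 && data_rate_gbps > 0.0);
    return bus_width_bits / 8.0 * data_rate_gbps;
}
```

Calling `peak_bandwidth_gbs(256, 7.0)` gives 224, which lines up with the reference number. Keep in mind this is a best-case figure for the full 256-bit bus and does not account for the slower 512 MB segment.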
Subject: Processors | September 5, 2011 - 09:52 PM | Tim Verry
Tagged: sandy bridge, pentium, Intel, cpu, Core, celeron, 32nm
Intel today released a price list which included 16 new 32nm processors. The new additions fill in gaps in the Celeron, Pentium, and Core product lines, and are further broken down into the desktop and mobile camps. On the desktop front, there are four Celeron models ranging from $37 to $52, three Pentium models ranging from $70 to $86, and four new Core i series processors ranging from $127 to $177. Within that range, there are three hyper-threaded dual core Core i3 parts and one quad core Core i5 processor.
The mobile additions include one low end and four high end models. On the low end is the dual core Celeron B840 at 1.9 GHz with 2 MB L3 cache and 35W TDP. On the high end are four Core i7 chips. The Core i7 2640M is a $346 part and is a hyper-threaded dual core chip at 2.8 GHz, 4 MB L3 cache, and 35W TDP. The Core i7 2760QM is a hyper-threaded quad core part at 2.4 GHz, 6 MB L3 cache, and a 45W TDP. As another 45W TDP part, the Core i7 2860QM is also a hyper-threaded quad core at 2.5 GHz with 8 MB L3 cache. The highest end mobile chip addition is the Core i7 2960XM, which is a hyper-threaded quad core at 2.7 GHz, a 55W TDP, and 8 MB of L3 cache.
As you can see, there are quite a few new additions filling out the product lineup at various price points and performance segments. See the chart below for the full list and specs.
|Processor||Clock Speed||Cores/Threads||L3 Cache||TDP||Price|
|Core i5-2320||3.0 GHz||4/4||6MB||95W||$177|
|Core i3-2130||3.4 GHz||2/4||3MB||65W||$138|
|Core i3-2125||3.3 GHz||2/4||3MB||65W||$134|
|Core i3-2120T||2.6 GHz||2/4||3MB||35W||$127|
|Pentium G860||3.0 GHz||2/2||3MB||65W||$86|
|Pentium G630||2.7 GHz||2/2||3MB||65W||$75|
|Pentium G630T||2.3 GHz||2/2||3MB||35W||$70|
|Celeron G540||2.5 GHz||2/2||2MB||65W||$52|
|Celeron G530T||2.0 GHz||2/2||2MB||35W||$47|
|Celeron G530||2.4 GHz||2/2||2MB||65W||$42|
|Celeron G440||1.6 GHz||1/1||1MB||35W||$37|
|Core i7-2960XM||2.7 GHz||4/8||8MB||55W||$1,096|
|Core i7-2860QM||2.5 GHz||4/8||8MB||45W||$568|
|Core i7-2760QM||2.4 GHz||4/8||6MB||45W||$378|
|Core i7-2640M||2.8 GHz||2/4||4MB||35W||$346|
|Celeron B840||1.9 GHz||2/2||2MB||35W||$86|