Feedback

AMD Ryzen and the Windows 10 Scheduler - No Silver Bullet

Subject: Processors
Manufacturer: AMD

** UPDATE 3/13 5 PM **

AMD has posted a follow-up statement that officially clears up much of the conjecture this article was attempting to address. Relevant points from their post that relate to this article, as well as to many of the requests for additional testing we have seen since its posting (emphasis mine):

  • "We have investigated reports alleging incorrect thread scheduling on the AMD Ryzen™ processor. Based on our findings, AMD believes that the Windows® 10 thread scheduler is operating properly for “Zen,” and we do not presently believe there is an issue with the scheduler adversely utilizing the logical and physical configurations of the architecture."

  • "Finally, we have reviewed the limited available evidence concerning performance deltas between Windows® 7 and Windows® 10 on the AMD Ryzen™ CPU. We do not believe there is an issue with scheduling differences between the two versions of Windows.  Any differences in performance can be more likely attributed to software architecture differences between these OSes."

So there you have it, straight from the horse's mouth. AMD does not believe the problem lies within the Windows thread scheduler. SMT performance in gaming workloads was also addressed:

  • "Finally, we have investigated reports of instances where SMT is producing reduced performance in a handful of games. Based on our characterization of game workloads, it is our expectation that gaming applications should generally see a neutral/positive benefit from SMT. We see this neutral/positive behavior in a wide range of titles, including: Arma® 3, Battlefield™ 1, Mafia™ III, Watch Dogs™ 2, Sid Meier’s Civilization® VI, For Honor™, Hitman™, Mirror’s Edge™ Catalyst and The Division™. Independent 3rd-party analyses have corroborated these findings.

    For the remaining outliers, AMD again sees multiple opportunities within the codebases of specific applications to improve how this software addresses the “Zen” architecture. We have already identified some simple changes that can improve a game’s understanding of the "Zen" core/cache topology, and we intend to provide a status update to the community when they are ready."

We are still digging into the observed differences between toggling SMT and disabling the second CCX, but it is good to see AMD issue a clarifying statement here for all of those observing and reporting on SMT-related performance deltas.

** END UPDATE **

Editor's Note: The testing you see here was a response to many days of comments and questions to our team on how and why AMD Ryzen processors are seeing performance gaps in 1080p gaming (and other scenarios) in comparison to Intel Core processors. Several outlets have posted that the culprit is the Windows 10 scheduler and its inability to properly allocate work across the logical vs. physical cores of the Zen architecture. As it turns out, we can prove that isn't the case at all. -Ryan Shrout

Initial reviews of AMD’s Ryzen CPU revealed a few inefficiencies in some situations, particularly in gaming workloads running at common resolutions like 1080p, where the CPU becomes more of a bottleneck when paired with modern GPUs. Many have theorized about what could be causing these issues, and the most recent attention has been directed at the Windows 10 scheduler and its supposed inability to properly place threads on Ryzen cores for the most efficient processing.

I typically have Task Manager open while running storage tests (they are boring to watch otherwise), and I naturally had it open during Ryzen platform storage testing. I’m accustomed to how the IO workers are distributed across reported threads, and in the case of SMT capable CPUs, distributed across cores. There is a clear difference when viewing our custom storage workloads with SMT on vs. off, and it was dead obvious to me that core loading was working as expected while I was testing Ryzen. I went back and pulled the actual thread/core loading data from my testing results to confirm:


The Windows scheduler has a habit of bouncing processes across available processor threads. This naturally happens as other processes share time with a particular core, with the heavier process not necessarily switching back to the same core. As you can see above, the single IO handler thread was spread across the first four cores during its run, but the Windows scheduler was always hitting just one of the two available SMT threads on any single core at one time.

My testing for Ryan’s Ryzen review consisted of only single threaded workloads, but we can make things a bit clearer by loading down half of the CPU while toggling SMT off. We do this by raising the worker count to 4, half of the 8 threads available with SMT disabled in the motherboard BIOS.


SMT OFF, 8 cores, 4 workers

With SMT off, the scheduler is clearly not giving priority to any particular core and the work is spread throughout the physical cores in a fairly even fashion.

Now let’s try with SMT turned back on and doubling the number of IO workers to 8 to keep the CPU half loaded:


SMT ON, 16 (logical) cores, 8 workers

With SMT on, we see a very different result. The scheduler is clearly loading only one thread per core. This could only be possible if Windows was aware of the 2-way SMT (two threads per core) configuration of the Ryzen processor. Do note that the workload will sometimes shift around every few seconds, but the total loading on each physical core still remains at ~50%. I chose a workload that saturated its thread just enough for Windows to not shift it around as it ran, making the above result even clearer.

Synthetic Testing Procedure

While the storage testing methods above provide a real-world example of the Windows 10 scheduler working as expected, we do have another workload that can help demonstrate core balancing with Intel Core and AMD Ryzen processors. A quick and simple custom-built C++ application can be used to generate generic worker threads and monitor for core collisions and resolutions.

This test app has a very straightforward workflow. Every few seconds it spawns a new thread, capping at N/2 threads total, where N is the reported number of logical cores. If the OS scheduler is working as expected, it should load 8 threads across 8 physical cores, though which logical core is used within each physical core will depend on minute background conditions in the OS.

By monitoring the APIC_ID through the CPUID instruction, the first application thread monitors all threads and detects and reports on collisions - when a thread from our app is running on the same core as another thread from our app. That thread also reports when those collisions have been cleared. In an ideal and expected environment where Windows 10 knows the boundaries of physical and logical cores, you should never see more than one thread of a core loaded at the same time.


This screenshot shows our app working on the left and the Windows Task Manager on the right with logical cores labeled. While it may look like all logical cores are being utilized at the same time, in fact they are not. At any given point, only one of each logical core pair (LCore 0 or LCore 1, for example) is actively processing a thread. Need proof? Check out the modified view of the task manager where I copied the graph of LCore 1/5/9/13 over the graph of LCore 0/4/8/12 with inverted colors to aid visibility.


If you look closely, by overlapping the graphs in this way, you can see that the threads migrate from LCore 0 to LCore 1, LCore 4 to LCore 5, and so on. The graphs intersect and fill in to consume ~100% of the physical core. This pattern is repeated for the other 8 logical cores on the right two columns as well. 

Running the same application on a Core i7-5960X Haswell-E 8-core processor shows very similar behavior.


Each pair of logical cores shares a single thread, and when thread transitions occur away from LCore N, they migrate perfectly to LCore N+1. It does appear that the Intel system shows a more stable thread distribution than the Ryzen system in this scenario. While that may confer some performance advantage on the 5960X configuration, the penalty for intra-core thread migration is expected to be very small.

The fact that Windows 10 is balancing the 8 thread load specifically between matching logical core pairs indicates that the operating system is perfectly aware of the processor topology and is selecting distinct cores first to complete the work.

Information from this custom application, along with the storage performance example above, clearly shows that Windows 10 is attempting to balance work on Ryzen between cores in the same manner we have seen with Intel and its HyperThreaded processors for many years.


Pinging Cores

One potential pitfall of this testing process would arise if Windows were not enumerating the logical cores correctly. What if, in our Task Manager graphs above, Windows 10 was accidentally grouping logical cores from different physical cores together? If that were the case, Windows would be hurting performance by thinking it was moving threads between logical cores on the same physical core when it was actually moving them between physical cores.

To answer that question we went with another custom-written C++ application with a very simple premise: ping threads between cores. If we pass a message directly between each pair of logical cores and measure the time it takes to get there, we can confirm Windows' core enumeration. Passing data between two threads on the same physical core should produce the fastest result, as they share a local cache. Threads running on the same package (as all threads on these processors technically are) should be slightly slower, as they must communicate through a globally shared cache. Finally, a multi-socket configuration would be slower still, as it must communicate through memory or a socket interconnect.

Let's look at a complicated chart:


What we are looking at above is how long a one-way ping takes to travel from one logical core to another. The line riding around 76 ns indicates how long these pings take when they must travel to another physical core. Pings that stay within the same physical core take a much shorter 14 ns to complete. The above example was run on a 5960X and confirms that threads 0 and 1 are on the same physical core, threads 2 and 3 are on the same physical core, and so on.

Now let's take a look at Ryzen on the same scale:


There's another layer of latency there, but let us focus on the bottom of the chart first and note that the relative locations of the colored plot lines are arranged identically to that of the Intel CPU. This tells us that logical cores within physical cores are being enumerated correctly ({0,1}, {2,3}, etc.). That's the bit of information we were after and it validates that Windows 10 is correctly enumerating the core structure of Ryzen and thus the scheduling comparisons we made above are 100% accurate. Windows 10 does not have a scheduling conflict on Ryzen processors.

But some other important differences stand out here. Pings within the same physical core come out to 26 ns, and pings to adjacent physical cores are in the 42 ns range (lower than Intel, which is good), but that is not the whole story. Ryzen is subdivided by what is called a "Core Complex", or CCX for short. Each CCX contains four physical Zen cores, and the CCXs communicate through what AMD calls Infinity Fabric. That piece of information should click with the above chart, as hopping across CCXs appears to cost another ~100 ns of latency, bringing the total to 142 ns for those cases.

While it was not our reason for performing this test, the results may provide a possible explanation for the relatively poor performance seen in some gaming workloads. Multithreaded media encoding and tests like Cinebench segment chunks of the workload across multiple threads. There is little inter-thread communication necessary as each chunk is sent back to a coordination thread upon completion. Games (and some other workloads we assume) are a different story as their threads are sharing a lot of actively changing data, and a game that does this heavily might incur some penalty if a lot of those communications ended up crossing between CCX modules. We do not yet know the exact impact this could have on any specific game, but we do know that communicating across Ryzen cores on different CCX modules takes twice as long as Intel's inter-core communication as seen in the examples above, and 2x the latency of anything is bound to have an impact.

Some of you may believe the Windows scheduler could be optimized to address this, perhaps by keeping a process's threads on one CCX wherever possible. Well, in the testing we did, that was already happening. Here is the SMT ON result for a lighter (13%) workload using two threads:


See what's going on there? The Windows scheduler was already keeping those threads within the same CCX. This was repeatable (some runs were on the other CCX) and did not appear to be coincidental. Further, the example shown in the first (bar) chart demonstrated a workload spread across the four cores in CCX 0.

Closing Thoughts

What began as a simple internal discussion about the validity of claims that Windows 10 scheduling might be to blame for some of Ryzen's performance oddities, and that an update from Microsoft and AMD might magically save us all, turned into a full day of work with many people chipping in to put together a great story. The team at PC Perspective believes strongly that the Windows 10 scheduler is not improperly assigning workloads to Ryzen processors due to a lack of knowledge about the structure of the CPU.

In fact, though we are waiting for official comments we can attribute from AMD on the matter, I have been told by knowledgeable individuals inside the company that even AMD does not believe the Windows 10 scheduler has anything at all to do with the problems they are investigating in gaming performance.

In the process, we did find a new source of information in our latency testing tool that clearly differentiates Intel's architecture from AMD's Zen architecture for core-to-core communication. In this respect at least, the CCX design of 8-core Ryzen CPUs more closely resembles a 2-socket system. With that in mind, Windows could logically split the CCX modules into Non-Uniform Memory Access (NUMA) nodes, but that would force everything not specifically coded to span NUMA nodes (all games, some media encoders, etc.) to use only half of Ryzen. How does this new information affect our expectations for something like Naples, which will depend on Infinity Fabric even more directly for AMD's enterprise play?

There is still much to learn and more to investigate as we find the secrets that this new AMD architecture has in store for us. We welcome your discussion, comments, and questions below!


March 10, 2017 | 11:52 PM - Posted by Anonymous (not verified)

When the Ryzen CPU is run with Linx it runs better from what I have seen, but it's very interesting

March 10, 2017 | 11:55 PM - Posted by Anonymous (not verified)

Does it really surprise you that Linux has vastly superior multi-core performance and is better optimized than Windows in almost every imaginable task?

March 11, 2017 | 06:42 AM - Posted by Peter (not verified)

It's not. 99% of programs and games for PC are optimized purely for Windows 7, 8.1 and 10.

March 11, 2017 | 11:43 AM - Posted by Alex Lustenberg

Would love to see those statistical sources cited, or were you being hyperbolic?

Cross platform support has gotten considerably better over the past few years, and I say that from running Linux on the desktop for almost 20 years.  Also consider what the most used applications on most computers are?  The browser.  Of which the three heavyweights (Chrome, Mozilla/Firefox and Safari/Webkit) have been cross compiled and optimized across all three major platforms (Windows, OSX, Linux) for a very long time.

I would personally love to see how things cross bench between Windows and Linux, for both Intel and AMD based system, but we have never really scoped out the work needed to port over our various benchmark infrastructure bits.

March 16, 2017 | 04:37 AM - Posted by close (not verified)

Why aren't you curious about figures from the guy above generically stating that Linux is vastly superior in every task (it's not)? Especially since you admit you haven't seen any comparative benchmarks. You seem biased and admit to basing your statements on nothing but personal assumptions.
I'd love to see the sources you're citing. I mean that guy is just bringing plastic knives to a gun fight (the kind of guy who goes into dick measuring contest on the web and loses a lot) but I see you're trying to make a point. Why not support it the same way you ask others to?

There are plenty of comparative tests between Linux and Windows contradicting your expectations and regardless of what you might want to think the fact that Windows has ~45 times the market share Linux has (~90% vs. ~2%) makes software developers "somewhat" inclined to optimize for Windows and not spend too much time optimizing for Linux.

So while there's no intrinsic reason why any one OS should be better, popularity makes putting effort into just one of them more practical.

March 11, 2017 | 12:37 PM - Posted by Mike S. (not verified)

Nonsense. I'm posting this from Linux, and in many benchmarks using the same processor Windows blows it away: browsers and gaming in particular.

Outside graphical tasks, Linux tends to do as well or better. Media encoding, web servers, etc...

But your "vastly superior" and "better optimized" assertions don't hold up. We're not going to win people to the free software / open source side by lying.

March 11, 2017 | 01:26 PM - Posted by Anonymous (not verified)

Browsers (the ones that are more popular/widespread) and games are optimized mostly for Windows, while some Linux OS builds are better at server/HPC workloads. So yes, people must take into account that there is essentially a single Windows OS kernel/application API build with minor feature differences (Home, Pro, etc.), while there are plenty of different Linux OS/API builds running atop the Linux kernel.

Steam OS would be the Linux kernel build (Debian-based, with API/userspace tweaks for gaming) to test against Windows, and Windows uses a proprietary device driver model and graphics API with who-knows-what hidden features relative to any Linux device driver model and the OpenGL and Vulkan graphics APIs. If the games/gaming-engine folks along with Valve continue their development path under the Vulkan API (very similar to DX12, and not by accident this time around) then that focus will improve with time.

Windows also has the advantage of a lot of OEM UEFI firmware support (laptops in particular) via OEM/Windows bundling arrangements and support from the Windows OS maker that Linux does not have in the consumer market. This UEFI firmware influence from the Windows OS maker over the OEM market has had, and can continue to have, an adverse effect on Linux usability in the consumer market, among other possible engineered performance detriments in the UEFI firmware on many consumer devices.

March 12, 2017 | 12:10 PM - Posted by Mike S. (not verified)

Aren't you just supplying evidence to bolster my point? To be clear, I'm not saying Windows is fundamentally more efficient than Linux on the graphics side. It isn't.

But for right now, device driver software, browser software, and 3D-related software that exists for Windows with respect to graphics and gaming is superior to what is available for Linux. The performance gap is narrowing over time and I'm hopeful that it will disappear or Linux will lead within two or three years.

But today, Windows has the edge and it's dishonest to assert otherwise.

March 12, 2017 | 03:21 PM - Posted by Anonymous (not verified)

Windows has the edge because the Fix was/is in, and M$ is an abusive Monopoly intrest. I'm looking for there to be more investment in Linux for the Desktop/laptop market and I do not care about any M$ performance advantage in gaming only workloads. I want a Ryzen/Vega APU based Linux OS OEM made laptop or I'll never have reason to buy a laptop ever again! I'll not have windows 10 on any of my PCs/laptops. It's Linux Vulkan for me even if M$ retains a gaming performance advantage! What kind of basement dwelling gaming GIT fool do you take me for!

March 12, 2017 | 10:23 PM - Posted by Allyn Malventano

Dare to dream, but you're in the (1.55%) minority.

March 13, 2017 | 10:29 AM - Posted by Anonymous (not verified)

Not really, a quick check of some legal/trade issues regarding M$ says otherwise, and 1.55%(?) A few percentage points may be all that is necessary for maybe some Linux Options, though 5-10% would be very healthy for a market share for a more sustainable ecosystem.

It's not an unobtainable dream for even at your low estimate of 1.55% as there is a Linux OEM laptop market to speak of so avoiding M$ is an option, but avoiding M$, Intel, and Nvidia would be an even better option for an affordable Linux OEM based laptop!

March 13, 2017 | 03:09 PM - Posted by Neez (not verified)

I'll never buy another laptop again. I do all my work on a desktop, and use a 10" tablet to watch movies and netflix or whatever when i'm traveling. Also for light web surfing. I have no reason to buy another laptop. With smartphones getting bigger and bigger, and tablets getting cheaper, as well as microsoft moving into the tablet arena. I don't think many people are going to own an actual laptop in the future. Maybe a tablet with a keyboard case.

March 15, 2017 | 03:55 AM - Posted by Anonymous (not verified)

closer to 40%, unless you think android is running a different kernel

August 1, 2017 | 01:45 PM - Posted by Jeremy Collake (not verified)

Neither OS is perfect. You have Linux, which some distributions finally allow non-pure-FOSS drivers in, and that does make a huge difference.

Why these device drivers are so protected I am not sure, but no good reason other than "it's mine and I don't want to share with any competitor".

This applies not only to graphics, but all the way down to WiFi adaptors.

You can't get the same performance from Linux using a pure-FOSS WiFi driver as you can using a (gasp!) closed-source Linux WiFi driver. At least the latter is available, BUT it is unfortunately also bound to a particular kernel version, making its use limited.

Windows has the clear advantage with hardware maker and device drivers, which mean a lot. It ALSO has other advantages that Linux hasn't adopted, and vice-versa. Remember, Linux is still building on code decades older than Windows NT.

I think with Windows 10, rolling updates, and the very smart integration of Bash, Microsoft has made the turn after whatever Win8 was. I mean, we all called that. Some boardroom fools thought the PC was dying - as-if. At least Win8 didn't last long.

My perspective? Some code should be open source, some closed source. Open source:

+ Cryptographic libraries
+ Operating System Kernels (excludes Windows for the moment)
+ Any 'cloud' code that parses, modifies, or passes along data - e.g. CloudBleed, a classic buffer overrun that a Google engineer had to tell Cloudflare about. THAT code should be open source since it is parsing so many millions of pages a day. It NEEDS peer review (clearly).
+ Any code related to healthcare devices
+ All banking code (cryptocurrencies helping with that)

Closed source:
+ Applications that need a profit to be maintained. Not everyone can work for free! And, in Windows, it is actually quite expensive to be a developer (code signing, MSDN, etc..). Why these? So some guy doesn't come and fork it and claim it as his own. That's all. Portions can be open-sourced.

We really MUST get past this barrier of 'All software should be free' or 'All open source software should be free'. Neither statement is true, yet held by a large percentage of techies.

March 11, 2017 | 12:49 PM - Posted by Anonymous (not verified)

Microsoft disagrees with Ryan and has tweeted there is an issue, as Windows 10 doesn't recognize SMT correctly.

March 11, 2017 | 02:39 PM - Posted by Anonymous (not verified)

Where's that tweet?

March 11, 2017 | 03:27 PM - Posted by Anonymous (not verified)

"may be a problem, we are investigating"
does not mean there is an issue.

Learn how to read retard.
You and the blind AMD fanboys are raging over a fucking tweet that doesn't even mean what you think it means. Stop spreading bullshit.

March 13, 2017 | 02:44 PM - Posted by Anonymous (not verified)

retard, fanboy, blind, raging and multiple curse words. You are an intelligent individual, aren't you.. we need more of you to protect us from reasonable discussions.

August 1, 2017 | 01:51 PM - Posted by Jeremy Collake (not verified)

I will say that I have noticed some SMT discrepancies that I am investigating. So, it would not surprise me for Microsoft to report such.

March 11, 2017 | 12:00 AM - Posted by willmore

Does your BIOS have a setting to limit the number of cores? Can you set it to use just one of the complexes?

The last graph shows that Windows is smart enough to keep one thread bouncing around within a CCX, but what if you have more threads? Is it smart enough to keep all threads in one process on a CCX?

That seems the next logical thing to try.

Good investigation.

March 11, 2017 | 12:09 AM - Posted by Anonymous (not verified)

Exactly. When you test, you're setting up a limited scenario.

What happens in games is entirely different. Windows 10 is not smart enough to keep the threads within a CCX cluster.

Watch this, it shows it really well:

https://www.youtube.com/watch?v=40h4skxDkh4

March 11, 2017 | 12:48 PM - Posted by Thexder (not verified)

Yeah, the games may not get any speed increases from more than 4 cores, but it is entirely possible that they are still creating more than 4 threads which can then end up on different CCX. That could be fixed with scheduler adjustments, or there may be other ways to prevent that from happening. Not sure if that is what is happening here, but based on my experience it seems very likely.

March 11, 2017 | 12:04 AM - Posted by Anonymous (not verified)

I've been trying to say this to everyone I've been talking to about Ryzen on forums etc. I have an 1800x and have been using Process Lasso to mess around with core affinity. I think the issues related to the IMC and RAM subtimings have everything to do with the gaming performance issues. The whole platform is new and all of the parts don't quite play nice together yet.

March 11, 2017 | 12:55 AM - Posted by Ryan Shrout

Very cool tool - had never seen it. Going to give it a shot.

March 11, 2017 | 08:45 AM - Posted by Dresdenboy (not verified)

Here's another one being mentioned in the forums:
http://processhacker.sourceforge.net/

March 11, 2017 | 12:32 PM - Posted by Roy Quader (not verified)

Yeah, it's great. I just purchased a license yesterday (even though you get all features for free). I'm the one who posted as Anonymous up above. I heard about Process Lasso in the thread w/ a detailed technical breakdown of Ryzen by The Stilt. I picked up their core unparking app as well. Really good stuff!

The CCX and thread scheduling theory really got blown out of proportion and everyone started talking about it as if it was a fact. I expect that Ryzen WILL improve, but it will be due to a combination of things: better RAM timings via MB manufacturer access to subtimings, microcode updates, bios fixes for voltage and power delivery, minor Windows/driver patches, and of course software optimizations. Everyone needs to chill out and wait a bit.

I assume that AMD put a tight timeline on this release to keep investors happy and to have some time between Ryzen's launch and Vega's launch. Considering what AMD is up against, the fact that they are fabless, and the fact that they haven't launched a competitive CPU in basically a decade, I think Ryzen is going pretty well so far.

Overall I am really loving my 1800x system given how well it runs with all of the known issues at the moment. If people aren't impressed now, then I'm sure they will be once we start to see some new steppings and the optimizations that will certainly be coming in late 2017 and early 2018. Anyone picking up a Ryzen system right now will see some issues, but I'm very optimistic that many of them will be fixed. I had no real reason to upgrade from my 4790k@4.4ghz but I was pretty excited about having 8 physical cores and I'm even more excited to see how it behaves 1-2 months from now.

March 12, 2017 | 01:38 AM - Posted by Anonymous (not verified)

The issue is that entering the non-local l3 cache incurs a hefty penalty due to the required fabric access.
Keeping threads confined, as much as possible, to their ccx is the solution.
Basically, the issue is that you need to treat each ccx as a socket. The scheduler can do this but it needs to be taught.

March 11, 2017 | 12:50 PM - Posted by Anonymous (not verified)

Ryan guru3d is reporting Microsoft says there is a problem. Seems to contradict you.

March 11, 2017 | 02:07 PM - Posted by Josh Walrath

You mean this article?  http://www.guru3d.com/news-story/microsoft-confirms-windows-bug-is-holding-back-amd-ryzen.html

Did you read the actual contents of the tweet from MS?

March 11, 2017 | 03:30 PM - Posted by Anonymous (not verified)

Exactly this, people just take the news and run with it. Doesn't matter if the tweet was a bunch of random gibberish about dinosaur hunting, if an internet news outlet puts some bullshit together everyone will start citing it.

March 12, 2017 | 12:53 AM - Posted by rukur

Considering tweets are the most useless way to get information across: Linux users are happy with the way Ryzen performs and Microsoft users are complaining about games. What would you expect but "maybe bugs, we will look into it"?

March 12, 2017 | 06:16 AM - Posted by ET3D (not verified)

If Linux users really cared about game performance they'd be running Windows. Of course they'd say they're happy. :)

March 11, 2017 | 03:30 PM - Posted by Anonymous (not verified)

Seems you're literally retarded.
Tell me, is your reading comprehension on the level of
A: half baked retard
B: 3 year old

March 13, 2017 | 09:33 AM - Posted by kidchunk (not verified)

https://youtu.be/JbryPYcnscA

March 14, 2017 | 01:56 PM - Posted by Dude (not verified)

There is a way to do that from regular task manager:
http://www.tech-recipes.com/rx/37272/set-a-programs-affinity-in-windows-...

March 20, 2017 | 08:47 AM - Posted by Kronus (not verified)

Tried out process lasso for a week or so, it adds some sort of strange overhead that affects I/O.

Middle of gaming I started noticing a delay in my inputs using the gaming setting and sectioning off cores. Not sure if other people have noticed but I don't think it's a good tool to accurately measure performance.

August 1, 2017 | 01:58 PM - Posted by Jeremy Collake (not verified)

It was not made to measure performance, so that would be an accurate statement. ThreadRacer may be more useful, but it needs an update (working on it!).

The I/O may be from the log (can be disabled) OR inter-process communication, though that is very limited.

August 1, 2017 | 01:56 PM - Posted by Jeremy Collake (not verified)

As the author of Process Lasso who just happened to be reading this thread, I am first very grateful to hear it mentioned, and glad people are making good use of it.

It is important to understand that Process Lasso is NOT yet another task manager like Process Explorer or Process Hacker. It is about automation and algorithms.

We have a graphical demo of one algorithm, ProBalance, and you can write your own demo if you don't trust ours. It is amazing, even after all these years. It restores system responsiveness during high CPU loads by making a marginal and temporary adjustment to select background processes. I simplified it into the product CPUBalance if you don't need anything else.

Otherwise, Process Lasso is great for setting persistent CPU affinities or process priority classes, etc... I won't post a link as I don't want to sound like a spammer.

I WILL caution against one thing I've seen: Do not OVER-TWEAK. At best, you get a marginal gain in performance. At worst, you destabilize your system or hurt performance. So, keep it limited. All my algorithms are designed around 'do no harm'.

March 11, 2017 | 12:12 AM - Posted by Moravid (not verified)

Ryzen seems to have issues with GPU workloads in general. Looking at Tom's Hardware's review of Ryzen, results in CPU-centric workstation tasks are comparable to Broadwell-E but fall noticeably behind in GPU composite tests.

http://www.tomshardware.com/reviews/amd-ryzen-7-1800x-cpu,4951-9.html

March 11, 2017 | 02:25 AM - Posted by Anonymous (not verified)

I would be interested in seeing if we can detect any change with the latest Nvidia drivers. I couldn't help but notice that r/AMD has reported on abnormally well-optimized drivers for the new 1080 Ti on a Ryzen R7 system.

March 11, 2017 | 12:19 AM - Posted by rukur

If it is not the scheduler to blame, then is it the compiler? You can't be saying people need to code differently in all the high-level languages for Ryzen.

Linux uses GCC to compile the kernel and a lot of other code.

March 11, 2017 | 01:21 AM - Posted by Allyn Malventano

Well, that's the rub. If Microsoft was to reconfigure to force CCX mapping to separate NUMA nodes, that would keep games and other apps within a node, but the catch is that it would cut the available cores to each application in half (unless they were specifically coded to span NUMA nodes).

March 11, 2017 | 04:29 AM - Posted by Anonymous (not verified)

That's not how NUMA coding works.

It is just an additional check on threads, for physical/logical/node: if possible, it keeps all threads that have dependencies on each other together in the same node. The non-dependent threads can be moved to the other node.

You can see Dual Xeon is handled properly by Windows 10 for gaming, here:

https://youtu.be/40h4skxDkh4?t=4m47s

You guys surely have a dual Xeon somewhere in your lab you can test core loading on in games.

If it's such a disaster, performance would degrade but it does not. These dual Xeons game just fine because Windows is aware of its design and schedules accordingly.

March 11, 2017 | 04:44 AM - Posted by Allyn Malventano

Yes, and if you forward ahead to 5:30 of that same video, he goes on to describe how other apps / games not coded to be NUMA aware can result in *worse* performance. Flipping that configuration over is not the magical solution to this issue as it only exchanges one set of problems for another.

March 11, 2017 | 05:43 AM - Posted by rukur

That is true, but he did say that an OS fix would solve the games' problems. Considering 4 cores / 8 threads is the gaming standard, having Windows 10 keep a game on one CCX might be a "gamer" solution that works in general.

I agree that developers making their apps or games NUMA-aware and addressing all the issues is the best option, but many old games will never be patched.

So there is a Windows 10 gaming issue here. Is the solution worse performance than the current state? I guess Microsoft would be looking at it, and Patch Tuesday is in a few days.

March 11, 2017 | 08:19 AM - Posted by cracklingice (not verified)

The end user can do this themselves with process affinity.
https://youtu.be/Eleu0zEw-Eo?t=10s

March 11, 2017 | 03:38 PM - Posted by cam (not verified)

I believe what you're thinking of is Processor Groups. Because thread affinity in NT was always represented by a pointer-sized data type (KAFFINITY) with 1 bit per logical CPU, Microsoft had to introduce Processor Groups to support more than 64 CPU cores without breaking existing drivers and applications. As you state, Windows locks un-enlightened processes to a single processor group.

However, this isn't an issue for these smaller Ryzen chips, because Windows uses the minimum number of processor groups necessary to contain all logical CPUs on the machine. Only when you exceed 64 logical CPUs do you end up with more than one processor group.

Source: https://msdn.microsoft.com/en-us/library/windows/desktop/dd405503(v=vs.85).aspx

"Systems with fewer than 64 logical processors always have a single group, Group 0."

"Each node must be fully contained within a group. If the capacities of the nodes are relatively small, the system assigns more than one node to the same group, choosing nodes that are physically close to one another for better performance."

August 1, 2017 | 02:00 PM - Posted by Jeremy Collake (not verified)

Bitmask was the word you were looking for ;).

32 bits = 32 possible core affinities
64 bits = 64 possible core affinities

August 1, 2017 | 02:01 PM - Posted by Jeremy Collake (not verified)

And great response, I just meant to toss in the word you probably had at the tip of your tongue but couldn't recall ;).

March 11, 2017 | 12:24 AM - Posted by Mr.Gold (not verified)

Yes, and NO!

You just proved what all the sites showing that the Windows 10 scheduler is not working correctly have been saying.

Why? Because we see a repeated 20% win (minimum frame rate) in games by using Windows 7 vs. Windows 10.

The issue is not SMT... it's cross-complex switching and erroneous cache estimation.

Since cross-complex communication runs at half DRAM speed, that's why DRAM speed affects gaming that much on Windows 10.
When Windows moves a thread from core 3 to core 4, all the L3 data is non-local, and it goes over the fabric at half DRAM speed (but on a wider bus).

To prove this, run the same games under Windows 7 vs. Windows 10:
massive boost in frame-rate consistency.

Also, look at the FPS for a simulated 4-core Ryzen. It's nearly 1:1 with an i7-7700K at equal clocks.

Like I suggested, just try reviewing the 1800X with half the cores (from the same CCX) disabled.
This will eliminate the cross-CCX task switching of Windows 10.

This is also why 2133 vs. 3200 DRAM makes a huge difference.

Anyway, you need to think of this problem in terms of cache locality, CCX organization, and fabric performance (RAM clock).

Not from a 10,000-foot view of a Task Manager graph.

March 11, 2017 | 08:25 AM - Posted by Nils (not verified)

I think you're right. I thought the exact same thing when I read this revealing section:

"The Windows scheduler has a habit of bouncing processes across available processor threads. This naturally happens as other processes share time with a particular core, with the heavier process not necessarily switching back to the same core."

March 11, 2017 | 03:17 PM - Posted by cam (not verified)

Switching a thread back to its old core is almost never going to be a win on a non-NUMA platform. It's likely that by the time the old CPU is available again and the thread to be migrated is finished with its scheduling quantum, the current CPU cache is now hotter than the original CPU core. Switching back to the old core means you pay _again_ the penalty of data not being in caches.

This does not apply to true NUMA platforms (with 2 or more memory controllers - not Ryzen with split L3). Since you pay huge latency penalties when accessing memory on another NUMA node, you _always_ want to be on the original node where the memory was allocated (because the kernel gives you memory pages local to your NUMA node).

The difference between Ryzen and a true NUMA platform is that you only pay an initial penalty for CCX migration, until your data populates the cache hierarchy. For real NUMA platforms, you pay a RAM latency penalty for _all_ memory accesses forever, if you don't get back to your ideal node.

March 11, 2017 | 02:15 PM - Posted by Adi Lutvich (not verified)

Exactly. A cross-complex switch is essentially an L3 flush. Games are particularly exposed due to their random thread scheduling.

March 11, 2017 | 12:27 AM - Posted by fbifido fbifido (not verified)

SMT OFF, 8 cores, 4 workers

Can you do SMT ON, 16 threads, 4 workers, and show us the graph?

March 11, 2017 | 01:27 AM - Posted by Allyn Malventano

There appears to have been just enough work going on to spill over onto the second CCX.

March 11, 2017 | 04:37 AM - Posted by Anonymous (not verified)

I wonder why Windows does not know how to treat Ryzen properly? Surely AMD must have known their 2x CCX design would have a problem if scheduling is not aware of the NUMA-like design.

March 11, 2017 | 04:45 AM - Posted by Allyn Malventano

Windows would treat each CCX as a NUMA node if the CPUID coding segmented it accordingly, but that doesn't appear to be the case currently.

March 11, 2017 | 08:40 AM - Posted by Dresdenboy (not verified)

Linux got some code to map APIC IDs to cores, LLCs (= CCXes), and sockets. If Linux needs that, Windows probably needs it too, since the CPUID bits can't describe the whole topology, it seems.

March 11, 2017 | 09:25 AM - Posted by Anonymous (not verified)

Shouldn't those 4 workers be maxing out one CCX instead of spilling to the other?

March 12, 2017 | 05:31 PM - Posted by Allyn Malventano

It depends. If the scheduler was made aware of the CCX / cache segmentation, it could prioritize one CCX over another, but you might still have workloads that would have behaved better otherwise (i.e. performs slower with 8 threads on only 4 physical cores vs. spread across 8). Point being there is no one size fits all fix here.

March 11, 2017 | 12:33 AM - Posted by Mr.Gold (not verified)

To keep it clear: SMT is NOT the problem.
The issue is CCX partitioning.

You can force a game's affinity to a given CCX, and this does boost performance.

(I don't have my Ryzen PC built yet to run all those tests myself.)
But the investigations I have seen show that Windows 7 doesn't seem to suffer from this problem. DX11 performance under Windows 7 seems much higher/smoother...

Something is just not right with Windows 10. And it could be more than the scheduler moving a given thread from one CCX to another.

Wasn't there proof that Windows 10 computes cache capacity wrong for Ryzen? (It adds the caches up to a total of 136 MB.)
So it's wrongly managing cache capacity on Ryzen.

March 11, 2017 | 01:30 AM - Posted by Allyn Malventano

We don't disagree, but the rumors we were attempting to control were those stating the performance issues stemmed from improper handling of physical vs. logical cores.

March 11, 2017 | 12:34 PM - Posted by Anonymous (not verified)

Your article headline and closing summary seem to insinuate a broader conclusion than was actually supported by your analysis.

Since simple attempts to manually apply thread affinities have largely improved performance by at least small margins, yes, a scheduler change with more heavily weighted de-bouncing would be effective regardless of application-level awareness of NUMA clusters.

As a side note, Zen's inter-CCX latency is also clearly not stellar, but very few programs actually have read patterns vulnerable to that, as hardware prefetching already does a great job of detecting and accelerating the vastly more common linear copies, whether contiguous or strided.

March 13, 2017 | 11:05 AM - Posted by Spunjji

THIS.

This article's headline comes across as false because it's too broad. There clearly IS a problem with the scheduler! It's not "the whole problem" but it is significant.

An adjustment to a more qualified headline and conclusion would be more appropriate.

March 12, 2017 | 11:39 PM - Posted by fbifido fbifido (not verified)

Allyn, are you saying Windows 7 has the same problem with the Ryzen CPU?

Your test shows that Windows 7 & 10 handle physical & logical cores the same way?

March 12, 2017 | 06:19 AM - Posted by ET3D (not verified)

Some tests were showing a clear advantage to SMT off, much higher than on Intel. How can this be explained by CCX partitioning?

March 12, 2017 | 11:33 PM - Posted by fbifido fbifido (not verified)

/*
But the investigations I have seen show that Windows 7 doesn't seem to suffer from this problem. DX11 performance under Windows 7 seems much higher/smoother...
*/

do you have any links/blogs/videos on this?

March 11, 2017 | 12:39 AM - Posted by fbifido fbifido (not verified)

Allyn Malventano, please run your test on Windows 7 to see if you get the same result.

Also, for this custom C++ app, can you ask AMD to help you rewrite it to use Ryzen to its full spec?
Most if not all modern compilers are optimized for the Intel platform.

Thanks.

March 11, 2017 | 12:56 AM - Posted by Ryan Shrout

The applications we are working with are super simple - 100-150 lines maximum. I don't think any optimization either way is happening.

March 11, 2017 | 12:56 AM - Posted by Allyn Malventano

The C++ apps are incredibly simple and are only creating threads or pinging between cores. If such a simple app must be rewritten with workarounds just for AMD processors, we have a serious problem.

March 11, 2017 | 12:18 PM - Posted by Anonymous (not verified)

Are your times RTT or half-RTT?

Are you using atomics, simple compiler fenced reads and writes, or something else?

March 12, 2017 | 11:44 PM - Posted by fbifido fbifido (not verified)

Can someone please show me these tests on Windows 7, using the same processors as in this article/blog?

Please & Thanks.

March 13, 2017 | 08:05 PM - Posted by Anonymous (not verified)

lmao... another AMD fan is furious that the C++ app is not 'optimized'

March 11, 2017 | 12:44 AM - Posted by Gaz (not verified)

None of this tries to explain where all this started. Windows 7 gives better results in almost all respects. Windows 10 is to blame; there is no question about that. The question might be why.

March 14, 2017 | 08:33 AM - Posted by Anonymous (not verified)

Is that not obvious? Windows 10 is beta software. 8-)

March 11, 2017 | 01:33 AM - Posted by Alamo

That was pretty much the same battery of tests that hardware.fr ran on release day about L3 cache memory latency.
Ryzen seems to hit the limit pretty quickly in some apps like games. Their idea was that limiting unnecessary data transfer between CCXes would help keep the bandwidth from saturating, and this is something that can be mitigated by the scheduler to some extent, but ultimately game developers are the ones who will optimize for Ryzen best.
And I read somewhere that the bandwidth of the Infinity Fabric is half that of the system RAM you use, so I'm guessing faster RAM should help as well.
Anyway, good article; too bad the 4-core Ryzen isn't available yet, as that would have helped figure things out.

March 11, 2017 | 11:21 AM - Posted by Anonymous (not verified)

It's in the same clock domain, so its speed is tied to RAM speed.
Logic dictates that it is at least as fast as RAM,
or RAM speed tests would show half the speed of Intel dual-channel CPUs.
Either it uses double data rate or a frequency multiplier of the RAM clock.

Or I may be wrong about how RAM speed tests work.

March 11, 2017 | 01:13 AM - Posted by Anonymous (not verified)

"Several outlets have posted that the culprit is the Windows 10 scheduler and its inability to properly allocate work across the logical vs. physical cores of the Zen architecture."

That's not what I believe happens based on what I've read about the new CPUs. What I believe is happening is that Ryzen chips feature 2 clusters of 4 CPU cores, with each cluster having its own L3 cache and a very slow interconnect between the clusters.

The Windows scheduler is allocating groups of threads that should be assigned to a single CPU core cluster across both clusters, resulting in terrible performance as the threads have to communicate with the other CPU cluster and its L3 cache over this very slow interconnect.

An update to the scheduler to better allocate groups of threads to a specific CPU core cluster, instead of willy-nilly across both clusters without much thought, should result in a notable performance increase.

March 11, 2017 | 01:59 AM - Posted by looncraz (not verified)

Try doing some process affinity stuff...

Because I found some major problems with the Windows 10 scheduler and Ryzen.

https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/pa...

March 11, 2017 | 02:12 AM - Posted by Master Chen (not verified)

Now PcPer lost the last bits of "credibility" I still hoped it was holding up to till now. You're literally showing to everyone around the world your true face, an ugly mug of a completely biased and undoubtedly bribed Intel shekel mongler.

March 11, 2017 | 03:03 AM - Posted by Josh Walrath

Phew, I was worried that we would keep our GODLIKE perspective on the industry! I'm glad we fell to the side and showed our real face with actual results that can be replicated by other parties.  I'm glad we got that out of the way!

March 11, 2017 | 03:52 AM - Posted by pdjblum

With you bro. Well said. He wrote a few lines of c++ and now he thinks he can explain with certainty what many competent engineers are still trying to sort out.

March 11, 2017 | 04:11 AM - Posted by Josh Walrath

Oh, I think the competent engineers at AMD know full well the limitations of their product.  The last thing they want to do is undermine sales, so they keep quiet.  This issue will be fixed in a later generation, but for now they need to sell product.  Ryzen isn't a bad chip, and it is certainly far better than the products before it.  It just is not perfect in every scenario.  Does that make it a bad product? No, not by a long shot.  It does a lot of things very, very well.  It just so happens that one aspect is not perfect.  Does that make it a garbage chip?  Absolutely not.  But it isn't perfect.

March 11, 2017 | 04:24 AM - Posted by pdjblum

Is anything perfect? But why not celebrate what this amazing achievement brings to the masses? Instead it is only 'not a bad chip,' according to you. So one fucking weakness that can be simply avoided by not playing at low res with a card that is designed for high res is enough for you to decide that it is "not a bad chip." If this is "not a bad chip," then what is the 6900 for more than twice the price? The 6900 must fucking suck if this "not a bad chip" can match or beat it for half the price and less TDP. He also already wrote off Naples. What the fuck?

March 11, 2017 | 02:16 PM - Posted by Josh Walrath

Who already wrote off Naples?  Also, calm down.  It is not like I am at a soccer match telling you that your son sucks.  I like AMD as much as the next person, but I'm not going to cheerlead products from any side.  Intel overcharges like mad and have excellent margins because of no competition for a long time.  Ryzen is a great step forward for AMD and they will learn from this release.  I would expect Zen+ to address many of the shortcomings of this particular process.

March 11, 2017 | 04:39 PM - Posted by Anonymous (not verified)

Zen/Naples runs fine on server OS/software systems that are optimized to hide most latency issues related to cache coherency across cores and CPU sockets; and even on consumer Windows, as you have stated, Ryzen is definitely an improvement over what came before.

So future AMD Ryzen revisions may get hardware tweaks that favor gaming workloads, or even some IP in the CCX units to load-balance for cache performance. Really, more information has to come out about the Infinity Fabric after the server products are released and AMD can actually discuss things in more detail.

I think that the main reason that the Infinity Fabric is not completely revealed at this time by AMD is because they still have not gotten the Zen/Naples(And any variants) SKUs to market that uses the Infinity Fabric IP and the Vega SKUs(professional and consumer) SKUs to market that also make use of the Infinity Fabric IP. It appears that AMD’s Infinity Fabric IP included on its Vega/consumer(?) Vega/Newer Radeon Pro WX and Radeon Instinct/AI GPU SKUs will be using the Infinity Fabric in the same manner as Nvidia uses its NVLink IP, for direct attached GPU accelerator usage with CPUs.

Over at S/A Charlie D. discussed some details and implications of the Infinity Fabric but much of that Infinity Fabric IP details are still to this day held up waiting for the NDAs, for the GPU and CPU SKUs(Naples) that make use of the Infinity Fabric, to expire.

It would be nice to see some more looks at Intel’s ring bus versus some Infinity Fabric workload testing once all the Infinity Fabric details are known. But AMD has definitely moved to a modular/scalable design ethos with its Zen/Naples server designs one fourth of which was used to make the Zen/Ryzen consumer(2 CCX unit, 8 core/16 threads) SKUs. The Navi GPU designs may be made modular in that same way, with smaller die sizes giving better Die/Wafer yields for both GPUs and CPUs.

March 16, 2017 | 12:25 AM - Posted by Anonymous (not verified)

wow,...so..this the kind of human called "fanboy"...whoaa...so scary!!..

March 11, 2017 | 04:50 AM - Posted by Allyn Malventano

We've been communicating with AMD throughout our testing process and they agree with our results, and that 'few lines of code' is doing a lot better than your armchair commenting.

March 11, 2017 | 09:05 AM - Posted by Harney (not verified)

:)

To be fair he does seem comfortable while he is commenting so he's not doing that bad ;)

http://livedoor.blogimg.jp/yukawanet/imgs/d/c/dccc43f2.jpg

March 14, 2017 | 12:15 AM - Posted by Sacco (not verified)

well said.

March 11, 2017 | 05:03 AM - Posted by Allyn Malventano

Here, let me do the work for you and provide the relevant quote from the article:

"I have been told from high knowledge individuals inside the company that even AMD does not believe the Windows 10 scheduler has anything at all to do with the problems they are investigating on gaming performance."

Our credibility is holding up just fine with AMD. What's your excuse? From where I sit, it looks like you are the one who is lacking credibility and is biased towards your favorite product, unable to deal with the fact that it is not as perfect as you were hoping it would be. Ryzen is still a great product, especially for the money, but it's just not the best at everything (particularly gaming).

Apologies if we hurt your feelings with actual results.

March 11, 2017 | 08:49 AM - Posted by Anonymous (not verified)

"I have been told from high knowledge individuals"

Yeah, you totally didn't make this up. I hope Intel pays you a lot of shekels for this, though.

March 11, 2017 | 09:59 AM - Posted by Ryan Shrout

I'm actually confused here. Please tell me how this information benefits Intel at all?

March 11, 2017 | 12:46 PM - Posted by Anonymous (not verified)

And that is why people don't believe you or follow your site anymore.

March 11, 2017 | 01:17 PM - Posted by CNote

You sure seem to? Bored in your moms basement?

March 11, 2017 | 07:23 PM - Posted by Anonymous (not verified)

Nah, I'm alright. What does Krzanich's cock taste like?

March 12, 2017 | 03:47 AM - Posted by Master Chen (not verified)

"Intel's e-mails on how people SHOULD (((((review))))) Zen OR ELSE? Nope, never 'eard of 'em!"
Classy, Ryan. Just classy.

March 12, 2017 | 03:34 PM - Posted by Anonymous (not verified)

That's across the entire market of web-based review sites! Maybe your elected officials/federal trade agencies are not enforcing the laws already on the books! But don't look for any relief in the current regulatory climate. It's all apples-to-oranges demagoguery now! A diversion to keep the blind fools focused away from the royal scam that is going down!

And they wandered in
From the city of St. John
Without a dime
Wearing coats that shined
Both red and green
Colors from their sunny island
From their boats of iron
They looked upon the promised land
Where surely life was sweet
On the rising tide
To New York City
Did they ride into the street
See the glory
Of the royal scam
.
.
.

March 12, 2017 | 08:33 PM - Posted by Master Chen (not verified)

>That's across the entire market of web based review sites!
Just a daily reminder that Elric and LPT guys bought their own Zen CPUs and motherboards for their own money, no one sent any "presents" to them. And they weren't the only ones. Just Saiyan.

March 12, 2017 | 09:29 PM - Posted by Anonymous (not verified)

I do not know these "Elric and LPT guys," and all of the midsize-to-large web-based review sites rely on free samples; even S/A uses some free samples. You just want to split hairs and sublimate your anger issues by constantly complaining about one website while ignoring the operating climate that all the websites operate under. Take it out on the inherent conflicts of interest in the entire system, not on one single website. Even Phoronix spins things, but no more than any other website, Tom's, etc.! All this smiting of one without reference to the others is not going to get you anywhere.

Go to AnandTech’s Ryzen: Strictly technical forum and read the forum posts, do not attempt to post there, you will be blocked as soon as you start to try any unrelated line of speculative accusations!

March 13, 2017 | 04:36 AM - Posted by Master Chen (not verified)

"Do not know Elric"?
Um, man, he's the original creator of motherboards.org (one of the most highly respected hardware outlets out there since the mid-90s, that is, until he got the site stolen from him by a couple of assholes. Literally. Motherboards.org and his original YouTube channel associated with the site were stolen from Elric), and he currently runs Tech of Tomorrow. This dude is as hardcore and as old-school as it gets. Compared to people like him, Ryan Shrout and Jeremiah Hellstrom are friggin' toddling babies.

March 13, 2017 | 10:53 AM - Posted by Anonymous (not verified)

You digress so much, and waving your "old school back in the day" references around like they are an extension of your "too little member made bigger" ego issues has no bearing in reality.

The AnandTech Ryzen: Strictly Technical forum has its opinions on the matter, with Allyn even chiming in there himself. I like the peer-review process over at that forum, and there will be even better results once Zen/Naples gets to market, with the professional server/HPC journals getting in on dissecting the scheduling problem with respect to Zen/Ryzen's new CCX-unit construct.

March 13, 2017 | 02:05 PM - Posted by Master Chen (not verified)

Will you stop that pointless bickering of yours already?

March 13, 2017 | 03:44 PM - Posted by Anonymous (not verified)

And it goes full circle with you, in a neverending cycle of your pathologically driven machinations of apportioning blame in an unrealistic manner while arguing with everyone and their dog!

Regress, digress, and rope-a-dope with even more pathologically driven machinations! And then it's rinse and repeat, all over again!

March 11, 2017 | 12:56 PM - Posted by Anonymous (not verified)

Come on, surely you must know by now that the incompetent morons on the Internet need to be made to feel good about their purchasing decisions. Don't bring potential facts to the table, or knowledge from someone inside a company; they mean nothing to the simple-minded.

March 11, 2017 | 07:14 PM - Posted by Anonymous (not verified)

Hearsay from an anonymous authority is now a fact? Jesus, you're stupid.

March 11, 2017 | 02:04 PM - Posted by Anonymous (not verified)

I've been told by high knowledge individuals from AMD that you are making this up. Source?

I'll show you mine if you show me yours....

March 12, 2017 | 05:08 AM - Posted by Martin Trautvetter

"I'll show you mine if you show me yours...."

Let me venture a guess here: That is your best pickup line and you're still bewildered why people aren't taking you up on what is clearly a fantastic deal.

:P

March 11, 2017 | 05:30 PM - Posted by Anonymous (not verified)

Man! You are treating this technology issue more like a football match than as the technical design choice (with tradeoffs) on the part of AMD that it was.

Ryzen is made from a 2-CCX-unit modular die that can be scaled up to produce any larger scalable multi-die design on an MCM, or maybe on an interposer, like the announced MCM-based Zen/Naples server design. AMD designed the CCX unit and Infinity Fabric IP with an emphasis on AMD's new scalable/modular methods on purpose. And AMD designed the modular CCX unit with full knowledge of any minor limitations that might result; AMD still has plenty of time for the necessary tweaking going forward to fix any issues that can be fixed, and it will move on from there.

It’s not like AMD is not currently testing any Zen+/Zen2 designs as we post! So give it a break with some of the issues that you are obsessed with being only issues concerning a single web based review/enthusiasts website. There are issues across the entire review/enthusiasts website sphere but that is best discussed in the proper context and not with respect to this design issue from AMD.

Ryzen is doing just fine so far, and most folks are happy; they know that tweaks are always in order for any newly released design. Your illogical butthurt is expanding to encompass the entire universe!

March 12, 2017 | 03:44 AM - Posted by Master Chen (not verified)

Always remember, guys, this is the """"""tech news""""" site that openly proclaimed in the past that the RX 480 is 60% below the GTX 1060 and that a GTX 1050 Ti is a "better choice than the RX 480". I'll let you decide whether to trust them or not.

March 12, 2017 | 11:22 PM - Posted by Allyn Malventano

If those were the results, then those were the results. We are capturing and analyzing frames directly from the output of each card, so there's not really any room for error in what we are seeing from the cards we are comparing. Just because you don't like the results does not make them any less factual. If you are so unhappy with our results, you should be spending your time reading sites you deem more reputable instead of astroturfing in our comments section.

March 13, 2017 | 03:00 AM - Posted by Anonymous (not verified)

I seem to remember that when this site reviewed the Radeon Pro Duo, the selection of games all had Nvidia money behind them. Your testing method and results may have been rigorous and accurate, but they were baked nonetheless.

March 13, 2017 | 04:43 AM - Posted by Master Chen (not verified)

"Rigorous and accurate"? Don't make me laugh even harder than I already am when reading all of this BS of theirs. The results were completely and utterly pulled out of their asses, which has been debunked by the general PC-enthusiast community many times over, all over the world.

March 13, 2017 | 05:21 AM - Posted by Exhale (not verified)

You must really suffer from buyer's remorse! I've seen you take on this personal crusade to protect AMD on so many sites, and you always do it with empty claims, very bad language, and an ill attitude.
I'm actually sorry that Intel and/or Nvidia stole your house, murdered your family, and left you without a job... oh, wait.

March 13, 2017 | 02:09 PM - Posted by Master Chen (not verified)

"Buyer's remorse"? Kiddo, if you'd read what I've been saying here for the past two or so years, you'd have learned by now that my current main system is an i7-2600K build. And that's just one of six different custom builds I have in my house right now: three on Intel and three on AMD. What "buyer's remorse", lol? Don't make me laugh with that crap of yours.

March 13, 2017 | 03:56 PM - Posted by Anonymous (not verified)

Your parents suffer from parents' Remorse, looking at your current and repeated line of illogical torts and retorts!

March 13, 2017 | 05:40 PM - Posted by Allyn Malventano

"Debunked many times over all of the world"? Don't make me laugh even harder than I already am while I'm reading AMD's now official statements confirming our results.

March 13, 2017 | 10:47 AM - Posted by Sam AMITTAI (not verified)

It's not just the Infinity Fabric latency:
the Ryzen 1800X is 8% behind Kaby Lake in IPC and 12% behind Kaby Lake in clock speed.

If you put all three of these together, you will have a good idea of what's happening.

AMD did a great job with Ryzen, but they have a lot of work to do with Zen 2.

It's quite possible to fix everything and catch up with Intel next year, but people need to understand there is no magic these days.

March 13, 2017 | 12:47 AM - Posted by Tim Verry

When the heck did we ever say those things? AFAIK we never reviewed a 1050 Ti (we did do that upgrade story, I guess), and from the GTX 1060 review:

"At $299, the GeForce GTX 1060 would be a tough sell against a $239 Radeon RX 480 with 8GB of memory. At $249 though, just a $10 premium over the RX 480, the GTX 1060 would be my choice. I have both an ASUS and an EVGA card in my hands that they assure me will be sold at those prices. If that’s the case, and it maintains those prices at typical sales outlets for a reasonable time frame, then the GeForce GTX 1060 will be the mainstream graphics card of choice. How other vendors justify the higher priced models ($289-299) with better cooling and features, will be judged on a case by case basis. But I expect it to be a tough battle in this price segment."
 

How are you getting that we said it was 60% faster than the RX 480? (Looking through the game pages of the review, I saw up to 20% in all but GTA V, where it was 36% faster than the RX 480 at 1080p, and those are the worst-case numbers; the RX 480 was actually slightly faster or only slightly slower in other games.)

You must be thinking of some other site...

Edit: Also, look at the holiday gift guide... we picked RX 480s. But yep, totally being paid by NVIDIA! /sarcasm. ;-)

March 13, 2017 | 05:58 PM - Posted by Allyn Malventano

I guess Intel bribed AMD to issue this statement then, eh, Chen?

March 11, 2017 | 02:25 AM - Posted by cam (not verified)

"The Windows scheduler has a habit of bouncing processes across available processor threads. This naturally happens as other processes share time with a particular core, with the heavier process not necessarily switching back to the same core."

Windows driver dev here - I can shed some light on why Windows does this and why it isn't as stupid as people think.

When a thread becomes ready for the CPU (a wait has been satisfied, it has become the highest priority thread, etc), the scheduler will look for an available logical CPU to place it on. That is - a core either idle or running a thread of lower priority than the one being scheduled.

If possible, the scheduler will try to place the thread on the logical CPU it was last running on (in the hopes the cache will still be hot), however sometimes at the very instant that it tries to do that, it finds another thread of equal or higher priority running on that CPU already.

Rather than waste time waiting for the ideal CPU to become available, the scheduler will go ahead and migrate the thread to a different CPU to allow it to execute immediately. This improves ready-to-running scheduling latency. The cache hotness cost is easily offset by the performance win of having the thread running even on a cache-cold CPU.
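
The migration/locality tradeoff described above can be probed from user space by constraining the scheduler's choices yourself. A minimal sketch in Python, assuming a Linux machine (`os.sched_setaffinity` is Linux-only) and that logical CPU 0 is available to the process:

```python
import os

# The set of logical CPUs the scheduler may place this process on.
original = os.sched_getaffinity(0)   # pid 0 = the calling process
print("allowed CPUs:", sorted(original))

# Pin to a single logical CPU: the scheduler can no longer migrate us,
# trading ready-to-running latency for guaranteed cache locality.
os.sched_setaffinity(0, {0})
assert os.sched_getaffinity(0) == {0}

# Restore the original mask so migration is allowed again.
os.sched_setaffinity(0, original)
```

On Windows the analogous knobs are SetProcessAffinityMask / SetThreadAffinityMask.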

March 11, 2017 | 02:53 AM - Posted by Anonymous (not verified)

Interesting, thank you for clearing this up. Does this also account for single-threaded software having its workload bounced around COU cores (logical and physical)?

March 11, 2017 | 02:54 AM - Posted by Anonymous (not verified)

CPU*

March 11, 2017 | 04:06 PM - Posted by cam (not verified)

It certainly could. At any given time, there are system threads created by drivers and the kernel itself doing bookkeeping tasks (receiving packets from the network, sending commands to your GPU to update the screen, flushing cached data to disk, etc.).

System threads are generally higher priority than application threads, so when one of them becomes runnable and maybe the other CPUs are in a low-power state, they will preempt your task even though it appears there are plenty of free CPUs. If you want to test this, try setting your single-threaded task to high or real-time priority for a little bit and see if it continues to bounce around. Be careful, because you may make the system unstable by preventing kernel threads from running on that CPU.

Some interrupt handling routines are locked to certain CPUs, these will always preempt your thread, no matter the priority.

If you have a particularly CPU intensive task, this will increase the chance that your thread will be moved, because any interruption will probably kick it to another CPU since it's ready to execute for the majority of time (rather than waiting on I/O, network, etc. that would allow the scheduler to leave it alone until it's ready to run again).
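
One low-risk way to poke at this priority interaction, sketched in Python for Unix-like systems (raising priority, as suggested above, usually requires elevated privileges, so this sketch goes the other direction):

```python
import os

# Higher "niceness" means lower priority, i.e. more willingness to be
# preempted by other runnable threads. Lowering priority is allowed for
# any user; raising it (negative increments) generally needs root.
before = os.getpriority(os.PRIO_PROCESS, 0)   # pid 0 = this process
os.nice(5)                                    # drop five niceness levels
after = os.getpriority(os.PRIO_PROCESS, 0)
print(f"niceness: {before} -> {after}")
```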

March 11, 2017 | 03:50 AM - Posted by Anto (not verified)

I also suspect that the Windows scheduler is not responsible for the poor performance in gaming.

If the Windows scheduler were responsible, how is Ryzen's performance in Cinebench so good? Shouldn't it be bad too?

I suspect this has something to do with the L3 cache in the Ryzen architecture, which is more difficult to fix.

March 11, 2017 | 04:02 AM - Posted by Anonymous (not verified)

Why don't you write some C code using hwloc (https://www.open-mpi.org/projects/hwloc/) to pin threads to specific cores and evaluate the performance?

Using hwloc you can benchmark what the throughput should be if cores are properly scheduled (by pinning them yourself), then compare the results to what Windows does on its own.
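
hwloc itself is a C library, but the pin-then-measure idea can be sketched with the Python standard library alone. This is a toy stand-in (Linux-only affinity call, arbitrary CPU number and workload), not the article's methodology:

```python
import os
import time

def busy_work(n=1_000_000):
    # A toy CPU-bound kernel standing in for the real benchmark.
    total = 0
    for i in range(n):
        total += i
    return total

def timed_run(label):
    start = time.perf_counter()
    result = busy_work()
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result

free = timed_run("scheduler's choice of CPU")

os.sched_setaffinity(0, {0})        # pin this process to logical CPU 0
pinned = timed_run("pinned to CPU 0")

assert free == pinned               # same work; only the placement differs
```

The real comparison would run a cache-sensitive workload and check whether pinning within one CCX beats the scheduler's free placement.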

March 11, 2017 | 04:10 AM - Posted by Martin Trautvetter

Very nice work, thanks for sharing your results!

March 11, 2017 | 05:00 AM - Posted by JohnGR

Ryzen is brand new, but it would have been at least strange if there was something wrong with Windows 10 and AMD and MS didn't know about it. And AMD would definitely have been giving out Ryzen samples to reviewers with a PS saying that Windows 10 fixes were on the way, if there was a huge problem there.

On the other hand even MS was saying that there is probably something to fix there.

http://www.guru3d.com/news-story/microsoft-confirms-windows-bug-is-holdi...

That is, if we assume that the person who posted that reply knows what they are talking about and isn't just reading from the manual and posting a generic reply.

March 11, 2017 | 05:08 AM - Posted by Allyn Malventano

That really looks like a generic reply to me. Surely if there was something they needed to look into, they wouldn't be phrasing it as if they were just now about to start doing so :). I'm frankly surprised to see Guru3D latching onto it so seriously.

All of that aside, it really does just look like the CCX modules (and the associated latency penalty for spreading coherent threads across them) put Ryzen in a grey area regarding how AMD chose to configure the CPUID. AMD could have set it to direct Windows to split the CCXs into NUMA nodes, but that would bring other problems with it: apps not properly coded to recognize NUMA would be restricted to a single node (half of the cores). It probably just boiled down to a decision to go with the config that had the least negative impact on the current software landscape of NUMA-aware code.

Unfortunately, this enumeration takes place deep in the hardware (changing the CPUID requires a patch at the BIOS/microcode level), so there is no easy way to make this a user-configurable toggle short of hacking the Windows kernel.
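
For illustration, the grouping at stake can be written down explicitly. This assumes the common Ryzen 7 enumeration of 16 logical CPUs with SMT siblings adjacent and physical cores 0-3 on one CCX, 4-7 on the other; that layout is an assumption for this sketch, not something the OS reports as NUMA nodes under the CPUID configuration AMD shipped:

```python
def ccx_of(logical_cpu, smt=2, cores_per_ccx=4):
    """Map a logical CPU index to its CCX, assuming SMT siblings are
    enumerated adjacently (logical 0,1 -> core 0; 2,3 -> core 1; ...)."""
    physical_core = logical_cpu // smt
    return physical_core // cores_per_ccx

# Logical CPUs 0 and 7 share CCX 0: coherent traffic stays on-CCX.
assert ccx_of(0) == ccx_of(7) == 0
# Logical CPUs 8 and 15 sit on the other CCX: coherent traffic between
# the two groups crosses the Infinity Fabric and pays the latency penalty.
assert ccx_of(8) == ccx_of(15) == 1
```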

March 11, 2017 | 06:36 AM - Posted by Anonymous (not verified)

There are patches being added even to the Linux kernel to fix SMT. I'm not saying it will boost performance by much, but as you can see, it clearly states that there are some issues in how the OS handles Ryzen's threads:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?i...

March 11, 2017 | 06:49 AM - Posted by Anonymous (not verified)

Actually, I just found some tests comparing Linux kernel 4.4 vs 4.10 (which contains the Ryzen SMT patch), and it seems that in some workloads, like 7-Zip compression, it gave some boost:
https://www.servethehome.com/amd-ryzen-7-1800x-linux-benchmarks-paying-f...

March 11, 2017 | 05:29 PM - Posted by JohnGR

To be honest, all this talk about how Windows will get patched and Ryzen will jump in performance reminds me of the whole fuss about the "dual core optimizer" about a dozen years ago.

It's also a little ironic. Ten years ago, while waiting for the TRUE quad-core Barcelona, we were laughing at Intel using glue to stick two cores together to create a dual core. Now AMD is using glue to create 8-core, 16-core, etc. CPUs.

--------

Anyway, Ryzen is a beast. It doesn't perform like a Kaby Lake 7700K in today's games? So what? No design is perfect and no design offers everything. AMD created a beast with empty pockets while, at the same time, it was firing a third of its staff. This is a tremendous achievement. Six months ago we were dreaming of Haswell performance. We got Broadwell for most of the productivity software.

March 12, 2017 | 06:12 AM - Posted by Anonymous (not verified)

I don't think those fixes will boost performance by much, if at all, but I'm certain they will iron out the issues where we've seen inconsistencies in Ryzen's performance.

March 11, 2017 | 06:26 AM - Posted by Huh (not verified)

Why aren't you comparing Win7 to Win10? You would see the culprit is Win10...

If you already do the effort of investigating... just do the obvious thing and test it on Win7.

I have a Dual boot system here with Ryzen. And Windows 7 does application and game benches 20% faster than Windows 10.

March 11, 2017 | 06:43 AM - Posted by rukur

From the Reddit thread on the issue: Microsoft is not supporting Ryzen on Windows 7, so AMD added support?

https://www.reddit.com/r/Amd/comments/5yqzyb/amd_ryzen_and_the_windows_1...

March 11, 2017 | 07:02 AM - Posted by Huh (not verified)

LOL?
Here's the AM4 Ryzen chipset driver for Windows 7:
http://support.amd.com/en-us/download/chipset?os=Windows+7+64

Also, I have the system running as I said. I don't know why people spread this lie that Ryzen isn't supported on W7.

Other reviewers have also tested Windows 7 already and found better performance because of better core usage in W7.

https://www.youtube.com/watch?v=XAXS8rYwGzg

March 11, 2017 | 06:32 AM - Posted by mateau (not verified)

A likely candidate for the problem could be the NVIDIA GTX 1080.

Asynchronous compute and asynchronous shader pipelines just do not work with NVIDIA hardware, and the driver emulation provided by NVIDIA is also a poor solution.

Since nobody is talking about this, and given the controversy regarding the 3DMark Time Spy benchmarks, a conclusion could be drawn that the GTX 1080's inability to support asynchronous compute may be the culprit.

March 11, 2017 | 06:58 AM - Posted by Anonymous (not verified)

So what happens when you disable 4 cores and thus effectively eliminate the Infinity Fabric traffic?

Theoretically it should then perform much closer to, say, a 7700K. In fact, it's perhaps not a bad idea to see what happens when you run both at 4.0 GHz too; compare everything at 8C/16T, 8C/8T and 4C/8T.
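
Short of disabling cores in the BIOS, the one-CCX case can also be approximated by restricting a process's affinity mask to a single CCX. Assuming the adjacent-SMT-sibling enumeration (an assumption for this sketch, not something queried from the OS), the CCX0 mask works out to the low eight logical CPUs:

```python
SMT_PER_CORE = 2
CORES_PER_CCX = 4

# Logical CPUs on CCX0, assuming SMT siblings are enumerated adjacently
# (logical 0,1 -> core 0; 2,3 -> core 1; ...).
ccx0 = {core * SMT_PER_CORE + t
        for core in range(CORES_PER_CCX)
        for t in range(SMT_PER_CORE)}
assert ccx0 == {0, 1, 2, 3, 4, 5, 6, 7}

# The same set as an affinity bitmask, usable as "start /affinity 0xff"
# on Windows or with taskset on Linux.
mask = sum(1 << cpu for cpu in ccx0)
print(hex(mask))
```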

March 11, 2017 | 06:58 AM - Posted by Osjur (not verified)

"With that, it is possible for Windows to logically split the CCX modules via the Non-Uniform Memory Access (NUMA), but that would force everything not specifically coded to span NUMA nodes (all games, some media encoders, etc) to use only half of Ryzen."

Allyn Malventano, I think you should change that "all games" to something else, because there are games, quite many in fact, which are coded to work with NUMA systems.

March 12, 2017 | 10:37 PM - Posted by Allyn Malventano

Yes, this is possible, but then every application not specifically coded to be NUMA aware will then be restricted to half of the available cores.

March 11, 2017 | 07:22 AM - Posted by asH (not verified)

Before you all gather into a mob to burn the witches:

Have you geniuses thought to look at DX11, or at how game engines, or for that matter 'to the metal' coding, could be affecting all of this? After all, NUMA-aware programs work fine.

March 11, 2017 | 07:23 AM - Posted by Anonymous (not verified)

I don't understand the Intel latency. Did you measure communication core by core, or did you ping all cores at the same time?

Maybe somebody can explain to me how the Intel ring bus works.

For example, say core 1 and core 4 must communicate, and core 2 and core 5 must communicate at the same time. How does the ring bus behave? Must cores 2 and 5 wait until 1 and 4 are finished, or can they all communicate at the same time?

March 11, 2017 | 07:39 AM - Posted by asH (not verified)

If Intel 8-core CPUs are both NUMA and UMA aware, how do they achieve 2-core FPS scores competitive with the 7700K? Curious.

March 11, 2017 | 08:00 AM - Posted by Jann5s

Great stuff. I think these types of simple benchmarks, which directly measure the performance of CPU subsystems, should be in all CPU reviews. It could be its own page.

March 11, 2017 | 08:36 AM - Posted by Anonymous (not verified)

A big thanks to PCPer for this in-depth look.

I'd like to point out that Intel has CPUs with multiple ring buses, the MCC and HCC variants of Haswell/Broadwell-EP (picture).

It would be interesting to see what the latency is between the two ring buses.
Can you possibly release the source code or repeat the test on a corresponding Xeon?

Although those CPUs should not appear much in consumer machines, they might in workstations. It would be interesting to know whether the Windows scheduler is aware of that and tries to prevent bouncing between the different ring buses.

March 11, 2017 | 08:45 AM - Posted by Harney (not verified)

Great Write up Allyn Thank you..

I would like to see testing done on Windows 7, because I am testing myself here with a couple of Ryzen builds, and Win 7 seems to be handling them much better. I am getting far better minimums on 7 than on 10.

March 11, 2017 | 08:57 AM - Posted by Mjoelner (not verified)

When I test with Intel's MLC software, the local-socket L2->L2 HIT latency is 38 ns and the L2->L2 HITM latency is 42 ns. This is for a non-overclockable Xeon E5-2640 v4 with 25 MB cache (2.7 GHz L3 cache). Does MLC work with AMD CPUs? I would like to know the numbers for the Ryzen CPU. Perhaps the HIT and HITM numbers are very different? I suspect the Infinity Fabric is used for moving data if cache lines are unmodified? Otherwise data must be written to DRAM first?

Also, please provide the Ryzen numbers for loaded latency.

March 11, 2017 | 08:59 AM - Posted by pln (not verified)

>custom written C++ application with a very simple premise: ping threads between cores

Would you please release your source code [no need to tidy it up or add comments]? Several of us would like to experiment by executing your exact code after compiling it with different compilers, and with different background processes and services controlled.
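
For anyone wanting to experiment before the source is released, the ping-pong structure itself is simple. The sketch below is Python, where the GIL and event wakeups dominate, so the absolute numbers say nothing about cache-to-cache latency; the article's tool is native C++ with threads pinned to specific logical CPUs and polling shared memory:

```python
import threading
import time

ITERS = 2000

def pingpong(iters=ITERS):
    """Bounce a token between two threads and time the round trips."""
    ping, pong = threading.Event(), threading.Event()

    def responder():
        for _ in range(iters):
            ping.wait()
            ping.clear()
            pong.set()          # answer the ping

    t = threading.Thread(target=responder)
    t.start()
    start = time.perf_counter()
    for _ in range(iters):
        ping.set()              # send the ping
        pong.wait()
        pong.clear()            # token is back; one round trip done
    elapsed = time.perf_counter() - start
    t.join()
    return elapsed / iters

rtt = pingpong()
print(f"~{rtt * 1e6:.1f} us per thread-to-thread round trip")
```

A native version would replace the events with spin-waits on a shared cache line and pin each thread to a chosen core, which is what exposes the CCX-crossing penalty.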

March 12, 2017 | 04:00 PM - Posted by Anonymous (not verified)

Custom tweaked from cut and paste, or NDA-related sorts of things! Go to the AnandTech "Ryzen: Strictly technical" forum thread; Malventano is also taking part in that discussion. There are whole testing software/SDK packages available to help in that test-code creation process.

And wait until the server market pros get their hands on the RTM Zen/Naples SKUs! There will be plenty of things sussed out there to assure top performance for any workloads that may tax the Zen/Naples and Zen/Ryzen (by extension) CCX units/Infinity Fabric/cache subsystems in a negative way.

March 11, 2017 | 10:19 AM - Posted by asH (not verified)

Windows 7 Non-Uniform Memory Access Architectures http://news.softpedia.com/news/Windows-7-Non-Uniform-Memory-Access-Archi...

Is there any correlation between Windows OS versions after Win7 and DirectX versions after Win7?
