
AMD Ryzen and the Windows 10 Scheduler - No Silver Bullet

Subject: Processors
Manufacturer: AMD

** UPDATE 3/13 5 PM **

AMD has posted a follow-up statement that officially clears up much of the conjecture this article was attempting to clarify. Relevant points from their post that relate to this article as well as many of the requests for additional testing we have seen since its posting (emphasis mine):

  • "We have investigated reports alleging incorrect thread scheduling on the AMD Ryzen™ processor. Based on our findings, AMD believes that the Windows® 10 thread scheduler is operating properly for “Zen,” and we do not presently believe there is an issue with the scheduler adversely utilizing the logical and physical configurations of the architecture."

  • "Finally, we have reviewed the limited available evidence concerning performance deltas between Windows® 7 and Windows® 10 on the AMD Ryzen™ CPU. We do not believe there is an issue with scheduling differences between the two versions of Windows.  Any differences in performance can be more likely attributed to software architecture differences between these OSes."

So there you have it, straight from the horse's mouth. AMD does not believe the problem lies within the Windows thread scheduler. SMT performance in gaming workloads was also addressed:

  • "Finally, we have investigated reports of instances where SMT is producing reduced performance in a handful of games. Based on our characterization of game workloads, it is our expectation that gaming applications should generally see a neutral/positive benefit from SMT. We see this neutral/positive behavior in a wide range of titles, including: Arma® 3, Battlefield™ 1, Mafia™ III, Watch Dogs™ 2, Sid Meier’s Civilization® VI, For Honor™, Hitman™, Mirror’s Edge™ Catalyst and The Division™. Independent 3rd-party analyses have corroborated these findings.

    For the remaining outliers, AMD again sees multiple opportunities within the codebases of specific applications to improve how this software addresses the “Zen” architecture. We have already identified some simple changes that can improve a game’s understanding of the "Zen" core/cache topology, and we intend to provide a status update to the community when they are ready."

We are still digging into the observed differences of toggling SMT compared with disabling the second CCX, but it is good to see AMD issue a clarifying statement here for all of those out there observing and reporting on SMT-related performance deltas.

** END UPDATE **

Editor's Note: The testing you see here was a response to many days of comments and questions to our team on how and why AMD Ryzen processors are seeing performance gaps in 1080p gaming (and other scenarios) in comparison to Intel Core processors. Several outlets have posted that the culprit is the Windows 10 scheduler and its inability to properly allocate work across the logical vs. physical cores of the Zen architecture. As it turns out, we can prove that isn't the case at all. -Ryan Shrout

Initial reviews of AMD’s Ryzen CPU revealed a few inefficiencies in some situations, particularly in gaming workloads running at common resolutions like 1080p, where the CPU becomes more of a bottleneck when paired with modern GPUs. Many have theorized about what could be causing these issues, and the most recent attention has been directed at the Windows 10 scheduler and its supposed inability to properly place threads on Ryzen cores for the most efficient processing.

I typically have Task Manager open while running storage tests (they are boring to watch otherwise), and I naturally had it open during Ryzen platform storage testing. I’m accustomed to how the IO workers are distributed across reported threads and, in the case of SMT-capable CPUs, across cores. There is a clear difference when viewing our custom storage workloads with SMT on vs. off, and it was dead obvious to me that core loading was working as expected while I was testing Ryzen. I went back and pulled the actual thread/core loading data from my testing results to confirm:


The Windows scheduler has a habit of bouncing processes across available processor threads. This naturally happens as other processes share time with a particular core, with the heavier process not necessarily switching back to the same core. As you can see above, the single IO handler thread was spread across the first four cores during its run, but the Windows scheduler was always hitting just one of the two available SMT threads on any single core at one time.

My testing for Ryan’s Ryzen review consisted of only single-threaded workloads, but we can make things a bit clearer by loading down half of the CPU while toggling SMT off. We do this by increasing the worker count to 4, half of the available threads on the Ryzen processor (8 with SMT disabled in the motherboard BIOS).


SMT OFF, 8 cores, 4 workers

With SMT off, the scheduler is clearly not giving priority to any particular core and the work is spread throughout the physical cores in a fairly even fashion.

Now let’s try with SMT turned back on and doubling the number of IO workers to 8 to keep the CPU half loaded:


SMT ON, 16 (logical) cores, 8 workers

With SMT on, we see a very different result. The scheduler is clearly loading only one thread per core. This could only be possible if Windows were aware of the 2-way SMT (two threads per core) configuration of the Ryzen processor. Do note that the workload will sometimes shift around every few seconds, but the total loading on each physical core still remains at ~50%. I chose a workload that saturated its thread just enough for Windows not to shift it around as it ran, making the above result even clearer.

Synthetic Testing Procedure

While the storage testing methods above provide a real-world example of the Windows 10 scheduler working as expected, we do have another workload that can help demonstrate core balancing with Intel Core and AMD Ryzen processors. A quick and simple custom-built C++ application can be used to generate generic worker threads and monitor for core collisions and resolutions.

This test app has a very straightforward workflow. Every few seconds it spawns a new thread, capping at N/2 threads total, where N is the reported number of logical cores. If the OS scheduler is working as expected, it should load 8 threads across 8 physical cores, though which specific logical core is chosen within each physical core will depend on minute conditions in the OS background.

By monitoring the APIC_ID through the CPUID instruction, the first application thread watches all of the others and reports collisions, that is, when one thread from our app is running on the same core as another thread from our app. It also reports when those collisions have cleared. In an ideal and expected environment where Windows 10 knows the boundaries of physical and logical cores, you should never see more than one thread of a core loaded at the same time.


This screenshot shows our app working on the left and the Windows Task Manager on the right with logical cores labeled. While it may look like all logical cores are being utilized at the same time, they are not. At any given point, only LCore 0 or LCore 1 is actively processing a thread. Need proof? Check out the modified view of the Task Manager, where I copied the graph of LCore 1/5/9/13 over the graph of LCore 0/4/8/12 with inverted colors to aid visibility.


If you look closely, by overlapping the graphs in this way, you can see that the threads migrate from LCore 0 to LCore 1, LCore 4 to LCore 5, and so on. The graphs intersect and fill in to consume ~100% of the physical core. This pattern is repeated for the other 8 logical cores on the right two columns as well. 

Running the same application on a Core i7-5960X Haswell-E 8-core processor shows a very similar behavior.


Each pair of logical cores shares a single thread, and when thread transitions occur away from LCore N, they migrate perfectly to LCore N+1. It does appear that in this scenario the Intel system shows a more stable thread distribution than the Ryzen system. While that may in fact give the 5960X configuration some performance advantage, the penalty for intra-core thread migration is expected to be very small.

The fact that Windows 10 is balancing the 8 thread load specifically between matching logical core pairs indicates that the operating system is perfectly aware of the processor topology and is selecting distinct cores first to complete the work.

Information from this custom application, along with the storage performance tool example above, clearly shows that Windows 10 is attempting to balance work on Ryzen between cores in the same manner that we have experienced with Intel and its HyperThreaded processors for many years.


Pinging Cores

One potential pitfall of this testing process might have been seen if Windows was not enumerating the processor logical cores correctly. What if, in our Task Manager graphs above, Windows 10 was accidentally mapping logical cores from different physical cores together? If that were the case, Windows would be detrimentally affecting performance thinking it was moving threads between logical cores on the same physical core when it was actually moving them between physical cores.

To answer that question, we went with another custom-written C++ application with a very simple premise: ping threads between cores. If we pass a message directly between each pair of logical cores and measure how long it takes to get there, we can confirm Windows' core enumeration. Passing data between two threads on the same physical core should be fastest, as they share local cache. Threads running on the same package (as all threads on these processors technically are) should be slightly slower, as they need to communicate through globally shared caches. Finally, multi-socket configurations would be slower still, as they have to communicate through memory or a socket-to-socket fabric.

Let's look at a complicated chart:


What we are looking at above is how long a one-way ping takes to travel from one logical core to another. The line riding around 76 ns indicates how long these pings take when they have to travel to another physical core. Pings that stay within the same physical core take a much shorter 14 ns to complete. The above example was run on a 5960X and confirms that threads 0 and 1 are on the same physical core, threads 2 and 3 are on the same physical core, etc.

Now let's take a look at Ryzen on the same scale:


There's another layer of latency there, but let us focus on the bottom of the chart first and note that the relative locations of the colored plot lines are arranged identically to that of the Intel CPU. This tells us that logical cores within physical cores are being enumerated correctly ({0,1}, {2,3}, etc.). That's the bit of information we were after and it validates that Windows 10 is correctly enumerating the core structure of Ryzen and thus the scheduling comparisons we made above are 100% accurate. Windows 10 does not have a scheduling conflict on Ryzen processors.

But there are some other important differences standing out here. Pings within the same physical core come out to 26 ns, and pings to adjacent physical cores are in the 42 ns range (lower than Intel, which is good), but that is not the whole story. Ryzen subdivides its cores into what is called a "Core Complex", or CCX for short. Each CCX contains four physical Zen cores, and the CCXs communicate through what AMD calls Infinity Fabric. That piece of information should click with the above chart, as it appears hopping across CCXs costs another 100 ns of latency, bringing the total to 142 ns for those cases.

While it was not our reason for performing this test, the results may provide a possible explanation for the relatively poor performance seen in some gaming workloads. Multithreaded media encoding and tests like Cinebench segment chunks of the workload across multiple threads. There is little inter-thread communication necessary as each chunk is sent back to a coordination thread upon completion. Games (and some other workloads we assume) are a different story as their threads are sharing a lot of actively changing data, and a game that does this heavily might incur some penalty if a lot of those communications ended up crossing between CCX modules. We do not yet know the exact impact this could have on any specific game, but we do know that communicating across Ryzen cores on different CCX modules takes twice as long as Intel's inter-core communication as seen in the examples above, and 2x the latency of anything is bound to have an impact.

Some of you may believe that the Windows scheduler could be optimized to address this, perhaps by keeping processes on one CCX where possible. Well, in the testing we did, that was already happening. Here is the SMT ON result for a lighter (13%) workload using two threads:


See what's going on there? The Windows scheduler was already keeping those threads within the same CCX. This was repeatable (some runs were on the other CCX) and did not appear to be coincidental. Further, the example shown in the first (bar) chart demonstrated a workload spread across the four cores in CCX 0.

Closing Thoughts

What began as a simple internal discussion about the validity of claims that Windows 10 scheduling might be to blame for some of Ryzen's performance oddities (and that an update from Microsoft and AMD might magically save us all) has turned into a full day of testing, with many people chipping in to help put together a great story. The team at PC Perspective believes strongly that the Windows 10 scheduler is not improperly assigning workloads to Ryzen processors due to a lack of knowledge about the structure of the CPU.

In fact, though we are waiting for official comments we can attribute from AMD on the matter, I have been told by highly knowledgeable individuals inside the company that even AMD does not believe the Windows 10 scheduler has anything at all to do with the problems they are investigating on gaming performance.

In the process, we did find a new source of information in our latency testing tool that clearly shows differentiation between Intel's architecture and AMD's Zen architecture for core-to-core communications. In this way at least, the CCX design of 8-core Ryzen CPUs more closely resembles a 2-socket system. With that, it would be possible for Windows to logically split the CCX modules using Non-Uniform Memory Access (NUMA), but that would force everything not specifically coded to span NUMA nodes (all games, some media encoders, etc.) to use only half of Ryzen. How does this new information affect our expectations for something like Naples, which will depend on Infinity Fabric even more directly for AMD's enterprise play?

There is still much to learn and more to investigate as we find the secrets that this new AMD architecture has in store for us. We welcome your discussion, comments, and questions below!

March 11, 2017 | 10:22 AM - Posted by Anonymous (not verified)

The problem is the penalty when the scheduler moves a thread across the CCXs, which are like NUMA nodes (or rather, should simply be considered two separate CPUs altogether like a dual CPU system).

The penalty is huge for that. Xeons and other systems are properly handled by the Windows 10 scheduler. Ryzen is not. It presents itself as a single CPU, so Windows doesn't care about scheduling across CCXs. And you take about a 10x penalty (roughly 22 GB/s vs. 200+ GB/s) for moving across the Infinity Fabric (Intel called theirs QPI, I think).

This article is really ignorant. You need to change the title, and particularly that bolded editor's note at the top.

March 11, 2017 | 11:58 AM - Posted by Allyn Malventano

It's up to AMD to identify the need for NUMA segmentation in their CPUID. That's not the scheduler's fault.

March 11, 2017 | 03:14 PM - Posted by Pixo (not verified)

Ryzen is not truly NUMA, as that would mean each CCX would have its own RAM tied to it.
And it is the scheduler's task to take cache into consideration when deciding where to run a thread.
It seems that the cache was wrongly detected for a time, or that detection was/is unreliable.
https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/#p...

A scheduler update or correct cache detection may not be a silver bullet but would help.

March 12, 2017 | 01:12 PM - Posted by willmore

Ryzen chips aren't actually NUMA as each core can access all of main memory with the same latency. That's what NUMA schedulers are meant to address. In true NUMA systems, the goal of a NUMA aware scheduler is to keep a process on a core as close to its memory as it can.

OS support for NUMA goes beyond just the scheduler, it is in the memory allocator as well--you only want to allocate memory to a process that is local to the core it's bound to.

The situation with Ryzen is more like the first 'dual core' Intel chips which had two cores (each with local L2) which shared access to main memory via the FSB. The difference is that the two CCX in Ryzen share access to main memory internally instead of through a shared FSB and they have local L2 and L3.

It's not difficult to believe that Windows 10 would need a little tuning to support that configuration. It's also not difficult to believe that Windows 7 retains support for it. One could imagine the conversation: "Shall we keep support for Smithfield?" "If people run Win10 on Smithfield, they shouldn't expect the best performance, so no, let's drop support for that."

So, no, AMD was not wrong to not set the CPUID info to declare the two CCX as NUMA (because they're not). I don't think an OS could get enough information from the CPUID info to properly support this--unless you include the CPU VendorID/Family/Model info in that. Then you'd need a table of 'quirks' that say, oh, yeah, treat the first four physical cores as one processor and the second four as another and schedule appropriately.

And, it may go beyond that. Win10 may have dropped support for that Smithfield type of scheduling because of the reasonable expectation that Win 10 wasn't going to run on a dusty old Pentium D.

March 11, 2017 | 10:33 AM - Posted by Jeremy (not verified)

Hey Allyn, this was a great read that finally seems to diagnose what is really going on here with the R7 series. My question is: are the 6 and 4 core variants the same architecture, in the sense that they will also be 2 modules of 3 cores for the six core and 2 modules of 2 cores for the quad core Ryzens? Or do we not know this information yet? If they aren't, it would fix this gaming weakness, correct? Since it would eliminate the cross-module communication.

March 11, 2017 | 12:01 PM - Posted by Allyn Malventano

The 6 core will be 3 per CCX, but I really hope the quads use only one CCX, because plenty of things will want to be able to hit 4 logical cores without having to spill over onto another CCX.

March 11, 2017 | 12:10 PM - Posted by Anonymous (not verified)

I think you really overestimate the amount of latency-sensitive IPC of common workloads and underestimate the value of larger cache pools.

Most people will not be trying to do FEA on their 6c R5-1600X or whatever.

March 11, 2017 | 12:16 PM - Posted by Jeremy (not verified)

So then, in theory, the 6 core variant would further magnify the problem of the 8 core processors, since even lighter loads would have to cross CCXs compared to the R7s? I would assume the 4 core variant would have no need for separate modules and would simply be one half of the R7 series. Correct me if I'm wrong in thinking this. Thanks

March 12, 2017 | 09:21 PM - Posted by Alamo

AMD made a lot of compromises to save money, so the 4-core CCX design is all they've got as far as I know; the 4 core is definitely 1 CCX, and 6 cores might be 4+2 or 3+3, but 4+2 makes more sense considering the current issue with Infinity Fabric.
And again, Infinity Fabric's problem is not speed, it's the overall limit; tweaking the balance according to size and priority to get rid of unnecessary back and forth between CCXs will go a long way toward helping latency compared to Windows' current random swapping.
So the bottom line: Ryzen can swap between CCXs without losing performance, it just needs to keep the fabric's load within reason.

March 13, 2017 | 12:56 AM - Posted by pohzzer (not verified)

Allyn Malventano - "but I really hope the quad use some only one CCX"

Why would AMD not have a 4 core die when that is the processor that will account for the bulk of their sales? If they have a 32 core Naples on the way they surely did a clean 4 core design for Ryzen 3 and 5?

March 13, 2017 | 02:34 AM - Posted by Tim Verry

Reportedly, Ryzen 5 will be the same 2 CCX as Ryzen 7 but with one core disabled on each. (See here: https://www.pcper.com/news/Processors/AMD-Launching-Ryzen-5-Six-Core-Processors-Soon-Q2-2017)

What we are wondering now is whether AMD's scalable 2-CCX design will continue that trend down to Ryzen 3, disabling two cores on each CCX for a quad core (4c/8t) part, or whether AMD will use a single CCX for Ryzen 3, either by turning off a CCX on the normally 2-CCX monolithic die or by using a Raven Ridge die sans GPU, which would be a single CPU CCX (speculating here; I don't know for 100% certainty that RR dies are set up that way). In our work chat I was debating this and the binning benefits/strategies/product segmentations possible with both options. My guess, knowing AMD is going to be AMD, is that it will probably be a mixture of both, where Ryzen 3 chips can be either 2 CCXs with two cores each, or one CCX enabled and one disabled for 4 cores on one CCX, and customers will just have to play the silicon lottery to get chips with the single-CCX type of binning. This would give them the most salvageable dies from binning for the product that will be the highest-volume/most mainstream part, but of course it has the wrinkle of some chips not performing the same in some workloads due to this inter-CCX latency. Shrug... could go either way heh. I hope that it is a single CCX for simplicity and performance sake though!

Edit: as for the 32 core Naples, it is actually four 8c/16t 2-CCX dies on package rather than a monstrous 32 core monolithic die. https://www.pcper.com/news/Processors/AMD-Prepares-Zen-Based-Naples-Server-SoC-Q2-Launch

March 11, 2017 | 10:41 AM - Posted by asH (not verified)

How to Set Processor Affinity to an Application in Windows 7 https://www.sevenforums.com/tutorials/83632-processor-affinity-set-appli...
   Information
Processor affinity, or CPU pinning, enables the binding and un-binding of a process or thread to a physical CPU or a range of CPUs, so that the process or thread in question will run only on the designated CPU or range of CPUs rather than on any CPU.

By default, Vista and Windows 7 run an application on all available cores of the processor. If you have a multi-core processor, then this tutorial will show you how to set the processor affinity for an application to control which core(s) of the processor the application will run on.

March 11, 2017 | 11:25 AM - Posted by SpenReyn (not verified)

If you look at these tests:

http://www.zolkorn.com/en/amd-ryzen-7-1800x-vs-intel-core-i7-7700k-mhz-b...

NUMA effectively shutting off the last 8 cores of a Ryzen processor wouldn't necessarily be a bad thing, since a Ryzen with 4 cores and half of its L3 cache disabled competes very well with an i7-7700K.

It could be a very simple solution for AMD and the Motherboard makers until NUMA support can be added to games.

And as an added suggestion, you really should take a look into the compiler-bias "conspiracy theories"; since Intel settled that out of court for more than a billion dollars, there could be some truth to it.

March 11, 2017 | 11:30 AM - Posted by Anonymous (not verified)

This article has good initial data collection but questionable analysis and a premature conclusion, since SMT core allocation and inter-core/CCX latencies are only part of the picture.

Something that could murder Zen performance but will not affect smaller Haswells/Broadwells much is process migration across arbitrary cores, that is, back and forth across CCXs. Smaller Xeons (and their -E workstation counterparts) have every core sitting on a single ring bus with cache lines hashed or otherwise interleaved by low-order address bits across the set. At worst, an i7-5960X needs to rewarm its L2 cache, which is 256 kiB and has a 64 B pipe to L3.

On the other hand, Zen has twice as large L2s at 512 kiB, only 32B lanes to L3, and higher latencies to the remote CCX, so full cache rewarming will take at least 4x more time even in the case that all the needed lines are still in the remote L3 cluster.

Before claiming that scheduling has nothing to do with the issue, it needs to be measured and proven that the scheduler does not lightly bounce threads across CCXs under moderate to heavy system load.

As a final note, it seems disappointing that the scheduler put 4 moderately busy threads on the same CCX, since most workloads would benefit more from bigger shares of local L3 than from lower inter-core latency.

March 13, 2017 | 05:03 PM - Posted by Anonymous (not verified)

They said that SMT scheduling is not the issue, not that there are no scheduling issues.

March 11, 2017 | 11:33 AM - Posted by SpenReyn (not verified)

Another conspiracy theory for you: AMD deliberately didn't try to fix the Windows 10 NUMA problem so they would have an excuse for letting the Motherboard companies release Windows 7 drivers. Why? To work around their internal agreements to support only Windows 10 and still satisfy the 49% of gamers who still use Windows 7?

March 11, 2017 | 02:00 PM - Posted by Anonymous (not verified)

I'm with you Spen... Thought the same thing.
What a great move by AMD.. lol

March 12, 2017 | 09:58 PM - Posted by Anonymous (not verified)

No, not really! If anything may hold true for Ryzen getting Windows 7 support, it will be Zen/Naples business clients that may be using Windows 7 and have a lot of money tied up in mission-critical software that is only certified to work under Windows 7. The biggest IT expense in most businesses is their custom/mission-critical software, which costs way more to develop than a single OS or OS license. That's why XP was supported for some time after it went EOL, and 7 is the new XP!

If any server client wants Zen/Naples and there is millions to be made both AMD and M$ will make some exceptions to the rules that no one will hear about because of NDAs in legal contracts!

March 11, 2017 | 12:02 PM - Posted by Anonymous (not verified)

Yeah, SMT itself works just fine. It's when threads get migrated across CCXs (or have heavy dependencies across the CCX boundary) that it causes problems.

Nothing to do with SMT really. Just affinity issues with how threads are placed/moved. Win 10 has no problem placing a thread that needs to share cache data from one CCX onto the other one. Makes no difference if it's a SMT thread or not.

March 11, 2017 | 12:05 PM - Posted by John H (not verified)

Very good analysis PCPer - thanks for doing this.

Do we really think that there is a software 'fix' for the gaming performance issue? The Ryzen is a great MT CPU but is clearly down on ST (and therefore "lightly threaded") performance vs. Intel.

Is the gap in gaming performance much larger than expected vs. the single/four threaded performance gap between Ryzen 7 and 7700K (or 6900K in 4-thread mode)?

March 11, 2017 | 12:23 PM - Posted by ZoA (not verified)

PCper makes claims that are trivially proven false.

Win 10 clearly has a negative effect on Ryzen performance compared to Win 7, so PCper's claim that Win 10 performs adequately with Ryzen is patent falsehood. For all I know, all those tests they claim to have conducted are pure fiction as well. PCper stoops to a new low with this kind of deliberate consumer misinformation and deceit.

March 11, 2017 | 01:28 PM - Posted by John H (not verified)

Is there a link you can share that shows the performance difference between the two? I've seen the wccftech articles conjecturing this, but no hard data?

March 11, 2017 | 01:29 PM - Posted by Alamo

I didn't find any articles with a solid comparison between 7 and 10 yet; all I found is a video and some forum posts.
I would love to see any article if you have a link.

March 11, 2017 | 02:23 PM - Posted by John H (not verified)

https://www.reddit.com/r/Amd/comments/5xkhun/total_war_warhammer_windows...

Finally found this link

The system tests were identical except they used different FPS measuring software in Win 7 vs Win 10. Ryzen @ 3.5 GHz.

Basically -
Windows 7 didn't show any performance difference with SMT ON vs OFF.
Windows 7 is 10% faster in minimum FPS for SMT OFF and 20-25% with SMT ON than Windows 10 (!).
The OP did mention this was the game he had showing the biggest difference.

March 12, 2017 | 10:43 AM - Posted by ZoA (not verified)

https://www.youtube.com/watch?v=U9DE83lMVio

March 11, 2017 | 12:28 PM - Posted by Alamo

I am pretty sure that if AMD had released the 4 core alongside the 8 core, none of this would've happened; the lack of an alternative is what fueled the gaming debate.
4cores for 1080p
8 cores for 1440p/4k

March 11, 2017 | 01:23 PM - Posted by CNote

My thoughts exactly...

March 11, 2017 | 02:15 PM - Posted by Anonymous (not verified)

There are some benchmarks being run on AM4 motherboards with UEFI/BIOS features enabled that allow 8-core Ryzen SKUs to be cut down to 4 cores / 8 threads, and they show the half-enabled Ryzen (1800X) doing not so badly compared to the 7700K when both were clocked at 4.0 GHz.

So maybe there will have to be a Windows optimization that allows setting affinity by CCX unit, and games designed to keep any draw-call workloads/threads dispatched to a particular CCX unit on that CCX unit, with no dependent cross-thread draw-call workloads spread across both CCX units. The problem happens when a draw-call workload is bumped by the OS scheduler before the work is completed, with threads/tasks residing on one CCX unit's cores being transferred (before workload completion) to the other CCX unit's cores; an extra latency hop over the Infinity Fabric then has to take place to handle the resulting cache-coherency traffic. The thread's needed cached code/data becomes fragmented across the CCX boundary, requiring extra latency-inducing steps to access it.

So if any work(Draw Call, Other) is dispatched to one CCX unit’s core/s the that work should stay there until completion with no moving of that workload outside of the CCX unit boundary while the work task is ongoing. New draw call work should be allowed to be initially dispatched to one or the other CCX unit but once assigned to a core on a particular CCX unit that entire workload and thread task should stay on the same CCX unit until task(Draw Call/other) completion.

There are even some people suggesting that the Ryzen CCX units be logically split NUMA like so the OS can treat each CCX unit like a 2P/2 socket system logically.

So yes, it's not that Windows is failing to do the job it was initially designed to do with respect to dispatching threads. It's that the current way Windows dispatches threads is not optimized for Ryzen's new CCX hardware construct, which has better latency within a CCX unit (intra-CCX) than across CCX units (inter-CCX).

March 11, 2017 | 02:28 PM - Posted by Anonymous (not verified)

P.S. AMD designed the CCX unit with modular, scalable usage and fabrication in mind, so future SKUs can be scaled up or down by the CCX unit. This modular design methodology allows better die/wafer yields via smaller die production. It also allows a very scalable, lower-cost way of adding processor power in increments to future SKUs. The same modular methodology will be applied to the Navi-based GPU designs and even future APU designs, where CPU and GPU chiplets can be scaled up or down to produce products for all markets, from laptop to desktop on up to server and supercomputer.

March 11, 2017 | 12:47 PM - Posted by Mike S. (not verified)

Thanks for investigating this, Allyn and Ryan. Even if you're proving a negative, it's interesting stuff.

Neat to see all of the layers involved, factors, etc...

Even though Ryzen performance delivers on the promised 40% IPC boost (52% delivered) over their AM3 socket processors, I was one of the fanboys hoping they could match Skylake or Kaby Lake IPC.

I'm still pleased with Ryzen. I figure there are millions of gamers happily using Haswell, Ivy Bridge, even Sandy Bridge, so an R7 1700 or an R5 1600X should let me play most games with acceptable performance on the CPU side for years.

March 11, 2017 | 02:06 PM - Posted by DevilDawg (not verified)

Allyn, do you think that either AMD or MS could look into patching the scheduler to treat Ryzen as if it were a dual-Xeon part? It has two CCXs connected, so could it work as if it were an actual two-CPU part? I'm just curious what you think about that.

March 11, 2017 | 02:08 PM - Posted by Anonymous Bob (not verified)

Delidded Ryzen 7s have 2 chips. A CCX IS a separate chip. Ryzen 5s will have 3 CPUs per chip, using the chips with one defective CPU. I'd bet Ryzen 3s will have 2 CPUs per CCX. Eventually yields will rise. I remember when Phenom IIs were bought in cheap 2- and 3-core versions and the unused cores unlocked to get all 4. That might happen for Ryzens eventually.

March 11, 2017 | 02:28 PM - Posted by Josh Walrath

Ryzen is one monolithic die.  If you look closely at the delidded CPUs you can see that the die is continuous.  What makes it look like 2 dies is the two solder patches are square and appear as two units.

March 11, 2017 | 02:27 PM - Posted by Anonymous (not verified)

You guys need to focus on cross-CCX thread switching. This is the same issue as the dual Jaguar modules in the consoles, where devs were told to never cross modules or latency goes through the roof. Now throw in what is effectively an L3 flush when this happens on Zen, plus an aggressive OS, and you have a recipe for poor gaming performance. I'm not giving AMD a pass, because they should have seen this coming, but to say that the OS and the applications are not responsible at all is likely incorrect.

March 12, 2017 | 12:02 AM - Posted by Anonymous (not verified)

He is referring to this https://www.reddit.com/r/Amd/duplicates/5ybrxn/ryzen_7_is_actually_behav...

kind of like Nvidia's 970 3.5 fiasco but like 1000x more shitty for gamers.

March 11, 2017 | 03:32 PM - Posted by Fabiano Durão da Motta (not verified)

I like the graphs and the work done, and forgive me, but what exactly is new here?
"Most assuredly the Windows scheduler has no business in Ryzen's issues." Still, just like everyone else, you can't really put a finger on what exactly is wrong there, which most assuredly means: not sure.

Bleh

March 11, 2017 | 04:50 PM - Posted by Kronus (not verified)

Great write-up, glad to see some real testing of the problem rather than all the armchair nonsense being plastered all over the web.

I'm curious to see if the win7 vs win10 argument has any validity and whether or not there is any actual performance being left on the table.

March 11, 2017 | 05:16 PM - Posted by Anonymous (not verified)

that crap has been in dev for at least 4 years, is supposed to be W10 only, and they just realized NOW that there is something wrong.... with W10. Bunch of clowns.

March 11, 2017 | 06:59 PM - Posted by Anonymous (not verified)

Lol, this dropped off the AMD fanboi reddit pretty quick.

The Butthurt is Strong with this one - even though the product is really good esp. in the price / perf but unless it beats every CPU in existence today, the fanbois will cry... What a shame...

March 11, 2017 | 07:13 PM - Posted by Whatever707 (not verified)

They're at it again it seems. Microsoft themselves have acknowledged the problem, they should know...

https://www.guru3d.com/news-story/microsoft-confirms-windows-bug-is-hold...

And conveniently avoiding making comparisons with Win 7's better handling of the Zen architecture, which sees massive improvements in some games:

Quote:
"All of these were recorded at 3.5GHz, 2133MHz MEMCLK with R9 Nano:

Windows 10 - 1080 Ultra DX11:

8C/16T - 49.39fps (Min), 72.36fps (Avg)
8C/8T - 57.16fps (Min), 72.46fps (Avg)

Windows 7 - 1080 Ultra DX11:

8C/16T - 62.33fps (Min), 78.18fps (Avg)
8C/8T - 62.00fps (Min), 73.22fps (Avg)"

Just explain this^^, I'm waiting... And why not TEST IT YOURSELF instead?

Source: https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/pa...

March 11, 2017 | 08:13 PM - Posted by John H (not verified)

Would also be good to see the same game on an Intel processor, R9 Nano, on both platforms..

March 14, 2017 | 01:24 AM - Posted by Anonymous (not verified)

It is possible that the Windows 7 scheduler wasn't designed to handle 16-thread processors and is basically keeping threads limited to one CCX by accident. It will be interesting to see some more testing.

March 11, 2017 | 08:14 PM - Posted by 0VERL0RD (not verified)

Intel Paper on Hyperthreaded games:- https://software.intel.com/en-us/articles/multithreaded-game-programming...

"False sharing can cause some serious performance degradation on both dual- or multi-processor and HT-enabled systems. False sharing happens when multiple threads own private data on the same cache block. For Pentium 4 processors and Xeon processors, a cache block is effectively 128-bytes. Determining whether false sharing is an issue is easy with a profile from the VTune Analyzer, and fixing the problem is as easy as padding out data structures. Perhaps an even better fix is to structure data in a cache-friendly manner, on or at 128-byte boundaries. Note that these recommendations are very complimentary to those for avoiding 64K-aliasing, so watching out for one pitfall actually helps you prevent two or more! See item [5] in the Additional Resources section for a more in-depth explanation of false sharing"

Other resources:-

http://www.iuma.ulpgc.es/~nunez/procesadoresILP/Pentium4/Pentium%204%20I...

http://www.agner.org/optimize/blog/read.php?i=6

https://mechanical-sympathy.blogspot.co.uk/2011/07/false-sharing.html

The workload you tested wasn't GPU-intensive like gaming. You might also want to check for DPC latency, as Nvidia drivers have been notorious for high latency requiring hotfixes!

March 11, 2017 | 08:19 PM - Posted by sonicstorm (not verified)

While the scheduler seems to be working properly in your tests, I can't help but notice CPU load spread over all 16 threads in many online reviews. So the question is: why is that happening? Does the game code bypass the Windows scheduler?

March 11, 2017 | 10:30 PM - Posted by Stop Being A Faggot (not verified)

THIS JUST IN: SHITTY AMD SHIT IS SHITTY!!

Wowwwwwwww, didn't see that coming /s

March 12, 2017 | 12:14 AM - Posted by Anonymous (not verified)

Ryzen is here and you are (#゚Д゚)

you are always spoiling for a ლ(`ー´ლ)

and you now just ε=ε=ε=┌(;*´Д`)ノ around

Looking for a ლ(`ー´ლ)

because you were Σ(゜д゜;) and (((( ;゚Д゚))) that

Ryzen's performance has made folks(^○^) (*^▽^*) (✿◠‿◠)

and now you (≧ロ≦) loudly, expressing extreme (╬ ಠ益ಠ)

because you were so Σ(゜д゜;) and (((( ;゚Д゚))) that

Ryzen's performance was so \(◎o◎)/

and now you are really ヽ(o`皿′o)ノ and (≧ロ≦) loudly

and spoiling for a ლ(`ー´ლ)

March 12, 2017 | 12:21 AM - Posted by Anonymous (not verified)

fyi, i linked to this up above
https://www.reddit.com/r/Amd/duplicates/5ybrxn/ryzen_7_is_actually_behav...

i'm curious who really needs Ryzen 1700x?
Blender peeps?
People maybe compiling some serious code?
Dedicated db workstations?
Scene release groups encoding?
Gamers that stream themselves while playing, maybe?
Benchmark fags?
Any thing else i'm missing?

I know I've gotten by using an Atom netbook to do HTML/JavaScript and Photoshop. That netbook could play the original StarCraft just fine to get me by.

I have an i7 because I was future-proofing years ago. Still looks like I should have just saved $30 and got the i5.

the logic to buy what you need for that moment seems to hold true for everything except maybe console refresh transitions.

March 12, 2017 | 01:55 PM - Posted by Mr.Gold (not verified)

Can't generalize.

I still use my Q6600 PC from 2007.
I went with a cheaper quad core VS a faster dual core.

In 2007 the $250 Q6600 was no match in games for a $1000 X6800. But pretty quickly the Q6600 became the better performer.

Also at the time 3.2ghz was fine for gaming. So the Q6600 played all the games.

The exact same is true today. For $330 I would go with the R7 1700 vs. the $350 i7-7700K (and I did), as I plan to keep this PC for 5+ years.

Hopefully it will be another Q6600, and I will use it daily for a decade. (The Q6600 was hard to replace because it runs 99% of the games/apps I have beautifully.)

March 12, 2017 | 12:58 AM - Posted by Anonymous (not verified)

I don't know why people latched onto it being an SMT problem. Both Intel and AMD have SMT and it seems to be functionally very similar. The difference is the separate core complexes that AMD uses. Intel obviously doesn't use separate complexes, and they probably pay a latency penalty for that. The communication between cores in a many core Intel part will be slower than communication between cores in a tightly coupled CCX, but going off the CCX will incur increased latency. Intel also does not scale the memory bandwidth with core count either. With AMD's design, they can scale the memory bandwidth with the number of cores, since it looks like each die adds another dual channel memory controller. It is unclear how the memory controllers are connected to the rest of the system though. The L3 caches and memory controllers may have their own ports on the fabric router to allow for cache coherency. The AMD system architecture seems much more scalable, although there will be some cost due to NUMA overhead. Also, AMD can use cheap small die parts rather than expensive monolithic die parts, which should make them much cheaper to make. This is looking like the situation with Opteron all over again. The Opteron processor brought on-die memory controllers and point to point processor links while Intel was still using a shared bus with the single memory controller on the chipset. Intel eventually went the same route. Intel has stayed with large monolithic die with large L3 caches to make up for the limited memory system for a long time because the profit margins on such things are enormous.

Now I am really curious as to how AMD's fabric actually works. Looking at the die photos I have seen of Ryzen, it looks like it has a lot more un-core stuff than Intel 4-core parts. In fact, it may have more un-core area than Intel Xeon die photos I have seen. Hard to tell without doing precise measurements. Unless the die photos I have seen are actually Naples parts, I am wondering if AMD is really only making a single die variant. With possibly low yields on 14 nm, it may make some sense. They could just be selling the parts with defective links as the current Ryzen parts while stockpiling fully functional die for server parts.

I don't know how AMD's fabric works. It isn't much of a stretch that they could be using configurable high speed links. Any current high speed interconnect is going to be very similar, if not the same, as the PCI express physical layer, so the links used for interprocessor communication could just be configurable as either PCI express or as an inter-processor links depending on what protocol is enabled.

I haven't figured out how the connections would work yet. I suspect that the Ryzen die has a lot of high speed links which are not routed into the package on consumer level parts. I don't know how many it would take. It may have 32 or more links for inter-processor communications in addition to those routed for IO on consumer parts. In server parts, those links may be configurable as PCI express or as interprocessor links.

March 12, 2017 | 03:31 AM - Posted by Anonymous (not verified)

Go read Charlie D's assessment of the AMD Infinity Fabric over at S/A, because AMD's not going to do any deep dives into its Infinity Fabric IP until both Zen/Naples and Radeon/Vega are to market. There are still NDAs in effect in advance of Zen/Naples and Radeon/Vega actually being released, so AMD is all Johnny Tightlips until then!

March 12, 2017 | 06:33 AM - Posted by Anonymous (not verified)

There just doesn't seem to be much info available yet. It seems like something more would have leaked out. The whole thing kind of reminds me of the planned Alpha EV8 (21464) processor that was never made, though the original K8/Opteron was somewhat based on the same principles. In the proposed EV8 design, the caches, CPU cores, and memory were all connected to an on-die router which was to support routing between up to 512 processors. That was way back in about 2001, though. I don't know if Jim Keller played a significant role in that design.

Anyway, the high IO bandwidth isn't that surprising. Such distributed systems have massive benefits in that regard. Even without taking the processor interconnect into account, each of the 4 die would probably still have the x16 IO PCIe lanes of a normal Ryzen die. That is 64 lanes right there. I have some ideas on how they may be connecting the links between processors, but that doesn't give me any insight into how that connectivity is handled on die. It does sound like they are possibly running multiple protocols over PCI Express physical layers, though. No need to reinvent the wheel.

March 12, 2017 | 02:39 PM - Posted by Anonymous (not verified)

The Alpha EV8 (21464) was to be the first microprocessor with SMT. Thanks a lot for that, HP, you and Intel with your Itanium fiasco!

Look at the 21464's specs (and see where Intel's SMT (HT) came from, with some minor GIMPING on Intel's part):

"The microprocessor was an eight-issue superscalar design with out-of-order execution, four-way SMT and a deep pipeline. It fetches 16 instructions from a 64 KB two-way set-associative instruction cache. The branch predictor then selected the "good" instructions and entered them into a collapsing buffer. (This allowed for a fetch bandwidth of up to 16 instructions per cycle, depending on the taken branch density.) The front-end had significantly more stages than previous Alpha implementation and as a result, the 21464 had a significant minimum branch misprediction penalty of 14 cycles.[4] The microprocessor used an advanced branch prediction algorithm to minimize these costly penalties."(1)

(1)

"Alpha 21464"

https://en.wikipedia.org/wiki/Alpha_21464

March 12, 2017 | 08:07 AM - Posted by ET3D (not verified)

> Both Intel and AMD have SMT and it seems to be functionally very similar.

Totally untrue. I can't find the ref right now, but one site did compare games with SMT off for both AMD and Intel, and AMD got a big performance boost in some games. The AMD deltas were much higher.

That is why people have been blaming the Windows 10 scheduler (especially since on Windows 7 Ryzen behaved much closer to the Intel behaviour).

March 14, 2017 | 01:34 AM - Posted by Anonymous (not verified)

It wouldn't be surprising if Intel's SMT implementation (which has been tweaked through many processor generations) is a bit better than AMD's first generation implementation. That doesn't change the fact that the scheduling for SMT doesn't need to be any different. It will still prefer to use one thread per physical core first before loading multiple cores. This whole article was about how the SMT, and scheduling for it, is not being handled any differently between the two architectures. SMT is not the problem for the scheduler. Also, Intel chips can perform worse for some applications with SMT enabled. We leave it disabled where I work since the applications we use do not share cache well; it performs better with it off.

March 12, 2017 | 01:50 AM - Posted by RS84 (not verified)

How about the game developers being the problem?

Since Ryzen is an all-new architecture from AMD, and it looks like almost all game developers have been working with Intel CPUs...

Were any games tested for these scheduler issues?

March 12, 2017 | 03:26 AM - Posted by Terryfried (not verified)

If Windows 10 does not have a scheduling problem with Ryzen, then why is Microsoft preparing a patch for it?

March 12, 2017 | 04:42 AM - Posted by Anonymous (not verified)

The Windows 10 scheduler works just fine; it works as it was designed to. It's just that the current Windows 10 scheduler has not been updated/optimized to work efficiently with Ryzen's CCX units for gaming workloads that are sensitive to latency.

The problem is that the Windows 10 scheduler needs an update, and that patch may help some! But all the extra telemetry and ad-pushing workloads do not help matters at all when cached data/code gets fragmented (across other cores' caches) too far from the core (and its two processor threads) that needs it. WTF M$ and AMD! Get to coding!

March 12, 2017 | 05:04 AM - Posted by Anonymous (not verified)

Allyn, I agree with all of your article but one aspect. The Win 10 scheduler is ALMOST OK. The L3 cache should be treated as 8 MB per CCX rather than a unified 16 MB. Treating it as 16 MB will force cores from one CCX to look for data in the second CCX's L3 cache, which carries a latency penalty.

The Ryzen R7 is a dual quad-core, a Core 2 Quad in AMD's vision (except the Core 2 Quad was a quad processor, not an octa-core, and core coherency was handled over the FSB instead of a fabric interconnect). The Ryzen R7 should be seen as a dual-socket system.

March 12, 2017 | 02:47 PM - Posted by Anonymous (not verified)

Wait for AMD to properly describe their Infinity Fabric IP; they will not talk about it until the Zen/Naples and Radeon/Vega products fully RTM! NDAs are still in effect!

March 12, 2017 | 05:33 AM - Posted by ZoA (not verified)

Here are 2 simple short videos that prove PCper is full of excrement:

https://www.youtube.com/watch?v=BORHnYLLgyY
https://www.youtube.com/watch?v=JbryPYcnscA

It is perfectly clear from the videos that the way Win10 distributes threads affects performance, and Win10 is doing it in a random and suboptimal way.

Or in other words, the PCPer article above is pure disinformation.

March 12, 2017 | 06:11 PM - Posted by Whatever707 (not verified)

Add these too:

https://www.youtube.com/watch?v=U9DE83lMVio

https://www.youtube.com/watch?v=XAXS8rYwGzg

March 12, 2017 | 10:44 PM - Posted by Allyn Malventano

We have results that differ from those videos. They are additional data points, sure, but that does not mean our results are not factual.

March 13, 2017 | 02:12 AM - Posted by fbifido fbifido (not verified)

Allyn, ZoA's videos are saying the same thing as you: whatever tech is used for inter-CPU communication is the problem.
A new scheduler can help mitigate this problem.

So please test your code on Windows 7; we want to see the line graph.

March 12, 2017 | 05:54 AM - Posted by Master Chen (not verified)

If anything, 4chan's /g/ already debunked this GARBAGE of a so-called "article", completely exposing PcPer's lies and Ryan Shrout making up outright BS right out of his ass. This is especially laughable if you'll take into consideration the mere fact that Microsoft themselves already admitted that the problem is actually there, it exists, and that they're already working on the patch. And Microsoft admitted this BEFORE Ryan shat out this POS of an "article", basically completely F'ing up on spot. Glorious. Simply glorious.

March 12, 2017 | 05:59 AM - Posted by Master Chen (not verified)

Yes, not Ryan, but Allyn, whatever. PCPer is made up of Intel/Nvidia shills, so it doesn't matter who wrote it in this particular case. It'd be the same either way. What matters is the fact of a complete and utter F-up by PCPer.

March 12, 2017 | 01:33 PM - Posted by Anonymous (not verified)

The article exposed interesting information that is not obvious...
AMD fanbois discrediting anything that points to any fault being on AMD's side does not change the underlying facts one bit.

March 12, 2017 | 08:29 PM - Posted by Master Chen (not verified)

There was nothing "exposed" by them, because it's all BS pulled out of an ass. THEY were exposed for the liars that they are. Again, Microsoft already openly stated that the problem is there and that it is indeed in Windows, NOT in the processor.

March 12, 2017 | 10:13 PM - Posted by Anonymous (not verified)

You just have an AXE to grind, and you couldn't care less about anything else! CPU/GPU makers are not football teams!

You are going all Fukushima Daiichi after the Tsunami on everyone and grasping at any excuse for a verbal fisticuffs!

March 12, 2017 | 10:49 PM - Posted by Allyn Malventano

Man, you are really latching onto that very generic tweet from Microsoft, aren't you? Also, what exactly about the actual screenshots and test results are you claiming to be a lie? Put up or shut up.

We only put out this sort of content to educate folks and to (hopefully) steer or accelerate the relevant companies toward whatever optimizations are needed to FIX THE ISSUE. We are enthusiasts. We want everything to be better / faster / etc. Nothing about this is to bash one company or specifically promote another. If you insist on interpreting it that way, I recommend you check your own bias.

March 13, 2017 | 04:55 AM - Posted by Master Chen (not verified)

You'd be better off by completely deleting this so-called self-proclaimed "article", because you were caught red-handed and smack dab in your lies, but instead you're preferring to continue on throwing the tantrum and going into full denial. The thing is - it is YOU who should've been "putting up" by now and admitting everything, but considering that the actual thief always screams "thief!" the most loudest out of all people that are present in the room, it's pretty clear you're not going to admit on anything. Not like I'm surprised by any of this, in all honesty. Them shekels won't workout themselves, naturally.

March 13, 2017 | 11:47 AM - Posted by Nintendork (not verified)

Was pretty much an Intel shill detector done by the shill itself xD.

March 12, 2017 | 08:12 AM - Posted by ET3D (not verified)

This is an interesting article, however I think that an even deeper investigation is needed, given that disabling SMT on Ryzen does improve performance significantly in some cases, in a way that doesn't happen on Intel chips, and that it apparently doesn't happen on Windows 7. Perhaps the immediate suspects aren't the cause, but something is definitely screwy.

March 12, 2017 | 11:16 AM - Posted by savagejedi

Just a heads-up to the author and to those dismissing the tweet from MS: it's easy to dismiss the one you selectively choose when they actually responded with a clearer answer to a clear question.

March 12, 2017 | 01:35 PM - Posted by Anonymous (not verified)

The MS tweet didn't say anything about whether there is a problem in Windows or not... Reading too much into the tweet in either direction is just bad practice from both sides.

March 12, 2017 | 12:16 PM - Posted by knows all (not verified)

All you need to know:
https://www.youtube.com/watch?v=O54bww5zoRM

March 12, 2017 | 12:23 PM - Posted by Anonymous (not verified)

PCPer can you just add the Windows 7 tests into the mix and release your source code for the simple C tests? It would go a long way to making your audience happier.

March 12, 2017 | 12:46 PM - Posted by spkay31

Thanks for this well-written, well-researched, and documented article concerning the scheduling and SMT performance issues with Ryzen. It would be great if you could share the C++ code for these tests so others could verify the results on their own as well.

"While it was not our reason for performing this test, the results may provide a possible explanation for the relatively poor performance seen in some gaming workloads. Multithreaded media encoding and tests like Cinebench segment chunks of the workload across multiple threads. There is little inter-thread communication necessary as each chunk is sent back to a coordination thread upon completion. Games (and some other workloads we assume) are a different story as their threads are sharing a lot of actively changing data, and a game that does this heavily might incur some penalty if a lot of those communications ended up crossing between CCX modules."

Based on this analysis about the CCX inter-module latency issue possibly explaining the poor 1080p gaming benchmark results, it seems like the next logical step in testing would be to generate some synthetic data to be transferred and processed on threads in different CCX module cores.

Call me a skeptic, but even though I am a big fan of AMD processor technology, I would be surprised to learn that AMD CPU test engineering is not well aware of some of the issues with the CCX latency and is investigating approaches internally and with Microsoft to address them.

One of the other things that should receive more attention is the significant difference between the inter-core ping times of Ryzen (26ns) vs. Broadwell (14ns). I am surprised that with 14nm FinFET technology and a new CPU these times are not much closer, or even slightly in Ryzen's favor. The other bad news this indicates is that even with a single-CCX design (R5 1500X?) the SMT processing may still be considerably less efficient than Intel's.

March 16, 2017 | 08:21 AM - Posted by Aigolat (not verified)

This must be the only (or one of the only) comments that also touches the _inter_ CCX latency issue. It does seem to me that double the access latency for data if a thread runs on a different core might make a sizable difference and (as another commenter has pointed out), the Windows scheduler with its multiple runqueues and policy of picking the first free slot will really only make matters worse. (I have read up on the internals myself, though it’s been maybe almost a year or so; I do feel like I remember enough of it.)

So thread bouncing should also be investigated, maybe Ryzen would even profit from waiting a short time before moving threads off their home CPU. (Though even a single quantum is rather large; I’ll have to admit I don’t really know what to do there.)

It might also be worth looking into the performance Linux has. Whereas everyone basically gets the same scheduler on Windows, with little there to tweak, on Linux, there have been different schedulers for a long time, if not from its inception. This should allow for easier study of the effect of different scheduling policies on performance (for the very inclined ;) ).

I, myself, remember the Completely Fair Scheduler (CFS), the BFS and the O(1) scheduler, though there seem to be others. I pulled this PDF from a short Google search: http://www.diit.unict.it/users/llobello/linuxscheduling20122013_P1.pdf . I realize it has very suboptimal formatting, but otherwise, the content appears to be sane.

To address the point raised in the final paragraph: there could be different implementations of inter-thread concurrency at play. I'm pretty sure AMD is also using a different cache coherency protocol. Maybe those could account for at least some of the deviation.

[I don’t see any preview function, so let’s hope for the best …]

March 16, 2017 | 08:25 AM - Posted by Aigolat (not verified)

This must be the only (or one of the only) comments that also touches the _inter_ CCX latency issue. It does seem to me that double the access latency for data if a thread runs on a different core might make a sizable difference and (as another commenter has pointed out), the Windows scheduler with its multiple runqueues and policy of picking the first free slot will really only make matters worse. (I have read up on the internals myself, though it’s been maybe almost a year or so; I do feel like I remember enough of it.)

So thread bouncing should also be investigated, maybe Ryzen would even profit from waiting a short time before moving threads off their home CPU. (Though even a single quantum is rather large; I’ll have to admit I don’t really know what to do there.)

It might also be worth looking into the performance Linux has. Whereas everyone basically gets the same scheduler on Windows, with little there to tweak, on Linux, there have been different schedulers for a long time, if not from its inception. This should allow for easier study of the effect of different scheduling policies on performance (for the very inclined ;) ).

I, myself, remember the Completely Fair Scheduler (CFS), the BFS and the O(1) scheduler, though there seem to be others. I pulled this PDF from a short Google search: http://www.diit.unict.it/users/llobello/linuxscheduling20122013_P1.pdf . I realize it has very suboptimal formatting, but otherwise, the content appears to be sane.

To address the point risen in the final paragraph: There could be different implementations for inter-thread concurrency at play. I’m pretty sure AMD is also using a different protocol for cache concurrency. Maybe those could address at least some of the deviation.

[I don’t see any preview function, so let’s hope for the best …]

March 16, 2017 | 08:26 AM - Posted by Aigolat (not verified)

This must be the only (or one of the only) comments that also touches the

    inter

CCX latency issue. It does seem to me that double the access latency for data if a thread runs on a different core might make a sizable difference and (as another commenter has pointed out), the Windows scheduler with its multiple runqueues and policy of picking the first free slot will really only make matters worse. (I have read up on the internals myself, though it’s been maybe almost a year or so; I do feel like I remember enough of it.)

So thread bouncing should also be investigated; maybe Ryzen would even benefit from waiting a short time before moving threads off their home CPU. (Though even a single quantum is rather large; I'll have to admit I don't really know what to do there.)
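For what it's worth, a process can opt out of the bouncing entirely by pinning itself. A minimal sketch in Python (Linux-only: `os.sched_setaffinity` has no Windows counterpart in the stdlib, where `SetThreadAffinityMask` via ctypes would be the rough equivalent):

```python
import os

def pin_to_cpus(cpus):
    """Pin the calling process to the given set of logical CPUs so the
    scheduler cannot bounce it to another core (and, on Ryzen, cannot
    move it to the other CCX). Linux-only API."""
    os.sched_setaffinity(0, cpus)  # 0 = the calling process
    return os.sched_getaffinity(0)

if hasattr(os, "sched_setaffinity"):
    print(pin_to_cpus({0}))  # -> {0}
```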

It might also be worth looking at how Linux performs. Whereas on Windows everyone gets basically the same scheduler, with little to tweak, Linux has had different schedulers for a long time, if not from its inception. This should allow for easier study of the effect of different scheduling policies on performance (for the very inclined ;) ).

I, myself, remember the Completely Fair Scheduler (CFS), the BFS and the O(1) scheduler, though there seem to be others. I pulled this PDF from a short Google search: http://www.diit.unict.it/users/llobello/linuxscheduling20122013_P1.pdf . I realize it has very suboptimal formatting, but otherwise, the content appears to be sane.

To address the point raised in the final paragraph: there could be different implementations of inter-thread concurrency at play. I'm pretty sure AMD is also using a different cache-coherency protocol. Maybe those could explain at least some of the deviation.

[I don’t see any preview function, so let’s hope for the best …]

March 12, 2017 | 01:39 PM - Posted by Anonymous (not verified)

This should be developed further and put into general testing of CPUs going forward, as knowing the topology of the cores and the latencies involved can be useful information for some.

Just like FCAT became part of standard graphics card testing, this should become part of standard CPU testing.

March 12, 2017 | 03:09 PM - Posted by KoolAidPitcher (not verified)

Although I would agree there is no silver bullet, I think there is a problem with how Windows is scheduling threads. In contrast to the monolithic design of Intel's processors, Ryzen was built with a modular design: in this case, two compute complexes joined by an interconnect. This seems like a good alternative to Intel's strategy of maintaining two processor lines, one for consumer/gaming use and another for enthusiast/professional use; AMD has instead tried to create one well-rounded processor capable of both types of workload. The operating system scheduler needs to understand that not all cores are created equal, and should prefer keeping threads from the same application on the same compute complex. When an application is more parallel than a single compute complex can handle, threads should begin spilling into the other complex. It would also be preferable to balance threads across physical cores before assigning multiple threads to the same core via SMT.

I believe this type of modular, NUMA-on-a-chip design is the future, and while the operating system could handle thread scheduling better, that is only a band-aid; the true solution is for applications to be NUMA-aware. Some already are: video encoders such as x265 can handle NUMA nodes, and many games are programmed to be NUMA-aware on consoles, as the PS4 and Xbox One have a memory architecture similar to Ryzen's. I suspect, however, that the PC ports of these games did not carry over any of the NUMA scheduling of the console versions. Even with last generation's consoles, games had to rely heavily on thread-local storage to avoid load-hit-stores, due to the poor cache design of the PowerPC, so game developers should already be familiar with resolving these kinds of issues.
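The "same CCX first, physical cores before SMT siblings" policy described above can be sketched as a toy model. The topology here is an assumption, not from the article: 8 cores with SMT giving 16 logical CPUs, SMT siblings enumerated adjacently (0/1 share core 0), logical CPUs 0-7 on CCX0 and 8-15 on CCX1.

```python
def ccx_of(logical_cpu, threads_per_ccx=8):
    """Which compute complex a logical CPU belongs to (assumed layout)."""
    return logical_cpu // threads_per_ccx

def place_threads(n_threads, threads_per_ccx=8, n_ccx=2):
    """Pick logical CPUs for n_threads: physical cores first (even
    logical CPUs under the adjacent-sibling assumption), lower-numbered
    CCX first, SMT siblings only once all physical cores are taken.
    Note the two preferences can conflict (5 threads spill to CCX1
    rather than use CCX0's SMT); this sketch resolves it
    physical-cores-first."""
    total = threads_per_ccx * n_ccx
    physical = [c for c in range(total) if c % 2 == 0]
    siblings = [c for c in range(total) if c % 2 == 1]
    return (physical + siblings)[:n_threads]

print(place_threads(4))  # [0, 2, 4, 6] -> all on CCX0's physical cores
print(place_threads(6))  # [0, 2, 4, 6, 8, 10] -> spills onto CCX1
```

A real scheduler would of course weigh far more than this (load, power, cache warmth), but it shows how cheap the topology bookkeeping itself is.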

March 12, 2017 | 03:09 PM - Posted by Anonymous (not verified)

The problem, I believe, with Ryzen is that it runs too hot and stops you getting a fast clock speed on air or water. At the same clock as an Intel chip, Ryzen is faster on benchmarks and possibly gaming in general, but about 4 GHz is the limit.
I couldn't do anything with my 1700X past 3907 MHz; even just opening a browser would lock the system. I tried disabling SMT and shutting down six cores, but no cigar.

On gaming benchmarks such as Heaven at 1080p, my Haswell i5 smokes it because it can run a much higher clock; even my FX-6300 at 5217 MHz almost matches it.

As a result, the 1700X and ROG Crosshair are packed up ready for RMA. I didn't pay out £650 for a downgrade, which is exactly what it is for me.

March 13, 2017 | 11:29 AM - Posted by Anonymous (not verified)

If you were buying a chip to overclock, you should have purchased the 1700 SKU and saved even more money.

Chasing overclocking headroom on a top-end SKU is a fool's errand: the top-end SKU by design should have the least overclocking headroom, and for all the money spent it had better be clocked (base/boost) to the highest possible speed out of the box, or the customer is being ripped off!

Intel's entire line of "K" branded parts is just an elaborate ruse and a textbook case of marketing psychology, where Intel engineers a little more headroom into its "K" series parts to give the impression of "overclockability" to the uneducated consumers who fall for that "K" marketing scheme.

AMD's top-end 1800X performs better with respect to clock speeds than AMD originally expected, and the 1800X is at its proper limit by design, with as little overclocking headroom as possible. The real deal for the overclocker from AMD is the 1700 (non-X), which is $10 less expensive than the discounted 7700K SKU and offers 8 cores for a damn good price; overclocked, it performs like the top-binned 1800X!

That is the real deal for any real overclocker, not some marketing-driven special "K" brand that is intentionally engineered with a little extra "overclocking" headroom to meet the illogically perceived expectations of the rubes that fell for it!

March 13, 2017 | 12:00 PM - Posted by Nintendork (not verified)

OC to get what? 10% more?

The 1700X gives twice the performance of an i5 or FX-6300 with far lower power consumption and temperatures.

The i5 is just a quad core; try to do one little thing in the background while gaming and the CPU craps itself. Do that with a 1700 and it barely notices.

By that standard a 6900K would also be a "downgrade".

March 12, 2017 | 05:33 PM - Posted by kano32 (not verified)

Hi,
This is just my opinion.
All Ryzen CPUs have been compared against Skylake and Kaby Lake, and in games, where single-thread performance matters most, Intel still has the lead. When playing at 1080p with GPUs like the 1080 and 1080 Ti, the processor can become the bottleneck if it can't keep up at those high framerates.

For example, let's not forget that on most benchmarks where the Ryzen 1800X was losing to a 7700K, the 1800X scores about 160 in single-core Cinebench R15, whereas the i7 7700K scores about 193.

I would have loved to see the Ryzen 1800X in games against a CPU with similar single-thread performance, for example a Haswell i7 4770K at stock speeds, which scores 156 in Cinebench R15.

TL;DR: I think the difference between the Ryzen 1800X getting 90 fps and the i7 7700K getting 120 fps is precisely that single-core difference.

What do you guys think?

March 12, 2017 | 06:07 PM - Posted by Whatever707 (not verified)

Also, explain this if you can:

Look at the CS:GO and Rise of the Tomb Raider videos:

https://hardforum.com/threads/amd-ryzen-7-performance-windows-7-vs-windo...

So, when are you gonna really tackle this issue and run an actual Win7 vs. Win10 comparison, instead of ignoring some of the data in order to push a particular preferred hypothesis despite contrary evidence?

March 12, 2017 | 06:21 PM - Posted by Josh Walrath

Win7 gaming is beyond the scope of this article.  Using simple tools and a pretty focused testing regime, we get some hard numbers about what threading/scheduling looks like, as well as latency across cores and CCXs, vs. what Intel has.  We don't have any idea why there are perf differences between the OSes, but this testing gives us a clearer idea of what is happening behind the scenes, with a focus on Win10.  I'm not sure what further testing will be done this week, but it would be interesting to see these particular tests replicated on Win7.

So an analogy of what you are doing here... Tom decides to taste test the differences between a red delicious and a granny smith.  You don't like the results because he didn't include an orange!

March 12, 2017 | 06:38 PM - Posted by Whatever707 (not verified)

All we ask for is an objective Win7 vs. Win10 gaming benchmark comparison, so as to have further data available for objective analysis, as well as to get a general idea of what to expect from Ryzen once Win10 gets up to speed with (and perhaps beats, in some cases) Win7.

And the reason we want that is the reason we rely on sites like this for our information: the buying decision process.

More information from such a comparison certainly can't hurt that process, and it could also kill off some hypotheses / narrow down the search for the (sometimes massive) gaming performance discrepancies observed between the two OSes.

I think that's a pretty reasonable expectation considering the current context and the data available so far. The analogy is simply a non sequitur considering the motivations involved; a strawman, in other words.

March 12, 2017 | 06:58 PM - Posted by Anonymous (not verified)

Go to AnandTech's "Ryzen: Strictly technical" forum thread and read the posts there! That thread is ongoing, with forum members trying and testing things on Linux and Windows 7/10, and more is added each day!

Not all of the Infinity Fabric details are going to be released by AMD until the Zen/Naples and Radeon/Vega SKUs are RTM, so some NDAs are still in effect!

Maybe Cigarette Man may chime in on some forum in Dresden, but the truth is out there! Oh! If only Anand Lal Shimpi were not currently sealed up so snugly in carbonite in that special crypt deep below that round spaceship headquarters in the valley of silicon in Cali!

March 12, 2017 | 07:27 PM - Posted by Josh Walrath

Oooh, non-sequitur strawman logical fallacy! Let's throw in some false equivalency to complete the trifecta!

I am curious what you truly think the motivations are.  The question initially addressed was: "Is the Win10 scheduler broken when assigning threads to cores, and what kind of performance issues could we potentially see with inter-CCX communication?"  The results show the scheduler is working as expected in these situations, and that latency for core-to-core communication is pretty high.

Easy, right?

March 13, 2017 | 09:32 AM - Posted by kidchunk (not verified)

https://youtu.be/JbryPYcnscA

March 13, 2017 | 04:36 PM - Posted by Anonymous (not verified)

As I understand it, they LOCK (set the affinity for) the threads in the program that does the "pinging" between them.

The OS scheduler makes NO difference in this case (unless it's so completely broken it doesn't follow instructions).
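For reference, the shape of such a "ping" measurement can be sketched in a few lines of Python. This is a crude stand-in, not the tool the article used: it hands a token between two threads via Event objects rather than pinned threads spinning on a shared cache line, so it measures scheduler wakeup latency more than cache latency, but the method (time N round trips, divide by N) is the same.

```python
import threading
import time

def pingpong(iters=10_000):
    """Crude inter-thread round-trip timer: two threads pass a token
    back and forth via Events; returns seconds per round trip. Real
    core-to-core ping tools pin each thread to a fixed logical CPU and
    spin on shared memory, yielding far lower absolute numbers."""
    ping, pong = threading.Event(), threading.Event()

    def responder():
        for _ in range(iters):
            ping.wait()
            ping.clear()
            pong.set()

    t = threading.Thread(target=responder)
    t.start()
    start = time.perf_counter()
    for _ in range(iters):
        ping.set()
        pong.wait()
        pong.clear()
    elapsed = time.perf_counter() - start
    t.join()
    return elapsed / iters

print(f"{pingpong() * 1e9:.0f} ns per round trip")
```

Pinning each thread (e.g. with `os.sched_setaffinity` on Linux) before timing is what turns this into a core-to-core latency probe.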

March 13, 2017 | 12:16 AM - Posted by trued_2 (not verified)

Hey Ryan or Allen, could you guys test the thread-to-thread latency on older AMD CPUs such as Bulldozer or Piledriver, to see if they have the same issue with increased module-to-module latency? In theory they should have the same problem, unless MS already "fixed" the issue for those architectures in the past. If the issue remains, there could be performance yet to be seen in the older architectures should a fix from MS/AMD arrive, as well as more concern over why this issue wasn't addressed long ago. For science?

March 13, 2017 | 12:21 AM - Posted by trued_2 (not verified)

Sorry I spelled you're name wrong Allyn.

March 13, 2017 | 12:21 AM - Posted by trued_2 (not verified)

Sorry I spelled your name wrong Allyn.

March 13, 2017 | 01:13 AM - Posted by RyanH (not verified)

Great writeup. This kind of content is why I love PC Perspective.

March 13, 2017 | 02:39 AM - Posted by justwalle (not verified)

So how will naples work if they do something with the NUMA thing?

March 13, 2017 | 02:41 AM - Posted by justwalle (not verified)

Maybe AMD should use nipples driver for it? I mean naples. I am human so I have 2? female dogs have many of these.

March 13, 2017 | 02:42 AM - Posted by justwalle (not verified)

So AMD should use nipples driver? stderr I mean naples? Cows and wolves, female, have many of these. lolz.

March 13, 2017 | 05:10 AM - Posted by dragosmp (not verified)

Hi Allyn, thanks for trying to clear this thing up and for trying to moderate the comments, but resistance is futile. You are clearly an anti-AMD fanboy, and the numbers you posted are clearly just "expert opinions" made to put in a bad light what is clearly the best CPU that ever existed.

/end sarcasm

Can't wait for the next episode, where you'll hopefully throw some more mumbo-jumbo expert data at whether the second CCX's L3 cache is being used by the first CCX even though the second CCX's cores are idle.

/really end sarcasm

You did a great job, and you did answer what you were chasing; I'm now almost sure it's not the SMT (I'm a researcher, so I still want independent confirmation). It's your fault for not answering every question everyone has, but I hope you strive for perfection and come up with the magic bullet to solve Ryzen's gaming performance, or at least explain why it ain't gonna happen no matter the amount of smoke and fanboy mirrors.

March 13, 2017 | 05:30 AM - Posted by Anonymous (not verified)

"even AMD does not believe the Windows 10 scheduler has anything at all to do with the problems they are investigating on gaming performance."

I think this sentence, which was bolded, is poorly phrased and caused most of the confusion in the comments.

You have demonstrated that Windows does not have issues with labeling the cores or with SMT.

However, you do indeed discuss (and more so in your video) that the Windows scheduler is not aware of the CCXs. Therefore it would be correct to conclude that the Windows 10 scheduler DOES have a problem and could be causing low gaming performance.

You needed to be specific in your conclusion (that you were testing SMT...) and not generalize to "the scheduler is fine" when there are other possible issues (the CCXs)...

March 13, 2017 | 06:05 AM - Posted by Austin P (not verified)

Thank you! This is great work.

Everyone talks about gaming, but many lightly-threaded workloads suffer the same lackluster performance on Ryzen as many games. In Autodesk Inventor (3D CAD), a similarly clocked Xeon with 6 cores regularly outperforms Ryzen by a considerable amount. With many workstation applications being lightly-threaded, an Intel CPU is preferable.

I was hoping bug fixes would find this missing performance. We need more competition at the high end. But this lacking performance could simply be a result of the architecture.

We need to see how it turns out, but it doesn't look good. I may get a Xeon E5-1650v4 or a Core i7 6800K for my animation workstation instead of a Ryzen 7. I will wait, though, to see what happens.

March 13, 2017 | 04:10 PM - Posted by Anonymous (not verified)

Peer-reviewed sources, please! And do include any professional trade journals if you can!

Your comments appear anecdotal!

March 13, 2017 | 06:56 AM - Posted by Alamo

Great video follow-up; at least the comments are being useful in pushing things forward :p

March 13, 2017 | 11:25 AM - Posted by Nintendork (not verified)

Now guys, this is a perfect example of Intel's "call us before you write." Anyone can prove that the Win10 scheduler is messing with the CCX modules and thread assignment vs. Windows 7.

Never visit this site again.

March 13, 2017 | 02:05 PM - Posted by Anonymous (not verified)

Please for the love of god write a TL;DR. I don't have the time to read this hours long article.

March 13, 2017 | 03:32 PM - Posted by why_me (not verified)

Isn't there something wrong with the Intel latency? To my understanding, the further away the core is, the higher the latency should be.

March 13, 2017 | 07:08 PM - Posted by Thomas Bruckschlegel (not verified)

Wrong. Intel uses a ring bus. http://cdn.wccftech.com/wp-content/uploads/2013/12/Intel-Ivy-bridge-EP-D...

March 13, 2017 | 07:04 PM - Posted by Thomas Bruckschlegel (not verified)

I think the Win10 scheduler is correct in what it is doing for SMT. I see the bigger problem in the CCX design itself. It's a mixture of NUMA and non-NUMA architecture, where you can remove the M for memory:
instead of multiple sockets (with x cores and 2x threads), where each socket has its own memory channels, you have multiple 4-core, 8-thread clusters.
So, can an OS scheduler fix this?
No!
A scheduler does not know about an app's thread interdependencies / cross-thread communication workload.
The same is true for NUMA, but there an OS can assist, because memory allocations can be bound to sockets to achieve maximum performance and minimize cross-socket communication.
The impact of the CCX split can only be minimized at the application level, which means patches and additional development work are required per application!!!

BTW, DX12 is not automatically multithreaded (video at 18:00); engineers have to use it in a multithreaded way, if they want ;)

March 13, 2017 | 08:35 PM - Posted by Anonymous (not verified)

Parts of AMD's statement (from rhallock, AMD employee, posted in Gaming on Mar 13, 2017, 3:39:32 PM) looked at:

"Thread Scheduling

We have investigated reports alleging incorrect thread scheduling on the AMD Ryzen™ processor. Based on our findings, AMD believes that the Windows® 10 thread scheduler is operating properly for “Zen,” and we do not presently believe there is an issue with the scheduler adversely utilizing the logical and physical configurations of the architecture.

As an extension of this investigation, we have also reviewed topology logs generated by the Sysinternals Coreinfo utility. We have determined that an outdated version of the application was responsible for originating the incorrect topology data that has been widely reported in the media. Coreinfo v3.31 (or later) will produce the correct results.

Finally, we have reviewed the limited available evidence concerning performance deltas between Windows® 7 and Windows® 10 on the AMD Ryzen™ CPU. We do not believe there is an issue with scheduling differences between the two versions of Windows. Any differences in performance can be more likely attributed to software architecture differences between these OSes.

Going forward, our analysis highlights that there are many applications that already make good use of the cores and threads in Ryzen, and there are other applications that can better utilize the topology and capabilities of our new CPU with some targeted optimizations. These opportunities are already being actively worked via the AMD Ryzen™ dev kit program that has sampled 300+ systems worldwide.

Above all, we would like to thank the community for their efforts to understand the Ryzen processor and reporting their findings. The software/hardware relationship is a complex one, with additional layers of nuance when preexisting software is exposed to an all-new architecture. We are already finding many small changes that can improve the Ryzen performance in certain applications, and we are optimistic that these will result in beneficial optimizations for current and future applications."(1)

Parts included: the whole thread-scheduling and SMT-related parts of the statement!

So, no issue with the Windows scheduler doing what it was programmed to do!

So now on to the:

"we do not presently believe there is an issue with the scheduler adversely utilizing the logical and physical configurations of the architecture."(1)

There is the "we do not presently believe"(1) with a focus on the "presently" word!

What other issues (Windows Display Driver Model, bloatware, UWP, games, graphics APIs, or otherwise) may be there causing trouble?

Then there is this:

"and there are other applications that can better utilize the topology and capabilities of our new CPU with some targeted optimizations."(1)

That "topology and capabilities" part of the statement is very interesting and needs to be looked at.

"Simultaneous Multi-threading (SMT)

Finally, we have investigated reports of instances where SMT is producing reduced performance in a handful of games. Based on our characterization of game workloads, it is our expectation that gaming applications should generally see a neutral/positive benefit from SMT. We see this neutral/positive behavior in a wide range of titles, including: Arma® 3, Battlefield™ 1, Mafia™ III, Watch Dogs™ 2, Sid Meier’s Civilization® VI, For Honor™, Hitman™, Mirror’s Edge™ Catalyst and The Division™. Independent 3rd-party analyses have corroborated these findings.

For the remaining outliers, AMD again sees multiple opportunities within the codebases of specific applications to improve how this software addresses the “Zen” architecture. We have already identified some simple changes that can improve a game’s understanding of the "Zen" core/cache topology, and we intend to provide a status update to the community when they are ready."(1)

so SMT is generally neutral/positive:

"it is our expectation that gaming applications should generally see a neutral/positive benefit from SMT. We see this"

Then there is this part of the SMT response:

For the remaining outliers, AMD again sees multiple opportunities within the codebases of specific applications to improve how this software addresses the “Zen” architecture. We have already identified some simple changes that can improve a game’s understanding of the "Zen" core/cache topology, and we intend to provide a status update to the community when they are ready."(1)

Again with the: " "Zen" core/cache topology".

So maybe some games are setting core affinity themselves, and doing it in such a way that performance suffers on the Zen CCX units, to the detriment of the game's performance(?).

Is there any NDA between AMD and M$ with respect to avoiding discussion of any problem issues in public? Intel sure uses NDAs that way, so it's safe to assume there may be NDA-related issues involved.

(1)

https://community.amd.com/community/gaming/blog/2017/03/13/amd-ryzen-com...

March 13, 2017 | 11:56 PM - Posted by Anonymous (not verified)

Your ns measurements don't seem to match other reviewers': http://www.hardware.fr/articles/956-24/retour-sous-systeme-memoire-suite...

March 14, 2017 | 05:53 AM - Posted by tomiw (not verified)

Hi,

Is there any chance of doing the core ping test on a 2-CPU Intel Xeon machine?
It would be very interesting :)

Or maybe you can share the binaries and I can test it on a 2-CPU Xeon server :)

Cheers,
Tomek

March 14, 2017 | 07:28 AM - Posted by Anonymous (not verified)

Ryan,
You were curious about the impact of cross-CCX communication on performance in games:
http://www.pcgameshardware.de/Ryzen-7-1800X-CPU-265804/Tests/Test-Review...

the biggest difference between 4+0 config (one CCX disabled) and 2+2 config (2 cores in each CCX disabled) was in BF1:
http://www.pcgameshardware.de/screenshots/original/2017/03/Ryzen-R7-1800...

but still not that big, and in all four games the 3+3 config was much faster (and the full 4+4 even more so) at every point in the histogram (still, a frametime graph would be better). So it seems the cross-CCX communication issue is a bit exaggerated, and that 80 ns+ memory access time probably has a bigger impact in games. The IMC seems poor, but faster RAM can help by a fair margin:
http://www.legitreviews.com/wp-content/uploads/2017/03/deus-ex-gaming.jpg (only 1 game tested unfortunately)

March 14, 2017 | 08:25 AM - Posted by drayzen (not verified)

Hi guys,
Thanks for continuing to dig into this; very interesting.
Could you post some of the ping charts with RAM overclocked at various speeds?
I'm really curious to see how each of the three different latencies is affected...
If you really feel like digging in further, perhaps also the same speed with different timings?
Can we get a link to download the app you're using to ping the cores?

March 14, 2017 | 08:47 AM - Posted by Anonymous (not verified)

The issue is not with SMT, or the Windows scheduler (directly); it's that some games are using the wrong 'topology map' for Ryzen.

https://community.amd.com/community/gaming/blog/2017/03/14/tips-for-buil...

By tweaking some settings in F1's topology information, they got a 36% performance increase. No doubt other games suffer from similar problems, but if the fix is as simple as editing a text file, I wouldn't expect it to be an issue for long.

How much effect it will have on individual games will vary, and some that already use the correct topology map probably won't see any increase.

Anyway, the fault is known, and they are working on it. So we'll just have to wait and see.

March 14, 2017 | 01:53 PM - Posted by Dude (not verified)

Hello,

Would it be possible to do some benchmarks by setting the affinity of the benchmark/game process manually to only the cores in a single CCX?

There is a simple way of doing that from task manager:
http://www.tech-recipes.com/rx/37272/set-a-programs-affinity-in-windows-...
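Task Manager and Windows' `start /affinity <hexmask>` command both express the allowed CPUs as a bitmask (bit N set = logical CPU N allowed). A small sketch of computing such a mask; the assumption that logical CPUs 0-7 make up CCX0 on a Ryzen 7 with SMT enabled is hypothetical, not from the article:

```python
def affinity_mask(cpus):
    """Bitmask with bit N set for each allowed logical CPU N, in the
    format consumed by Task Manager / `start /affinity <hexmask>`."""
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return mask

# Assumption: logical CPUs 0-7 are CCX0 (SMT siblings adjacent).
print(hex(affinity_mask(range(8))))        # 0xff -> start /affinity FF game.exe
print(hex(affinity_mask(range(0, 8, 2))))  # 0x55 -> physical cores of CCX0 only
```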

March 14, 2017 | 04:24 PM - Posted by Anonymous (not verified)

Lies upon lies to cover computer media asses & Microsoft!

You have to have a specific "updated" version of Windows 10 that has quietly been fixed; funny how it fails to appear on our screens in your faked threads!

None of the previous benchmarks in the press were accurate. Even AMD is telling porkies, as we found an 18% uplift benchmarking on Windows 7 vs. the older versions of Win 10!

March 23, 2017 | 09:48 PM - Posted by Agent005 (not verified)

AMD said they "believe" Windows 10 is behaving properly in terms of its scheduler...but users are seeing some pretty dramatic improvements after the recent Windows 10 update on Mar 16:
https://www.reddit.com/r/Amd/comments/60g5cj/windows_10_update_on_16_mar...

What are your thoughts?

August 17, 2017 | 12:17 PM - Posted by Saul Luizaga (not verified)

Can you please abandon the stupid Reddit style and make a more coherent way of creating sub-threads in the comment section? Or at least add a link to the next logical post in the same sub-thread? Because the way it is now, it's just a bunch of posts that don't make much sense.
