Let us see how you handle my Threadripper's new NUMA Dissociater attack!

Subject: General Tech | January 15, 2019 - 02:57 PM |
Tagged: amd, NUMA, Threadripper, numa dissociator, coreprio

With Threadripper, AMD introduced something new and different to the market, a HEDT architecture with nonuniform memory access.  This has met with mixed results, as is reasonable to expect from such a different chip design.  There has not been much out of Redmond to adapt Windows to handle this new design compared to the amount of work coming out of the enthusiast community, especially those using Linux.

Phoronix has recently benchmarked NUMA Dissociater, a mode of the CorePrio utility, pitting Windows 10 and Windows Server 2019 against Linux.  It was designed to address some performance issues on the Threadripper 2990WX and 2970WX better than AMD's Dynamic Local Mode, which can be enabled if you run their Ryzen Master software.  As you can see in the full review, the results are not earth-shattering, nor do they always increase performance, but the foundation for improvement is fairly solid.

"Here are some benchmarks of Windows 10 against Linux while trying out CorePrio's NUMA Dissociater mode to see how much it helps the performance compared to Ubuntu Linux. Additionally, tests are included of Windows Server 2019 to see if that server edition of Windows is able to offer better performance on this AMD HEDT NUMA platform."

Source: Phoronix

January 15, 2019 | 04:26 PM - Posted by LopsidedNUMA (not verified)

M$'s Thread Scheduler version of musical chairs on those asymmetrical NUMA node TR2 (TR 2990WX and TR 2970WX) SKUs is what's causing the performance degradation. All that thread hopping and the cache thrashing needed to maintain coherency is what's reducing performance. Move a thread across a CCX boundary or even a die boundary and cache requests have to cross boundaries, which adds to the latency.

Zen-2, with every die/chiplet getting a one-hop far access to the 14nm I/O die, will even things out a bit. And Zen-2's larger cache sizes, including tons more L3 cache, should help all around to reduce the latency-inducing system memory/DRAM accesses. Now if that 14nm I/O die had maybe some L4 cache, that would help things also. But Zen-2 is still under NDA, so that will have to wait until the release date to be known.

As far as TR2 (TR 2990WX and TR 2970WX) goes, M$/AMD need to work up some scheduler fix to address the problem. But with Zen-2 on the way, all TR3 SKUs will be Zen-2/die-chiplet based with the memory channels all on the 14nm I/O die, and that problem should go away.

Please let there also be some L4 cache on the 14nm I/O die with the logic to snoop the higher levels of caches on the Zen Dies/Chiplets and some AI based predictor logic that can keep most of the main gaming logic/code on some cache level before any DRAM access is needed.

"There has not been much out of Redmond to adapt Windows to handle this new design compared to the amount of work coming out of the enthusiast community, especially those using Linux."

That's because Linux has always been better at NUMA owing to the majority of servers using Linux Kernel based OS Distros for so many years. It's definitely not the Year of NUMA for Desktop Windows! So NUMA has arrived for the HEDT but M$ has some more work to do there compared to the Linux Kernel/Linux Scheduler!

Gaming code really needs to be able to set core/thread affinity and keep that for the duration of the game with no OS jumping in to move things around for the gaming application while it's still running. The only moving around of threads that should be done by the OS is to clear enough cores/threads on the same die for the gaming application to make use of. So that should be the OS clearing up at least 8 cores/16 threads on a single die with near memory access on those TR2(TR 2990WX and TR 2970WX) SKUs that have some dies without their own memory controllers.

The OS should be able to let any gaming application pin its cores/threads to one of the TR dies that has 2 memory channels, for that application's exclusive use, and use the other die with its own 2 memory controllers to service any OS/services-related threads and the memory requests from the 2 other dies that have no memory controllers.
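
As a rough illustration of the affinity pinning described above, here is a minimal Python sketch for Linux. It uses the standard library's `os.sched_getaffinity` and `os.sched_setaffinity`; the helper name `pin_to_cpus` is hypothetical, and which logical CPU ids belong to a die with local memory is an assumption the caller has to supply, since it depends on the specific system. On Windows the comparable Win32 call is `SetThreadAffinityMask`, which this sketch does not cover.

```python
import os

def pin_to_cpus(cpus):
    """Restrict the caller to the given logical CPU ids (Linux only).

    `cpus` is whatever set of ids the caller believes share a die;
    that mapping is an assumption and is not detected here.
    Returns the affinity set in effect afterwards.
    """
    allowed = os.sched_getaffinity(0)      # CPUs the caller may use right now
    target = set(cpus) & allowed           # never request a CPU we do not have
    if not target:
        raise ValueError("none of the requested CPUs are available")
    os.sched_setaffinity(0, target)        # pid 0 refers to the caller itself
    return os.sched_getaffinity(0)
```

A game launcher could call something like `pin_to_cpus(range(8))` before starting its worker threads, assuming the first eight logical CPUs sit on one die. Pinning only stops the scheduler from moving the process elsewhere; it does not clear other work off those cores.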

January 16, 2019 | 07:53 AM - Posted by gamerk2 (not verified)

Part of the issue is the way Microsoft designed their scheduler. At a really high level, the highest priority thread(s) at any point in time ALWAYS run. While great for maximizing bandwidth for a single application that doesn't use a lot of threads, once thread count starts to go up, Windows has a nasty tendency to start bouncing threads around between cores depending on what individual threads have the highest priority at that specific point in time.

This really doesn't work in a NUMA environment, as memory/cache coherency issues will outright tank performance. In that respect, Linux's thread pools offer much more predictable performance.
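
To make the thread-bouncing point concrete, here is a toy Python model. It is not how the Windows or Linux schedulers are implemented; it only assumes that the highest-priority threads run on every tick, and compares placing them on cores in priority order against keeping each thread on the core it last used. The function name `count_migrations` and the sample priorities are made up for illustration.

```python
def count_migrations(priorities_per_tick, n_cores, sticky):
    """Count how often a thread runs on a different core than last time."""
    last_core = {}                         # thread id -> core it last ran on
    migrations = 0
    for priorities in priorities_per_tick:
        # The n_cores highest-priority threads get to run this tick.
        running = sorted(priorities, key=priorities.get, reverse=True)[:n_cores]
        free = set(range(n_cores))
        placement = {}
        if sticky:
            # Let each thread keep its previous core when that core is free.
            for t in running:
                c = last_core.get(t)
                if c in free:
                    placement[t] = c
                    free.discard(c)
        # Everything else takes the lowest free core, in priority order.
        for t in running:
            if t not in placement:
                c = min(free)
                placement[t] = c
                free.discard(c)
        for t, c in placement.items():
            if t in last_core and last_core[t] != c:
                migrations += 1
            last_core[t] = c
    return migrations

# Two threads swap the top priority on every tick; a third never runs.
ticks = [
    {"a": 3, "b": 2, "c": 1},
    {"a": 2, "b": 3, "c": 1},
    {"a": 3, "b": 2, "c": 1},
]
print(count_migrations(ticks, n_cores=2, sticky=False))  # 4
print(count_migrations(ticks, n_cores=2, sticky=True))   # 0
```

In this model every migration stands in for a potential cross-CCX or cross-die move, which is where the coherency and latency cost would show up on a NUMA part.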

January 16, 2019 | 05:47 PM - Posted by James

Threadripper is a very small market segment. Epyc is NUMA also, but most Epyc servers are probably running Linux. Zen 2 will not be NUMA anymore, except for dual socket Epyc, so there isn’t going to be much interest in rewriting the Windows scheduler for just Threadripper. Windows is probably still way behind in some applications even on Epyc with all chips having local memory, so they really should do something different. Rewriting the core scheduler isn’t a quick job though.
