Podcast #509 - Threadripper 2950X/2990WX, Multiple QLC SSDs, and more!

Subject: General Tech | August 16, 2018 - 03:16 PM |
Tagged: xeon, video, Turing, Threadripper, ssd, Samsung, QLC, podcast, PA32UC, nvidia, nand, L1TF, Intel, DOOM Eternal, asus, amd, 660p, 2990wx, 2950x

PC Perspective Podcast #509 - 08/16/18

Join us this week for discussion on Modded Thinkpads, EVGA SuperNOVA PSUs, and more!

You can subscribe to us through iTunes and you can still access it directly through the RSS page HERE.

The URL for the podcast is: http://pcper.com/podcast - Share with your friends!

Hosts: Ryan Shrout, Jeremy Hellstrom, Josh Walrath, Allyn Malventano

Peanut Gallery: Ken Addison, Alex Lustenberg

Program length: 1:35:10

Podcast topics of discussion:
  1. There is no 3
  2. Week in Review:
  3. News items of interest:
  4. Other stuff
  5. Picks of the Week:
  6. Closing/outro
 
 
August 16, 2018 | 08:04 PM - Posted by OhThisDamnPorridgeBlowsChunksCuzThatsWhatItsMadeOf (not verified)

Buildzoid's got another video up; this time he's ranting about the MBs he is currently using, mostly AM4/TR4 MBs, covering overclocking, PBO and other features, along with the love/hate sort of BIOS issues. One thing is certain as far as BZ is concerned: he has yet to get a MB with any sort of good VRM matched with a good BIOS/BIOS UI or a proper full feature set of BIOS settings for OC. So Goldilocks still cannot find that just-right bowl of MB VRM/BIOS/VRM-heatsink OC porridge.

That Kazon-Nistrim-sect level of bad bedhead hair he's sporting may shock you at first in that video, but you'll feel for his frustration at the amount of crap the MB makers are throwing out there lately.

August 16, 2018 | 08:22 PM - Posted by FiresSalesOnFirstGenRyzen (not verified)

Is $179.99 a good price for a Ryzen 7 1700? Some poster over at r/Amd was going on about that price at Newegg with a promo code about 5 hours ago.

August 17, 2018 | 03:34 AM - Posted by Anony mouse (not verified)

Thumbnail caption

"Oh, that's right. Football season is upon us and I'm still a Bengals fan"

August 18, 2018 | 03:36 PM - Posted by James

I still don't know what you (Ken) are talking about when you say that the Infinity Fabric is "half speed" in the 4-die configuration vs. the two-die configuration of Threadripper.

August 19, 2018 | 04:25 PM - Posted by YourOwnBootStrapsHelps (not verified)

Read (1) and (2), and even the links provided in (1) and (2), and try not to depend on any marketing-driven websites for the total picture. SlideShare has some of AMD's professional trade-show slide presentations as well, and YouTube usually gets those very same presentations some time after the trade show is over.

Buildzoid's videos can be helpful, as can der8auer's (he has some engineering background). Also Real World Tech, and the Anandtech forums with any posts by "The Stilt" on the "Ryzen: Strictly Technical" thread and others.
If you want, there is the paywalled publication Microprocessor Report, and the other more peer-reviewed academic computing trade journals under the main heading of the IEEE Computer Society and others. Hot Chips: A Symposium on High Performance Chips starts today, with that big event running from August 19-21, 2018.

(1)

"AMD Announces Threadripper 2, Chiplets Aid Core Scaling"

https://fuse.wikichip.org/news/1569/amd-announces-threadripper-2-chiplet...

(2)

"Infinity Fabric (IF) - AMD"

https://en.wikichip.org/wiki/amd/infinity_fabric

August 19, 2018 | 06:56 PM - Posted by James

I have read the wikichip article before and I do not see any answer to this question. Going from 2 die to 4 die goes from one link to one other die to 3 links to 3 other die. It is fully connected with bi-directional links (transfers in both directions can happen simultaneously). The memory bandwidth is obviously shared, so there will be more stress on the memory controllers and more stress on the links going between the die with memory controllers and those without. It has more links, though. Remote memory is still remote memory. If you have memory interleaved, each die without a memory controller would be accessing memory across two links, not one. If memory isn't interleaved, then you are accessing remote memory across one link, but not the same link as the other die; they both have separate links to the dies with memory controllers. The two die with memory controllers also have a separate link between them, the same link as in the two-die Threadripper. There isn't anything here that I would attribute to being "half speed" as far as the Infinity Fabric is concerned. The link between the two die with memory controllers wouldn't be getting any extra memory transfers at all; they would all be going over the other links that are not present in a two-die config.

There is obviously going to be more demand on the memory controllers with up to 16 extra cores using them. The average latency also takes a hit for the remote cores. The memory bandwidth available is obviously the same between the 2- and 4-die variants, but the Infinity Fabric scales with the number of die in a single socket since it is fully connected. It is only when you go to 2 sockets that you effectively have shared links, since it is connected like a cube with an 'X' on each end: fully connected within one socket, but only 4 links going to the other socket. If you have the 'X' on the top and bottom, then going from the top corner to the opposing bottom corner of the cube is two hops.
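That topology claim (fully connected within a socket, four cross-socket links pairing each die with its counterpart, worst case two hops) can be sanity-checked with a small graph sketch. The die numbering and link list below are illustrative, not AMD's actual routing tables:

```python
from itertools import combinations
from collections import deque

def hops(links, a, b):
    """BFS shortest-path hop count between dies a and b."""
    adj = {}
    for x, y in links:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for n in adj[node] - seen:
            seen.add(n)
            frontier.append((n, dist + 1))
    return None

# One socket: 4 dies, fully connected (6 IFOP links) -> every die is 1 hop away.
one_socket = list(combinations(range(4), 2))
assert max(hops(one_socket, a, b) for a, b in combinations(range(4), 2)) == 1

# Two sockets: dies 0-3 and 4-7, each socket fully connected internally,
# plus 4 IFIS links pairing die i with die i+4 (the "cube with an X on each end").
two_socket = (list(combinations(range(4), 2))
              + [(a + 4, b + 4) for a, b in combinations(range(4), 2)]
              + [(i, i + 4) for i in range(4)])
print(max(hops(two_socket, a, b) for a, b in combinations(range(8), 2)))  # worst case: 2 hops
```

The worst case is indeed two hops: a die reaching a non-counterpart die in the other socket goes through either its own counterpart or a local neighbor first.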

I suspect the lower performance for the WX series seen in some applications is in large part due to the Windows scheduler and applications not designed to play well with NUMA and/or many cores. The WX Threadripper looked spectacular in the Phoronix Linux review, but that is on an OS with a lot of NUMA optimizations (most HPC is on Linux these days) and with applications that have probably at least been tested on NUMA systems, if not explicitly optimized for NUMA. I don't know what settings Windows has available. For any Threadripper, it might be interesting to try some different mappings, like binding the Nvidia driver to one die (if possible), preferably the one actually connected to the GPU, and then binding the game engine / code to the other die. The GPU is directly connected to one of the 4 CPU die. It may be that it just isn't going to perform well without optimizations in the application for NUMA.
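On Linux, the binding experiment suggested above can be approximated with `os.sched_setaffinity`. The CPU numbers below are a hypothetical layout for one die with local memory; real numbering varies by BIOS/OS enumeration, so check `numactl --hardware` or `lscpu` first:

```python
import os

# Hypothetical layout: NUMA node 0 (a die with local memory) exposes
# logical CPUs 0-7 plus their SMT siblings 32-39. This is an assumption
# for illustration, not the guaranteed 2990WX numbering.
DIE0_CPUS = set(range(0, 8)) | set(range(32, 40))

# Intersect with what this machine actually has, so the sketch also runs
# on smaller boxes; fall back to the full available set if nothing matches.
avail = os.sched_getaffinity(0)
target = (DIE0_CPUS & avail) or avail

os.sched_setaffinity(0, target)   # pid 0 = the calling process
print(sorted(os.sched_getaffinity(0)))
```

The same idea from the command line would be `numactl --cpunodebind=0 --membind=0 ./app`, which also keeps allocations on that die's local memory.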

August 19, 2018 | 11:35 PM - Posted by ItsVeryComplexTheNutsAndBoltsOfTheIFareNotFullyRevealed (not verified)

I myself do not know what that "half speed" is about, so if you cannot find any information then ignore that "half speed" statement, because the wikichip links are a good source compared to Ken's statement.

With everything going through the CAKEs, and the CAKEs operating at memory clock speed, everything is going at the same speed but with different latencies for the 2 compute dies with far memory access. Latency will go up for any far memory requests, but the data clock speed over the CAKEs and IF is at that fixed speed according to whatever the memory clock speed is.

The only thing that I see in the wikichip documentation is this statement:

"For a system using DDR4-2666 DIMMs, the CAKEs will be operating at 1333.33 MHz meaning the IFIS will have a bidirectional bandwidth of 37.926 GB/s."(2)

That's for the IFIS inter-socket links (for 2P/dual-socket systems); the Infinity Fabric On-Package (IFOP) links are not mentioned there. But where it does mention "half", it concerns that 2666 memory being clocked at 1333.33 MHz, so for memory of that speed the CAKE clock rate is 1333.33 MHz.

IF/SDF and memory access via the CAKEs will be tied to that memory clock speed. Single-rank DIMMs at one per channel is probably the only way to get the highest memory clocks, but it's latency, not speed, that's a factor for any TR2 compute dies trying to access far memory on the SOC dies that do have those memory controllers.

But also there is this:

"Due to the performance sensitivity of the on-package links, the IFOP links are over-provisioned by about a factor of two relative to DDR4 channel bandwidth for mixed read/write traffic. They are bidirectional links and a CRC is transmitted along with every cycle of data. The IFOP SerDes do four transfers per CAKE clock."(2) [See links above]

So for the IFOP (Infinity Fabric On-Package), the SerDes do quad transfers per CAKE clock, and the CAKE runs at memory clock speed. That over-provisioning does help keep the extra memory-channel requests from affecting the 32/24-core TR2 as much, but does nothing to help the added IF latency of any threads running on the compute dies when they need to issue R/W traffic requests via far memory accesses.
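As a sanity check, the quoted numbers do work out arithmetically if you take the link widths and transfer rates from the wikichip article quoted above (16-bit IFIS lanes at 8 transfers per CAKE clock with a ~8/9 effective rate after CRC/encoding overhead; a 32-bit IFOP path at 4 transfers per CAKE clock). Those widths are assumptions from that article, not something stated in this thread:

```python
# Sanity-check of the quoted Infinity Fabric numbers for DDR4-2666.
cake_clock_ghz = 4 / 3                 # CAKE runs at MEMCLK: 1333.33 MHz

# IFIS (socket to socket): 16-bit lanes, 8 transfers per CAKE clock,
# ~8/9 effective rate after CRC/encoding overhead, two directions.
ifis_bidir = cake_clock_ghz * (16 / 8) * 8 * (8 / 9) * 2
print(f"IFIS bidirectional: {ifis_bidir:.3f} GB/s")    # ~37.926, matching the quote

# IFOP (die to die on package): 32-bit path, 4 transfers per CAKE clock,
# two directions, no encoding overhead on-package.
ifop_bidir = cake_clock_ghz * (32 / 8) * 4 * 2
ddr4_channel = cake_clock_ghz * 2 * 8  # 2666 MT/s x 64-bit bus
print(f"IFOP bidirectional: {ifop_bidir:.3f} GB/s")    # ~42.667
print(f"DDR4-2666 channel:  {ddr4_channel:.3f} GB/s")  # ~21.333
```

Note the IFOP figure is exactly twice the single-channel DRAM bandwidth, which matches the "over-provisioned by about a factor of two" line in quote (2).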

This performance result on Windows being much worse than on Linux is, as you have said, probably related to the Windows scheduler not being aware of TR2's memory topology. That is not a normal sort of arrangement that the TR2 24/32-core SKUs have, with 2 compute dies and the other 2 SOC dies that have their own local memory channels. Maybe the games even need to be asking for SOC-die affinity for the major gaming control threads, and there is a gaming mode for TR2 that reduces the number of active dies for gaming.

All the game's draw-call threads really need to be located on a single SOC die to reduce latency to a minimum for gaming, but any other non-latency-sensitive gaming tasks can be done on the other dies, starting with the other SOC die that has its own local memory and then the compute dies.

TR2 is so new, with that new SOC-die and compute-die topology, that games need time to be refactored to recognise the new topology. Linux being used for servers more than Windows means that the Linux kernel is very tuned for NUMA workloads compared to consumer Windows. Windows 10's bloat is also probably taking up too much of the cores/threads on the dies with local memory access, and that needs to be stopped too, in addition to MS getting some better NUMA-aware scheduling working for its OS.

Hopefully with Zen 2 AMD can do more to improve memory latency, and the lowest-hanging fruit for any latency improvements is a larger L3 cache. Games software will want to make more use of the TR2 SOC dies for draw calls and set die affinity for the draw calls, and even keep the OS from moving active draw-call threads around while they are still doing work. Zen/Zen+ is still mostly first-generation Zen, with Zen+ getting some minor tweaks, and maybe with Zen 2 some of the latency issues can be solved.

There is still information missing from the block diagrams, but those SerDes can run much faster if needed at the cost of extra power. The CAKEs and also the CCMs need deeper dives into their workings, and reference (2) also mentions:

"From an operational point of view, the IOMS and the CCMs are actually the only interfaces that are capable of making DRAM requests."(2) [see links above in the first post of this thread]

There is a SlideShare presentation on Epyc from the Cool Chips conference (Cool Chips concerns energy efficiency in processors); the slides have more detailed block diagrams. Epyc and TR/TR2 are still similar enough that these slides will help, but maybe there is a video with the actual speaker's content:

"From Cool Chips 2018, Jay Fleischman, Senior Fellow at AMD, presents AMD EPYC™ Microprocessor Architecture "

https://www.slideshare.net/AMD/amd-epyc-microprocessor-architecture

August 21, 2018 | 09:08 PM - Posted by James

You don't need to reply to every post, especially when it is specifically targeted at someone else. The only way to resolve this is for Ken to explain what he was talking about. He said it both in the review and on the podcast, so I wanted to make sure I hadn't missed something. Ryan immediately agreed with him, too. There would possibly be more snoop traffic going over the diagonal link between the two die with memory, but it should not see much of any extra memory traffic. The added compute dies have their own dedicated links to the other die. Distributed architectures have their advantages.

August 19, 2018 | 01:13 PM - Posted by ColossalCollapse (not verified)

About the pronunciation of "Turing" (yeah, I watched this episode only today; I have to take time where I can get it)...

If I am not mistaken, the GPU architecture is named after Alan Turing. The "U" in his name is indeed pronounced like om "Touring" and absolutely not like in "Tina Turner". The British government thoroughly butchered his life; please don't butcher his name too... ;-)

August 19, 2018 | 01:14 PM - Posted by ColossalCollapse (not verified)

s/om/in/ ...sigh...
