Subject: Processors | December 31, 2018 - 09:46 PM | Tim Verry
Tagged: SMT, self driving car, cortex A65AE, armv8-a, arm safety ready, arm
Over the holidays I noticed that ARM released information on a new core design aimed at autonomous driving systems. The Cortex-A65AE is part of the company's Automotive Enhanced lineup and follows the Cortex-A76AE, with its Split-Lock and other features that are part of ARM's Safety Ready program.
Aimed at processors that will be used in self-driving cars, advanced driver assistance systems (ADAS), aviation, and industrial automation, the Cortex-A65AE core design integrates several safety and redundancy features that meet ASIL D requirements, the most stringent Automotive Safety Integrity Level defined by ISO 26262, the functional safety standard for road vehicles. Processors will be able to have up to eight cores and will support SMT, with each physical core able to run two threads (at different exception levels and/or under different OSes). The cores can be run independently for performance or in lock step for redundancy and integrity checking, comparing each other's calculation results (Split-Lock and Dual Core Lock Step respectively). With simultaneous multithreading, two threads on a physical core can operate in lock step mode with two other threads on a different physical shadow core, according to AnandTech.
ARM has not yet released full details about the Cortex-A65AE core, but it uses a 64-bit out-of-order execution pipeline and is based on ARMv8-A. It can be customized to suit the needs of ARM's partners, so exact chip specifications will differ, but in general Cortex-A65AE cores can have 16 to 64 KB L1 instruction and data caches, 64 to 256 KB of L2 cache, and an optional L3 cache of up to 4 MB. Other features include support for ARM TrustZone, ECC memory, and ACP connections for accelerators. The new cores are built with ARM's DynamIQ technology and are slated to be used in chips built on the 7nm process node.
According to ARM, Cortex-A65AE cores are 70% faster in integer performance per core and offer up to 3.5 times the memory throughput and six times the read bandwidth for ACP accelerators versus the existing Cortex-A53 cores. The notable performance jump is likely the result of a combination of moving to a smaller process node, the addition of SMT, architectural improvements, and cache and inter-chip routing optimizations.
ARM is positioning the Cortex-A65AE as complementary to the Cortex-A76AE, which is to say that the new core is not a direct replacement for it. While the Cortex-A76AE is high performance, the A65AE is high throughput, and both cores reportedly have their place in future ADAS and self-driving cars. The Cortex-A65AE cores can be clustered together to do the initial processing and sensor fusion calculations from all of the inputs from cameras, radar, lidar, and other hardware. From there, clusters including Cortex-A76AE cores (or a mix of the two) along with other accelerators can be responsible for making the decisions based on the sensor information. How well it works in practice and how this heterogeneous setup will compare to competing offerings from NVIDIA, Intel/Mobileye, and others remains to be seen. I am all for the self-driving car future, though, so more competition and development in that space is always nice to see, even if it's still a ways off yet!
The Cortex-A65AE being the first Cortex-A core to feature multithreading is also interesting, and I am very curious whether we will see that capability expanded to other ARM processors outside of the AE series. While SMT may not be worth it for mobile devices like smartphones and even tablets, perhaps future ARM-powered Always Connected Windows notebook PCs will use processors with SMT-capable cores, as it would be easier to justify the extra cost in power and size to include multithreading.
What are your thoughts?
(PS I hope everyone had a safe holiday or at least a good week if you don't celebrate! I am looking forward to 2019 and continuing to serve you with bad puns and allegedlys technology coverage!)
Subject: Processors | March 13, 2017 - 08:48 PM | Sebastian Peak
Tagged: Windows 7, windows 10, thread scheduling, SMT, ryzen, Robert Hallock, processor, cpu, amd
AMD's Robert Hallock (previously the Head of Global Technical Marketing for AMD and now working full time on the CPU side of things) has posted a comprehensive Ryzen update, covering AMD's official stance on Windows 10 thread scheduling, the performance implications of SMT, Windows power management settings, and more. The post in its entirety is reproduced below, and also available from AMD by following this link.
It’s been about two weeks since we launched the new AMD Ryzen™ processor, and I’m just thrilled to see all the excitement and chatter surrounding our new chip. Seems like not a day goes by when I’m not being tweeted by someone doing a new build, often for the first time in many years. Reports from media and users have also been good:
- “This CPU gives you something that we needed for a long time, which is a CPU that gives you a well-rounded experience.” –JayzTwoCents
- Competitive performance at 1080p, with Tech Spot saying the “affordable Ryzen 7 1700” is an “awesome option” and a “safer bet long term.”
- ExtremeTech showed strong performance for high-end GPUs like the GeForce GTX 1080 Ti, especially for gamers that understand how much value AMD Ryzen™ brings to the table
- Many users are noting that the 8-core design of AMD Ryzen™ 7 processors enables “noticeably SMOOTHER” performance compared to their old platforms.
While these findings have been great to read, we are just getting started! The AMD Ryzen™ processor and AM4 Platform both have room to grow, and we wanted to take a few minutes to address some of the questions and comments being discussed across the web.
We have investigated reports alleging incorrect thread scheduling on the AMD Ryzen™ processor. Based on our findings, AMD believes that the Windows® 10 thread scheduler is operating properly for “Zen,” and we do not presently believe there is an issue with the scheduler adversely utilizing the logical and physical configurations of the architecture.
As an extension of this investigation, we have also reviewed topology logs generated by the Sysinternals Coreinfo utility. We have determined that an outdated version of the application was responsible for originating the incorrect topology data that has been widely reported in the media. Coreinfo v3.31 (or later) will produce the correct results.
Finally, we have reviewed the limited available evidence concerning performance deltas between Windows® 7 and Windows® 10 on the AMD Ryzen™ CPU. We do not believe there is an issue with scheduling differences between the two versions of Windows. Any differences in performance can be more likely attributed to software architecture differences between these OSes.
Going forward, our analysis highlights that there are many applications that already make good use of the cores and threads in Ryzen, and there are other applications that can better utilize the topology and capabilities of our new CPU with some targeted optimizations. These opportunities are already being actively worked via the AMD Ryzen™ dev kit program that has sampled 300+ systems worldwide.
Above all, we would like to thank the community for their efforts to understand the Ryzen processor and reporting their findings. The software/hardware relationship is a complex one, with additional layers of nuance when preexisting software is exposed to an all-new architecture. We are already finding many small changes that can improve the Ryzen performance in certain applications, and we are optimistic that these will result in beneficial optimizations for current and future applications.
The primary temperature reporting sensor of the AMD Ryzen™ processor is a sensor called “T Control,” or tCTL for short. The tCTL sensor is derived from the junction (Tj) temperature—the interface point between the die and heatspreader—but it may be offset on certain CPU models so that all models on the AM4 Platform have the same maximum tCTL value. This approach ensures that all AMD Ryzen™ processors have a consistent fan policy.
Specifically, the AMD Ryzen™ 7 1700X and 1800X carry a +20°C offset between the tCTL° (reported) temperature and the actual Tj° temperature. In the short term, users of the AMD Ryzen™ 1700X and 1800X can simply subtract 20°C to determine the true junction temperature of their processor. No arithmetic is required for the Ryzen 7 1700. Long term, we expect temperature monitoring software to better understand our tCTL offsets to report the junction temperature automatically.
The table below serves as an example of how the tCTL sensor can be interpreted in a hypothetical scenario where a Ryzen processor is operating at 38°C.
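The offset arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function names and the lookup table are hypothetical, and the only figures used are the ones stated in the post (a +20°C offset on the Ryzen 7 1700X and 1800X, none on the Ryzen 7 1700, and a hypothetical 38°C junction temperature).

```python
# Offsets as stated in the post: +20 C on the 1700X and 1800X, none on the 1700.
# The dictionary and function names are illustrative, not an AMD API.
TCTL_OFFSET_C = {
    "Ryzen 7 1800X": 20.0,
    "Ryzen 7 1700X": 20.0,
    "Ryzen 7 1700": 0.0,
}


def reported_tctl_c(model: str, tj_c: float) -> float:
    """Return the tCTL value reported for a given junction temperature."""
    return tj_c + TCTL_OFFSET_C[model]


def junction_temp_c(model: str, tctl_c: float) -> float:
    """Return the junction temperature (Tj) for a reported tCTL reading."""
    return tctl_c - TCTL_OFFSET_C[model]


if __name__ == "__main__":
    # Hypothetical scenario from the text: every chip is actually at 38 C.
    for model in TCTL_OFFSET_C:
        tctl = reported_tctl_c(model, 38.0)
        tj = junction_temp_c(model, tctl)
        print(f"{model}: tCTL {tctl:.0f} C, offset {TCTL_OFFSET_C[model]:.0f} C, Tj {tj:.0f} C")
```

In this scenario the 1700X and 1800X would report a tCTL of 58°C while the 1700 would report 38°C, even though all three are at the same junction temperature.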
Users may have heard that AMD recommends the High Performance power plan within Windows® 10 for the best performance on Ryzen, and indeed we do. We recommend this plan for two key reasons:
- Core Parking OFF: Idle CPU cores are instantaneously available for thread scheduling. In contrast, the Balanced plan aggressively places idle CPU cores into low power states. This can cause additional latency when un-parking cores to accommodate varying loads.
- Fast frequency change: The AMD Ryzen™ processor can alter its voltage and frequency states in the 1ms intervals natively supported by the “Zen” architecture. In contrast, the Balanced plan may take longer for voltage and frequency (V/f) changes due to software participation in power state changes.
In the near term, we recommend that games and other high-performance applications are complemented by the High Performance plan. By the first week of April, AMD intends to provide an update for AMD Ryzen™ processors that optimizes the power policy parameters of the Balanced plan to favor performance more consistent with the typical usage models of a desktop PC.
Simultaneous Multi-threading (SMT)
Finally, we have investigated reports of instances where SMT is producing reduced performance in a handful of games. Based on our characterization of game workloads, it is our expectation that gaming applications should generally see a neutral/positive benefit from SMT. We see this neutral/positive behavior in a wide range of titles, including: Arma® 3, Battlefield™ 1, Mafia™ III, Watch Dogs™ 2, Sid Meier’s Civilization® VI, For Honor™, Hitman™, Mirror’s Edge™ Catalyst and The Division™. Independent 3rd-party analyses have corroborated these findings.
For the remaining outliers, AMD again sees multiple opportunities within the codebases of specific applications to improve how this software addresses the “Zen” architecture. We have already identified some simple changes that can improve a game’s understanding of the "Zen" core/cache topology, and we intend to provide a status update to the community when they are ready.
Overall, we are thrilled with the outpouring of support we’ve seen from AMD fans new and old. We love seeing your new builds, your benchmarks, your excitement, and your deep dives into the nuts and bolts of Ryzen. You are helping us make Ryzen™ even better by the day. You should expect to hear from us regularly through this blog to answer new questions and give you updates on new improvements in the Ryzen ecosystem.
Such topics as Windows 7 vs. Windows 10 performance, SMT impact, and thread scheduling will no doubt still be debated, and AMD has correctly pointed out that optimization for this brand new architecture will only improve Ryzen performance going forward. Our own findings as to Ryzen and the Windows 10 thread scheduler appear to be validated as AMD officially dismisses performance impact in that area, though there is still room for improvement in other areas from our initial gaming performance findings. As mentioned in the post, AMD will have an update for Windows power plan optimization by the first week of April, and the company has "already identified some simple changes that can improve a game’s understanding of the 'Zen' core/cache topology, and we intend to provide a status update to the community when they are ready", as well.
It is refreshing to see a company publicly acknowledging the topics that have resulted in so much discussion in the past couple of weeks, and their transparency is commendable, with every issue (that this author is aware of) being touched on in the post.
** UPDATE 3/13 5 PM **
AMD has posted a follow-up statement that officially clears up much of the conjecture this article was attempting to clarify. Relevant points from their post that relate to this article as well as many of the requests for additional testing we have seen since its posting (emphasis mine):
"We have investigated reports alleging incorrect thread scheduling on the AMD Ryzen™ processor. Based on our findings, AMD believes that the Windows® 10 thread scheduler is operating properly for “Zen,” and we do not presently believe there is an issue with the scheduler adversely utilizing the logical and physical configurations of the architecture."
"Finally, we have reviewed the limited available evidence concerning performance deltas between Windows® 7 and Windows® 10 on the AMD Ryzen™ CPU. We do not believe there is an issue with scheduling differences between the two versions of Windows. Any differences in performance can be more likely attributed to software architecture differences between these OSes."
So there you have it, straight from the horse's mouth. AMD does not believe the problem lies within the Windows thread scheduler. SMT performance in gaming workloads was also addressed:
"Finally, we have investigated reports of instances where SMT is producing reduced performance in a handful of games. Based on our characterization of game workloads, it is our expectation that gaming applications should generally see a neutral/positive benefit from SMT. We see this neutral/positive behavior in a wide range of titles, including: Arma® 3, Battlefield™ 1, Mafia™ III, Watch Dogs™ 2, Sid Meier’s Civilization® VI, For Honor™, Hitman™, Mirror’s Edge™ Catalyst and The Division™. Independent 3rd-party analyses have corroborated these findings.
For the remaining outliers, AMD again sees multiple opportunities within the codebases of specific applications to improve how this software addresses the “Zen” architecture. We have already identified some simple changes that can improve a game’s understanding of the "Zen" core/cache topology, and we intend to provide a status update to the community when they are ready."
We are still digging into the observed differences of toggling SMT compared with disabling the second CCX, but it is good to see AMD issue a clarifying statement here for all of those out there observing and reporting on SMT-related performance deltas.
** END UPDATE **
Editor's Note: The testing you see here was a response to many days of comments and questions to our team on how and why AMD Ryzen processors are seeing performance gaps in 1080p gaming (and other scenarios) in comparison to Intel Core processors. Several outlets have posted that the culprit is the Windows 10 scheduler and its inability to properly allocate work across the logical vs. physical cores of the Zen architecture. As it turns out, we can prove that isn't the case at all. -Ryan Shrout
Initial reviews of AMD’s Ryzen CPU revealed a few inefficiencies in some situations, particularly in gaming workloads running at more common resolutions like 1080p, where the CPU becomes more of a bottleneck when coupled with modern GPUs. Lots of folks have theorized about what could be causing these issues, and most recent attention appears to have been directed at the Windows 10 scheduler and its supposed inability to properly place threads on the Ryzen cores for the most efficient processing.
I typically have Task Manager open while running storage tests (they are boring to watch otherwise), and I naturally had it open during Ryzen platform storage testing. I’m accustomed to how the IO workers are distributed across reported threads, and in the case of SMT capable CPUs, distributed across cores. There is a clear difference when viewing our custom storage workloads with SMT on vs. off, and it was dead obvious to me that core loading was working as expected while I was testing Ryzen. I went back and pulled the actual thread/core loading data from my testing results to confirm:
The Windows scheduler has a habit of bouncing processes across available processor threads. This naturally happens as other processes share time with a particular core, with the heavier process not necessarily switching back to the same core. As you can see above, the single IO handler thread was spread across the first four cores during its run, but the Windows scheduler was always hitting just one of the two available SMT threads on any single core at one time.
My testing for Ryan’s Ryzen review consisted of only single threaded workloads, but we can make things a bit clearer by loading down half of the CPU while toggling SMT off. We do this by increasing the worker count (4) to be half of the available threads on the Ryzen processor, which is 8 with SMT disabled in the motherboard BIOS.
SMT OFF, 8 cores, 4 workers
With SMT off, the scheduler is clearly not giving priority to any particular core and the work is spread throughout the physical cores in a fairly even fashion.
Now let’s try with SMT turned back on and doubling the number of IO workers to 8 to keep the CPU half loaded:
SMT ON, 16 (logical) cores, 8 workers
With SMT on, we see a very different result. The scheduler is clearly loading only one thread per core. This could only be possible if Windows were aware of the 2-way SMT (two threads per core) configuration of the Ryzen processor. Do note that sometimes the workload will toggle around every few seconds, but the total loading on each physical core will still remain at ~50%. I chose a workload that saturated its thread just enough for Windows to not shift it around as it ran, making the above result even clearer.
Synthetic Testing Procedure
While the storage testing methods above provide a real-world example of the Windows 10 scheduler working as expected, we do have another workload that can help demonstrate core balancing with Intel Core and AMD Ryzen processors. A quick and simple custom-built C++ application can be used to generate generic worker threads and monitor for core collisions and resolutions.
This test app has a very straightforward workflow. Every few seconds it generates a new thread, capping at N/2 threads total, where N is equal to the reported number of logical cores. If the OS scheduler is working as expected on a 16-thread processor, it should load 8 threads across 8 physical cores, though which logical core is used within each physical core will depend on very minute parameters and conditions going on in the OS background.
By monitoring the APIC_ID through the CPUID instruction, the first application thread monitors all threads and detects and reports on collisions - when a thread from our app is running on the same core as another thread from our app. That thread also reports when those collisions have been cleared. In an ideal and expected environment where Windows 10 knows the boundaries of physical and logical cores, you should never see more than one thread of a core loaded at the same time.
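The collision check described above can be sketched as follows. This is not the test application itself (which is a C++ program reading the APIC ID via CPUID); it is a minimal Python illustration of the bookkeeping, under the assumption that SMT siblings have adjacent logical core numbers (LCore 0/1, LCore 2/3, and so on). The function names and the thread labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def physical_core(logical_id: int, threads_per_core: int = 2) -> int:
    """Map a logical core number to its physical core.

    Assumption: SMT siblings are numbered adjacently (0/1, 2/3, ...).
    """
    return logical_id // threads_per_core


def find_collisions(thread_to_lcore: Dict[str, int],
                    threads_per_core: int = 2) -> Dict[int, List[str]]:
    """Return {physical core: [thread names]} for cores running more than one of our threads."""
    by_core = defaultdict(list)
    for name, lcore in thread_to_lcore.items():
        by_core[physical_core(lcore, threads_per_core)].append(name)
    return {core: sorted(names) for core, names in by_core.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical snapshot: worker-a and worker-b landed on LCore 0 and LCore 1,
    # which share physical core 0, so this is reported as a collision.
    snapshot = {"worker-a": 0, "worker-b": 1, "worker-c": 4}
    print(find_collisions(snapshot))
```

A scheduler that understands the topology should produce an empty result for every snapshot as long as the number of workers does not exceed the number of physical cores.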
This screenshot shows our app working on the left and the Windows Task Manager on the right with logical cores labeled. While it may look like all logical cores are being utilized at the same time, in fact they are not. At any given point, only one of LCore 0 or LCore 1 is actively processing a thread. Need proof? Check out the modified view of the Task Manager where I copy the graph of LCore 1/5/9/13 over the graph of LCore 0/4/8/12 with inverted colors to make it easier to read.
If you look closely, by overlapping the graphs in this way, you can see that the threads migrate from LCore 0 to LCore 1, LCore 4 to LCore 5, and so on. The graphs intersect and fill in to consume ~100% of the physical core. This pattern is repeated for the other 8 logical cores on the right two columns as well.
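The graph overlay amounts to adding the utilization of the two logical cores that share a physical core. A minimal Python sketch of that sum is below; it assumes sibling logical cores are listed adjacently, and the sample percentages are made up for illustration rather than taken from the screenshots.

```python
from typing import List


def per_core_load(lcore_util: List[float], threads_per_core: int = 2) -> List[float]:
    """Sum logical-core utilization (percent) into one total per physical core.

    Assumption: siblings are adjacent in the list (LCore 0/1, LCore 2/3, ...).
    """
    if len(lcore_util) % threads_per_core != 0:
        raise ValueError("logical core count must be a multiple of threads_per_core")
    return [
        sum(lcore_util[i:i + threads_per_core])
        for i in range(0, len(lcore_util), threads_per_core)
    ]


if __name__ == "__main__":
    # Illustrative numbers only: a thread that spent part of the interval on each
    # sibling shows up as two partial loads that add up to one fully busy core.
    sample = [60, 40, 100, 0, 25, 75, 0, 100]
    print(per_core_load(sample))
```

If each pair sums to roughly 100%, one thread's worth of work is being moved between siblings rather than two threads competing for the same physical core.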
Running the same application on a Core i7-5960X Haswell-E 8-core processor shows a very similar behavior.
Each pair of logical cores shares a single thread and when thread transitions occur away from LCore N, they migrate perfectly to LCore N+1. It does appear that in this scenario the Intel system is showing a more stable threaded distribution than the Ryzen system. While that may in fact incur some performance advantage for the 5960X configuration, the penalty for intra-core thread migration is expected to be very minute.
The fact that Windows 10 is balancing the 8 thread load specifically between matching logical core pairs indicates that the operating system is perfectly aware of the processor topology and is selecting distinct cores first to complete the work.
Information from this custom application, along with the storage performance tool example above, clearly shows that Windows 10 is attempting to balance work on Ryzen between cores in the same manner that we have experienced with Intel and its Hyper-Threaded processors for many years.