Ryzen Locking on Certain FMA3 Workloads

Subject: Processors | March 15, 2017 - 05:51 PM |
Tagged: ryzen, Infinity Fabric, hwbot, FMA3, Control Fabric, bug, amd, AM4

Last week a thread on the HWBOT forum described a particular workload that produces a hard lock every time it is run.  It was tested on a variety of motherboards and Ryzen processors, from the 1700 to the 1800X.  At default power and clock settings every sample locked up, both the chips the original testers worked on and the products that other contributors were able to test themselves.


This is quite reminiscent of the Coppermine-based Pentium III 1133 MHz processor from Intel, which failed in one specific workload (compiling).  Intel had shipped a limited number of these CPUs at that time, and Kyle from HardOCP and Tom from Tom’s Hardware were the first to show this behavior in a repeatable environment.  Intel stopped shipping these models and had to wait until the Tualatin version of the Pentium III was released to reach that speed (and above) while remaining stable in all workloads.

The interesting thing about this FMA3 finding is that it does not appear on some overclocked Ryzen chips.  To me this indicates that it could be a power delivery issue within the chip.  A particular workload that leans heavily on the FPU could require more power than the chip’s Control Fabric can deliver, therefore causing a hard lock.  On several of the overclocked chips tested, which have much more power pushed to them, it seems that enough power reaches that specific area of the chip to let the operation complete successfully.

This implies to me that AMD does not necessarily have a bug such as Intel had with the infamous FDIV issue in the original Pentium, or such as AMD itself had with the B2 stepping of Phenom.  AMD has a very complex voltage control system that is managed by the Control Fabric portion of the Infinity Fabric.  A firmware or microcode update could potentially fix the problem.  If that is the case, AMD would simply increase the power supplied to the FPU/SIMD/SSE portion of the Ryzen cores.  This may come at the cost of lower burst speeds in order to keep TDP within the stated envelope.

A source at AMD has confirmed this issue and said that a fix will be provided via a motherboard firmware update.  More than likely this will come in the form of an updated AGESA.

Source: HWBOT Forums