Ryzen Locking on Certain FMA3 Workloads

Subject: Processors | March 15, 2017 - 05:51 PM |
Tagged: ryzen, Infinity Fabric, hwbot, FMA3, Control Fabric, bug, amd, AM4

Last week a thread on the HWBOT forum described a workload that caused a hard lock every time it was run.  It was tested on a variety of motherboards and Ryzen processors from the 1700 to the 1800X.  At default power and clock settings every processor locked up, both the samples the testers worked with and the products that contributors were able to test themselves.

This is quite reminiscent of Intel's Coppermine-based Pentium III 1133 MHz processor, which failed in one specific workload (compiling).  Intel had shipped a limited number of these CPUs at the time, and Kyle from HardOCP and Tom from Tom’s Hardware were the first to show the behavior in a repeatable environment.  Intel stopped shipping these models and had to wait until the Tualatin version of the Pentium III was released to reach that speed (and above) while remaining stable in all workloads.

The interesting thing about this FMA3 finding is that the lock does not appear on some overclocked Ryzen chips.  To me this indicates that it could be a power delivery issue within the chip.  A workload that leans heavily on the FPU could require more power than the chip’s Control Fabric delivers to it, causing a hard lock.  Several of the overclocked chips tested had much more power being pushed to them, and it seems that this supplies enough power to that specific area of the chip for the operation to complete successfully.
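
For readers unfamiliar with the term, FMA3 instructions compute a*b + c as a single fused operation with one rounding step instead of two.  The short Python sketch below only emulates that arithmetic with exact fractions to show how the fused result can differ from a separate multiply and add.  It does not reproduce the HWBOT workload or the lock, and the function names are illustrative assumptions, not anything taken from the forum thread.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a*b + c with a single rounding (finite inputs only).

    The product and sum are computed exactly as rationals, then rounded
    once when converted back to a float.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Ordinary a*b + c: the product is rounded before the add."""
    return a * b + c


eps = 2.0 ** -52
a = 1.0 + eps
b = 1.0 - eps
c = -1.0

# The exact product is 1 - 2**-104, which rounds to 1.0 as a double,
# so the unfused version loses the small term entirely.
unfused = separate_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)

print(unfused == 0.0)             # True
print(fused == -(2.0 ** -104))    # True
```

A real FMA-heavy benchmark issues these operations across wide SIMD registers on every core at once, which is why such code is a common worst case for FPU power draw.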

This suggests to me that AMD does not necessarily have a bug of the kind Intel had with the infamous FDIV issue in the original Pentium, or that AMD itself had with the B2 stepping of Phenom.  AMD has a very complex voltage control system that is managed by the Control Fabric portion of the Infinity Fabric, so the problem could be fixable with a firmware or microcode update.  If this is the case, then AMD would simply increase the power being supplied to the FPU/SIMD/SSE portion of the Ryzen cores.  This may come at the cost of lower burst speeds in order to keep TDP within the stated envelope.

A source at AMD has confirmed this issue and said that a fix will be provided via a motherboard firmware update.  More than likely this will come in the form of an updated AGESA release.

 
Source: HWBOT Forums

March 15, 2017 | 01:47 PM - Posted by IntelShekel (not verified)

so what's the intel shekel to us dollar currency conversion these days?

asking for a friend.

March 15, 2017 | 02:01 PM - Posted by Jeremy Hellstrom

Imaginary currency can be whatever you want it to be.  Dream us up a tremendous life, would you?

March 15, 2017 | 07:52 PM - Posted by IntelShekel (not verified)

what would a Canadian know about currency?

asking for another friend.

March 16, 2017 | 03:17 AM - Posted by Martin Trautvetter

A lot. They're carrying it by the barrel load.

March 15, 2017 | 06:14 PM - Posted by Anonymous (not verified)

Wow, it's crazy that there's somebody here in the comments making the same pathetic antisemitic jokes as that loser MasterChen. Apparently even reporting NEWS from community sources that doesn't bow down before AMDesus makes you an Intel shill these days.

March 15, 2017 | 06:41 PM - Posted by Jeremy Hellstrom

You might have missed the comment where he implied Robert Hallock was one.

March 15, 2017 | 06:57 PM - Posted by Anonymous (not verified)

So rich... literally "AMD Tech Evangelist" on linkedin

March 15, 2017 | 07:28 PM - Posted by pdjblum

Jeremy, he implied that Hallock "was one" what, a jew?

is that a bad thing to imply?

Master Chen, as long as you do not plan on killing us or inciting people to kill us, it is your god given right to hate the chosen people, but I still think we could be friends.

March 15, 2017 | 07:45 PM - Posted by Jeremy Hellstrom

swing and a miss, schmuck

March 15, 2017 | 08:04 PM - Posted by pdjblum

are you calling me a "penis?"

what did you mean 'was one?"

that sounds fucking bigoted to me

see, even the nicest, sweetest guys, have it in their dna

March 16, 2017 | 01:02 PM - Posted by Jeremy Hellstrom

The comment I replied to was about Intel shills, as you well know.

Once again, I will state clearly.  This is a tech site, not a forum in which to discuss religion, politics or even bathing habits. 

As the children are getting out of hand, comments which have nothing to do with the posted article or tech in general are now going to start disappearing, even if it kills decent comments below the offending one.

March 16, 2017 | 09:42 PM - Posted by Anonymous (not verified)

Can't you just replace the comment with a "deleted" placeholder and keep any non-violating comments intact?

March 17, 2017 | 01:39 PM - Posted by Jeremy Hellstrom

Don't feed the trolls if you are worried about your comments being removed because of someone you replied to.

March 15, 2017 | 08:03 PM - Posted by oaklund8

How are there so many trolls in comment threads now? Not sure if it's multiple people or just one really lonely sad person, but it's getting ridiculous. Maybe you need an auto-mod that replaces the words "shekels" with "I secretly Love Josh"

March 15, 2017 | 09:36 PM - Posted by AT forums are the worst (not verified)

It's because PCper doesn't require a valid account to register. And they've become popular targets at places like AT.

March 15, 2017 | 10:26 PM - Posted by kenjo

it's getting worse and worse and yes one moron is posting under different names. At some point it's going to start hurting the site for real as people leave it.

March 16, 2017 | 01:46 AM - Posted by Anonymous (not verified)

I wouldnt stop reading pcper just because of some shithead trolls in the comment section. If anything it gets more views as people browse the technologically illiterate fuckwit's posts!

March 16, 2017 | 09:13 AM - Posted by Anonymously Anonymous (not verified)

Seriously PCPer, what is it going to take to make you guys either get rid of the shared anon account and/or have more moderators patrol the comments more frequently?

Or do you guys simply just get your jollies by reading the ugliness posted here?

Despite what others say, the crap that you guys let stay posted, AND THEN YOU EVEN RESPOND TO SOME OF IT and still leave it up, really makes the site look less professional than it could be.

March 16, 2017 | 09:15 AM - Posted by Anonymously Anonymous (not verified)

Don't get me wrong here, I really do like a lot of the content you guys produce. HOWEVER, I believe that patrolling that ugliness a lot better (read previous comment) and keeping it to a bare minimum would go a long way to improve people's perception of your site and return traffic to it.

March 15, 2017 | 06:23 PM - Posted by willmore

Reminds me of the SkyLake bug:
http://www.mersenneforum.org/showthread.php?t=20714

Except that one silently produced corrupted data instead of instantly crashing the machine. Neither is ideal.

March 15, 2017 | 06:31 PM - Posted by Jann5s

Is that a fan made box Josh?

lol: "Exactly what you wanted guaranteed"

reminded me of: http://www.gamespot.com/articles/fallout-fan-who-sent-2000-bottle-caps-t...

March 15, 2017 | 07:39 PM - Posted by Anonymous (not verified)

so is the Ryzen 5 1500x going to be slower in gaming than the pentium g3258 overclocked?

i'm guessing yes.

1600x slower than i3 7350K in gaming?

again guessing yes.

March 15, 2017 | 08:36 PM - Posted by Mr.Gold (not verified)

What's your average FPS on the g3258 in Dragon Age?

March 16, 2017 | 12:00 AM - Posted by Anonymous (not verified)

sorry, i'm ivy bridge i7(K) which i bought for $230 when haswell came out. I skipped the g3258 because i was messing around with rpi's as my side projects. I almost got one though because motherboards + that cpu were $100 at one point in time. Anyways, looks like that AMD 1400 might be a decent gaming cpu better than the g3258, so i retract that second guess.

March 15, 2017 | 08:35 PM - Posted by Mr.Gold (not verified)

Cool find...
I just got a R7 1700 box running, will try this to confirm.
But then again, I overclocked it. It might still be valid though, as I just raised the multiplier, not the voltage.
So it should make it lock up even quicker?

March 15, 2017 | 09:19 PM - Posted by Anonymous (not verified)

"AGESA

AMD Generic Encapsulated Software Architecture (AGESA), is a bootstrap protocol by which system devices on AMD64-architecture mainboards are initialized. The AGESA software in the BIOS of such mainboards is responsible for the initialization of the processor cores, memory, and the HyperTransport controller.

AGESA documentation was previously available only to AMD partners that had signed a non-disclosure agreement (NDA). AGESA source code was open-sourced in early 2011 to gain track in coreboot.[1]

AGESA is linked to AMD PowerTune.[2]" (1)

"AGESA"

https://en.wikipedia.org/wiki/AGESA

March 15, 2017 | 09:42 PM - Posted by Anonymous (not verified)

Also that:

"initialization of the processor cores, memory, and the HyperTransport controller."

wording is very interesting, as AMD's Infinity Fabric is supposed to be controlled (ISMU) in the power delivery/control subsystems (SenseMI: Pure Power/Precision Boost) on Ryzen, along with providing the cache coherency functionality over the data fabric (IF).

So just how much tweaking via controller microcode/UEFI firmware does Ryzen allow for all sorts of issues that may be solvable with firmware updates?

March 16, 2017 | 01:17 AM - Posted by Anonymous (not verified)

HWBOT forum
"The issue with Flops was found and fixed in the beginning of february.
The current µcode version dates to 01/27/2017, so the fix is obviously not included yet (due to the time required for validation).
Flops is only affected when the SMT is enabled, so disabling the SMT can be used as a temporary work-around (until the actual fix arrives)."

March 16, 2017 | 01:55 AM - Posted by Anonymous (not verified)

Is there any public errata for Ryzen available, so other such known errors can be found without users having to find it by testing?

March 16, 2017 | 04:50 AM - Posted by Anonymous (not verified)

At least, not in AMD Tech Docs listing.

March 16, 2017 | 04:37 AM - Posted by Anonymous (not verified)

Well, at least it's fixable. Unlike Haswell's TSX bug, where they just had to switch it off and redo it for Broadwell.

March 16, 2017 | 05:02 AM - Posted by Klimax (not verified)

Small difference: FMA3 is old technology (SIMD) and a relatively easy problem. TSX was new and targeted one of the hardest problems.

A bit incomparable. The author made the better call by including FDIV.

March 16, 2017 | 11:47 AM - Posted by Anonymous (not verified)

TSX was very much like the 16-bit-only Intel 386.
Even in its current state, the AMD FMA3 bug is way less serious than Intel's FDIV.

March 21, 2017 | 12:05 AM - Posted by Anonymous (not verified)

^What he said.

March 17, 2017 | 05:26 AM - Posted by Anonymous (not verified)

Haswell binary = force 256bit FMA = 256FMA>128x2>256FMA
