Is it better to freeze or give bad results? Skylake and complex math need to learn to get along

Subject: General Tech | January 11, 2016 - 03:00 PM |
Tagged: Intel, Skylake, bug

You may remember the infamous Pentium FDIV bug, which could return incorrect digits in the results of floating point division and caused much consternation among scientists in the mid-1990s.  Now there is a new bug to remember, found on Skylake processors, which can cause the processor to freeze during complex calculations such as those you would run in Prime95, the client used to contribute to the Great Internet Mersenne Prime Search project.  The issue has been replicated on both Windows and Linux systems and on different motherboards, which indicates that it does indeed come from the CPU.  While a freeze is certainly better than an incorrect result, it is still inconvenient, and we hope that Intel's fix, delivered as a BIOS update, will arrive soon.  You can follow the detection and investigation of the bug, and what is being done about it, over at The Register.

"The good news is that the bug's triggered by complex workloads. It was turned up by prime number experts the Great Internet Mersenne Prime Search (GIMPS), who use Intel machines to identify and test new large prime numbers."

Source: The Register


January 11, 2016 | 03:20 PM - Posted by kenjo

I really want to know what the impact of the fix is.

The fix for FDIV was to turn off the floating point unit, which has a huge impact on performance.

The fix for the TSX problem was to disable TSX, but since almost no code used it, that was not a big issue.

What is the fix this time?

January 11, 2016 | 04:47 PM - Posted by Jeremy Hellstrom

Still waiting on the official word.

January 11, 2016 | 05:02 PM - Posted by willmore (not verified)

Two things:

1) the code being run is older AVX (introduced in Sandy Bridge) and not the new FMA3 code. So, if the fix slows down AVX, it shouldn't affect FMA3--or at least one hopes.

2) we still don't know quite what is happening.

The code that fails does so because it has explicit checks in it for possible hardware flaws. This is important because the code can run for *months* to perform one calculation. Any little slip up in the middle can waste all of that effort, so there are periodic checks to make sure nothing went wrong part way through--and if they did, the code can roll back to a last known good state and start back up.
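To make the checkpoint-and-rollback idea concrete, here is a minimal Python sketch. It is not Prime95's code. It uses the Lucas-Lehmer test (the standard primality test for Mersenne numbers) as the long-running calculation, and the function name, the `checkpoint_every` and `fault_at` parameters, and the simulated bit flip are all assumptions made for illustration. The consistency check here simply recomputes each step a second way, which stands in for whatever checks the real program performs.

```python
def lucas_lehmer_checked(p, checkpoint_every=4, fault_at=None):
    """Lucas-Lehmer test for 2**p - 1 (p an odd prime) with rollback.

    Returns (is_prime, rollbacks). `fault_at` is a hypothetical set of
    iteration numbers at which one simulated hardware error is injected.
    """
    m = (1 << p) - 1
    faults = set(fault_at or ())
    i, s = 0, 4
    checkpoint = (i, s)          # last known good state
    rollbacks = 0
    while i < p - 2:
        nxt = (s * s - 2) % m
        if i in faults:          # simulate a one-off hardware slip
            faults.discard(i)
            nxt ^= 1
        # Periodic sanity check: recompute the step by another route.
        expected = (pow(s, 2, m) - 2) % m
        if nxt != expected:
            i, s = checkpoint    # roll back and start again from there
            rollbacks += 1
            continue
        i, s = i + 1, nxt
        if i % checkpoint_every == 0:
            checkpoint = (i, s)  # save a new known good state
    return s == 0, rollbacks
```

Calling `lucas_lehmer_checked(13, fault_at={5})` injects one error at iteration 5, rolls back to the checkpoint taken at iteration 4, and still reaches the correct answer without repeating the whole run.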

There may be code out there in the wild that is affected by this bug and we just don't know about it because it doesn't have similar checks in it. I hope Intel is forthcoming about what the problem was after they issue the microcode fix--actually, the fix is supposed to be in OEM hands already.

January 11, 2016 | 05:17 PM - Posted by Anonymous (not verified)

It says that the machine freezes, not that it fails a check. Are you saying that it is freezing because it is running checks? I would be interested to know what the issue is. The 14 nm process still seems to have issues. The higher clocked parts still seem to be in short supply. I don't know if a hot spot or other heat issue would fit the problem.

January 11, 2016 | 05:58 PM - Posted by willmore (not verified)

The article is incorrect and is written by someone unfamiliar with the problem. The problem was first reported at the Mersenne forum (www.mersenneforum.org). I've been a member there since 2008 when we moved off the mailing list and to the forum.

To clarify, there is no *crashing* nor *freezing*. The program throws an error when it detects a problem and stops working on the selftest. This is what it's supposed to do. I guess that doesn't make as exciting of a headline. :)

Overclockers were the first to detect the issue, but they tried the normal route to solving overclocking issues--increasing voltages, backing off clocks, etc. They even underclocked the chips and the problem persisted. It's a logic problem on the chip, not likely something to do with the lithography.

They also ruled out thermal issues--they were running on water at 50C under full load and still had failures.

January 11, 2016 | 07:29 PM - Posted by Jeremy Hellstrom

Thanks for the extra info .. honestly I don't play with Mersenne so I am not as familiar with the issue as I could be.

That said, from the sounds of it we are still not sure of the exact cause.

January 11, 2016 | 08:40 PM - Posted by willmore (not verified)

You're welcome. I did a longer post over at Tech Report if you want more info on it. You're right in that we don't yet know exactly what goes wrong--just sort of how to make it do so.

I'm UID 20 over at the MersenneForums, so I do have a pretty good understanding of what's going on at least with Prime95.

I've never put a link in to a post here before, so bear with me...

http://techreport.com/news/29585/prime95-can-cause-intel-skylake-cpus-to...

January 12, 2016 | 01:30 PM - Posted by Jeremy Hellstrom

No worries, we like informative links.

January 11, 2016 | 08:39 PM - Posted by Anonymous (not verified)

Thanks for the info. It doesn't sound like they have narrowed it down to a specific instruction or sequence of instructions yet. I wouldn't expect a single instruction error to have made it through testing. Perhaps it comes out in a unique instruction sequence. I don't know the Mersenne prime algorithm, but I would guess it involves arbitrary precision arithmetic. I can look it up later. It could be a small number of people that are affected, as this is probably even more specialized than the FDIV bug. Hopefully encryption code isn't affected; I would think that would have been noticed already if that was the case. It would be better if it did crash instead of producing incorrect results. With very long running code that supports check points, they sometimes run the same code twice, on different processors, to ensure the accuracy of the results. Depending on the algorithm, there are other checks that can be done. Although, did it fail the check, or did the checking code itself throw an exception or something?

January 11, 2016 | 08:51 PM - Posted by willmore (not verified)

I wrote more over at TR, check the link above for that.

To answer your question and your inferences, yes, Mersenne testing uses arbitrary precision arithmetic implemented as a special form of FFT. The FFT from Prime95 is used in other projects, so if that is where the mistake comes in then there are a good number of other programs that might have similar bugs.
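For readers who have not seen arithmetic done with an FFT, here is a rough Python sketch of the general idea. It is a textbook radix-2 FFT multiplying two non-negative integers digit by digit, not the specialized, hand-tuned transform Prime95 uses. The function names and the 0.25 round-off threshold are assumptions chosen for illustration. The round-off check shows the kind of sanity test that floating point FFT arithmetic allows: every output coefficient should land very close to a whole number, and one that does not signals that something went wrong.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def to_digits(n, base):
    """Little-endian digits of a non-negative integer."""
    digits = []
    while n:
        n, d = divmod(n, base)
        digits.append(d)
    return digits or [0]


def fft_multiply(x, y, base=10):
    """Multiply non-negative ints via FFT; returns (product, max_roundoff)."""
    xd, yd = to_digits(x, base), to_digits(y, base)
    size = 1
    while size < len(xd) + len(yd):
        size *= 2
    fx = fft(xd + [0] * (size - len(xd)))
    fy = fft(yd + [0] * (size - len(yd)))
    raw = fft([p * q for p, q in zip(fx, fy)], invert=True)
    coeffs = [v.real / size for v in raw]
    # Each coefficient should be almost exactly an integer.
    max_roundoff = max(abs(c - round(c)) for c in coeffs)
    if max_roundoff > 0.25:  # assumed threshold, for illustration only
        raise ArithmeticError("round-off check failed")
    product = sum(round(c) * base**i for i, c in enumerate(coeffs))
    return product, max_roundoff
```

The product comes back as an exact Python integer, so it can be compared directly against `x * y` to confirm the sketch behaves.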

It is extremely unlikely that this will affect any form of encryption that I am familiar with. Even 4K RSA math is tiny compared to the numbers in Mersenne calculations.

The Great Internet Mersenne Prime Search (GIMPS) as a whole does use double checks run on different machines. When a potential Mersenne Prime is found a double check is run on a different machine and on a third machine of a completely different processor architecture using different code. That's the computational version of "take off and nuke it from orbit" for any potential math or processor bugs. I'm familiar with this process because I did that check for the 38th discovered Mersenne Prime.
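As a toy illustration of double checking with different code, the Python sketch below computes the final Lucas-Lehmer residue for 2^p - 1 in two independent ways and compares them. Both functions run in one process here, so this only stands in for the real arrangement of separate machines and architectures. The function names are hypothetical and none of this is GIMPS code.

```python
def ll_residue_plain(p):
    """Final Lucas-Lehmer residue using ordinary modular reduction."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s


def ll_residue_folded(p):
    """Same residue, reduced with shifts and masks instead of `%`.

    Relies on 2**p being congruent to 1 modulo 2**p - 1, so the high
    bits can be folded back onto the low bits.
    """
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = s * s - 2
        if s < 0:
            s += m
        while s > m:
            s = (s & m) + (s >> p)
        if s == m:
            s = 0
    return s


def double_check(p):
    """Return (results_agree, is_prime) for the Mersenne number 2**p - 1."""
    first = ll_residue_plain(p)
    second = ll_residue_folded(p)
    return first == second, first == 0
```

If the two residues ever disagreed, at least one run would be known to be bad, which is the point of repeating the work with code that shares as little as possible with the first run.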

As a single test, Prime95 checkpoints and tests intermediate results just like you suggest a long running program should.

In the case of this error, Prime95 detected the error and displayed an error message. If this had been an actual real test, it would have rolled back and retried a few different times before failing the test. Because it was just a stress test, it displayed the error and stopped the self test immediately; the point of the self test is to detect errors with a machine, not to actually test for primality.
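The difference between the two modes can be sketched as a small retry policy in Python. Everything here is hypothetical: the function name, the `max_retries` value, and the use of `ArithmeticError` to signal a failed check are assumptions, and the sketch does not reflect how many retries the real program allows.

```python
def run_with_policy(step, max_retries=3, stress_test=False):
    """Run `step`, which raises ArithmeticError when a check fails.

    In a real test, retry up to `max_retries` times before giving up.
    In a stress test, report the first failure immediately, since the
    goal is to find hardware errors rather than finish the calculation.
    Returns (result, attempts) on success.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return step(), attempts
        except ArithmeticError:
            if stress_test or attempts > max_retries:
                raise
```

With `stress_test=True` the first failed check propagates straight to the caller, which matches the behavior described for the self test.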

January 11, 2016 | 09:05 PM - Posted by Anonymous (not verified)

I don't know if encryption code will even hit the same units in a modern processor, due to encryption-specific instructions. It is interesting that hyper-threading seems to be necessary to trigger the bug. For consumer applications, I haven't really seen that much of a need for it, so my recommendation has mostly been to just go with an i5.

January 11, 2016 | 05:15 PM - Posted by Anonymous (not verified)

Let me guess... the end of over-clocking non K Skylake CPUs?

January 11, 2016 | 08:43 PM - Posted by Anonymous (not verified)

It is probably something that is unlikely to occur in most consumer code.

January 11, 2016 | 08:52 PM - Posted by willmore (not verified)

I may have read my own thoughts into the OP's question, but I think they meant "Will Intel take the opportunity of a mandatory microcode patch to stick in something that combats non-K model overclocking?"

That seems like a reasonable fear.

January 11, 2016 | 09:03 PM - Posted by anon (not verified)

Intel Buglake

January 11, 2016 | 10:16 PM - Posted by Anonymous (not verified)

It is AMD's fault. Intel is the most reliable company in the world, and Intel fanboys always trust Intel and NVidia stuff.

I do not care because I do not have money to buy the new stuff.

So no big deal as far as I am concerned.

January 12, 2016 | 06:29 AM - Posted by Master Chen (not verified)

That is one beautiful bug...

January 12, 2016 | 01:28 PM - Posted by Jeremy Hellstrom

I was trying to find one in a vacuum tube :(

January 13, 2016 | 05:16 PM - Posted by Anonymous (not verified)

How about vacuum tubes that ARE bugs?

https://c2.staticflickr.com/4/3805/9278710144_9c6979daaf_b.jpg

January 13, 2016 | 06:06 PM - Posted by Jeremy Hellstrom

nice

January 12, 2016 | 09:50 PM - Posted by Yorgos (not verified)

"complex workloads"
I am sorry, but there are no complex workloads.
You have either integer, float or bitwise workloads.
Complex = 2 floats.
Who's to say it's not their FPUs that have the problem?

I'd like to see some server Skylake mobos do firmware updates!!

January 30, 2016 | 07:18 AM - Posted by xnor

Just don't run complex workloads on servers, LOL!
