An Inside Look at Intel's NVM Solutions Group - SSD Validation Testing

Subject: Editorial, Storage
Manufacturer: Intel

Accelerated Testing

Controller:

With all of the development and design stuff out of the way, let's get into the good stuff. The first point to bring up is that feedback is inherent in this process. The early validation steps may very well need to 'reach back' into the design process in order to find solutions. Here's a good example:

Above is a bit error caused by a 'cosmic ray'. No, I'm not kidding. The earth is constantly bombarded by cosmic rays, and their collisions with the upper atmosphere produce showers of secondary particles, many of them neutrons. Most of those neutrons are absorbed by the atmosphere, but we still see some of them down here on the ground. We also make our own radiation right here on earth, and it doesn't just come from nuclear reactors: simply eating a banana makes your body slightly more radioactive, thanks to the potassium-40 it contains (which emits beta and gamma radiation rather than neutrons). While neutrons carry no electrical charge themselves, they 'excite' atoms they happen to run into along their path. Excited atoms don't like staying that way for long, so they quickly decay back down to a stable state. Part of that decay is in the form of an electron, which affects the charge of the surrounding atoms.

If this event happens in a flash cell, the ECC mechanisms correct the error. The same ECC correction happens if a neutron causes a bit to flip in the SRAM. Despite this, there are still places within an SSD where flipping a random bit of data can cause issues. Most of these are within the controller itself, where a flipped bit can potentially cause data to be misrouted, or not routed at all while the controller still reports to the host that it has been written (i.e. the data is lost). In the worst cases, the controller might not be able to continue executing its firmware, resulting in a soft reboot or even bricking of the device.
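To make the ECC idea concrete, here is a minimal sketch of single-bit error correction using a Hamming(7,4) code. This is an illustration only: the article does not say which codes Intel uses, and production SSDs rely on far stronger schemes (BCH or LDPC codes over much larger blocks). The function names are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected_codeword, error_position); position 0 means no error."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # syndrome spells out the flipped position
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return c, position


def hamming74_decode(codeword):
    """Extract the 4 data bits from a (corrected) codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


# Simulate a particle strike flipping one stored bit.
stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # bit at position 5 flips
fixed, where = hamming74_correct(stored)
print("error at position", where, "-> data", hamming74_decode(fixed))
```

A code like this corrects any single flipped bit per codeword, which is why an isolated upset in flash or SRAM is harmless as long as the data is ECC-protected.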

These cosmic ray events don't happen very often (we're talking billionths of a percent chance spread across thousands of devices), but they remain a possibility and do play into the design of the controller as a whole. Controllers tend to stick with the 'larger' lithography process nodes, so that the charge from a cosmic ray event has less of an effect on the overall voltage present at a given location. Extra checks are also added to the firmware as a means of catching incorrect operations caused by flipped bits.
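The article does not describe what those firmware checks look like. One common defensive pattern, shown here purely as an assumed example, is to store a critical value alongside its bitwise complement and verify the pair before every use. The class name `GuardedWord` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


class GuardedWord:
    """A 32-bit value stored next to its bitwise complement.

    A single flipped bit in either copy breaks the complement
    relationship, so the corruption is detected before the value is used.
    """

    def __init__(self, value):
        self.store(value)

    def store(self, value):
        self.value = value & MASK32
        self.check = ~value & MASK32

    def load(self):
        # value XOR complement must be all ones if nothing was disturbed
        if (self.value ^ self.check) != MASK32:
            raise RuntimeError("bit flip detected in guarded word")
        return self.value


block_address = GuardedWord(0x0001F400)
print(hex(block_address.load()))  # normal read passes the check

block_address.value ^= 1 << 7  # simulate an upset in the stored value
try:
    block_address.load()
except RuntimeError as err:
    print("caught:", err)
```

This only detects the flip; what to do next (retry, reload from a safe copy, reset) is a separate design decision.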

A failed SSD being analyzed on a test bench at Intel.

Now with all of these corrections in place, and with the chances of a neutron flipping a bit so low, we can't exactly put hundreds of thousands of unreleased SSDs out in an open field in the hopes of seeing failures happen. The process needs to be accelerated. How, you might ask?

Just use an accelerator! Yes, Intel actually sends their prototype SSD controllers (among other things) out to Los Alamos to be bombarded with a neutron beam many orders of magnitude more intense than anything they will see in normal use. These are literally tests to the point of failure. Intel then goes back in to see what failed, how, and why. The results are fed back into the design loop, and the process is repeated if necessary after firmware (or even hardware) corrections have been made.
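The arithmetic behind this kind of acceleration is simple if you assume the upset rate scales linearly with neutron flux. That is an assumption of this sketch, not something stated in the article, and the flux ratio used below is a placeholder rather than a measured figure.

```python
HOURS_PER_YEAR = 24 * 365.25


def equivalent_field_hours(beam_hours, beam_flux, field_flux):
    """Device-hours of natural exposure represented by a beam run.

    Assumes the error rate is directly proportional to flux, so only
    the ratio of the two flux values matters (any consistent unit works).
    """
    if field_flux <= 0:
        raise ValueError("field_flux must be positive")
    return beam_hours * (beam_flux / field_flux)


# Placeholder numbers: a beam one million times more intense than background.
hours = equivalent_field_hours(beam_hours=1.0, beam_flux=1e6, field_flux=1.0)
print(f"{hours:.0f} device-hours, about {hours / HOURS_PER_YEAR:.0f} device-years")
```

Under those assumed numbers, a single hour in the beam stands in for roughly a century of exposure for one device, which is what makes rare events observable on a test schedule.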

Flash:

Accelerating the testing of flash memory in a modern SSD is a tricky proposition. Thanks to advanced wear leveling techniques, writing via the normal method, at full speed, can take months or even years before flash blocks begin to wear to the point of noticeable failure. Tricks implemented from the outside really don't work. 'Short stroking' the SSD by writing to a smaller range of (external) LBA sectors does nothing, as the wear leveling algorithm will still spread those writes across the entire flash area (this is also why SSD random write performance improves with greater over-provisioning - there is more 'empty' flash to work with). Given the above, accelerating the wearout testing of flash requires a bit of a firmware tweak:
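A toy simulation shows why short stroking from the host side does not concentrate wear. The model below is a deliberately simplified flash translation layer, not Intel's algorithm: each write goes to the least-worn free block, and the block that previously held that LBA is erased and returned to the free pool. All names and sizes are invented for the example.

```python
def simulate_writes(num_physical_blocks, lba_range, num_writes):
    """Return per-block erase counts after rewriting a small LBA range."""
    erase_counts = [0] * num_physical_blocks
    free_blocks = set(range(num_physical_blocks))
    mapping = {}  # logical block -> physical block currently holding it

    for i in range(num_writes):
        lba = i % lba_range  # host only ever touches this narrow range
        # wear leveling: pick the free block with the fewest erases
        target = min(free_blocks, key=lambda b: (erase_counts[b], b))
        free_blocks.remove(target)
        old = mapping.get(lba)
        if old is not None:
            erase_counts[old] += 1  # stale copy is erased and recycled
            free_blocks.add(old)
        mapping[lba] = target
    return erase_counts


counts = simulate_writes(num_physical_blocks=64, lba_range=4, num_writes=6400)
print("min erases:", min(counts), "max erases:", max(counts))
```

Even though the host writes to only 4 logical blocks, every one of the 64 physical blocks ends up with nearly the same erase count, so no block wears out any sooner.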

Now remember, we're trying to test the entire production unit here for any possible failures, not just flash failures. To accomplish this, Intel makes the smallest possible modification to the firmware, instructing it to address only a portion of each flash die within the SSD. All data channels are still used and all flash dies are still accessed, but the addressable area of each is reduced to a fraction of the full surface. The diagram above depicts using the area at the edges of the dies, because this is where failures are more likely to occur (due to handling and packaging). This effectively gives the SSD a much smaller capacity, which means that writing at the same speed translates to increased wear on those focused areas. This is the same 'short stroking' mentioned above, but since it occurs at the die level, wear leveling is restricted to the same smaller area, and those smaller sections of flash can then be tested to failure within a reasonable amount of time (6 weeks in the example above).
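A back-of-the-envelope calculation shows why shrinking the addressable area shortens the test. The capacity, endurance rating and write speed below are hypothetical round numbers chosen for illustration; they are not Intel's figures and are not meant to reproduce the 6 week example.

```python
SECONDS_PER_DAY = 86400


def days_to_wear_out(addressable_bytes, rated_pe_cycles,
                     host_write_bytes_per_sec, write_amplification=1.0):
    """Estimate days of continuous writing needed to reach the rated cycles.

    Assumes perfectly even wear leveling across the addressable area.
    """
    total_nand_bytes = addressable_bytes * rated_pe_cycles
    nand_bytes_per_sec = host_write_bytes_per_sec * write_amplification
    return total_nand_bytes / nand_bytes_per_sec / SECONDS_PER_DAY


# Hypothetical drive: 240 GB of flash, 3000 P/E cycles, 100 MB/s sustained.
full = days_to_wear_out(240e9, 3000, 100e6)
# Same drive with firmware restricted to one tenth of each die.
restricted = days_to_wear_out(24e9, 3000, 100e6)
print(f"full area: {full:.1f} days, restricted area: {restricted:.1f} days")
```

Because the write speed is unchanged, the time to wear out scales directly with the fraction of flash the firmware is allowed to address.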

March 21, 2014 | 08:16 AM - Posted by collie (not verified)

great article, super interesting read. However one thing is irking me. The line is "The proof is in the EATING of the pudding" I know this is a stupid thing to bitch about but that one just gets on my tits, y'know?

March 21, 2014 | 12:17 PM - Posted by Allyn Malventano

Yes, but in the case of banana pudding, it makes you temporarily radioactive :)

March 21, 2014 | 12:03 PM - Posted by truk007

Wow! Impressive article. I love behind the scenes articles such as these. So much is involved in quality products.

March 21, 2014 | 02:12 PM - Posted by castlefox (not verified)

Allyn Malventano,

Great article, but I am wondering.... How much would that wafer be roughly worth if that stiff arm worked?

March 23, 2014 | 12:30 PM - Posted by Allyn Malventano

All depends on the yield for that particular wafer. A bad run and it wouldn't be worth much at all. Also, to know the true result of my 'stiff arm' success, check out the podcast some time :)

March 22, 2014 | 05:24 AM - Posted by Anonymous (not verified)

Hi

The "X25-M performance degradation bug discovered and reported by PC Perspective" was not discovered by PC Perspective ;-)

February 13, 2009

http://www.pcworld.fr/stockage/tests,ssd-intel-x25-m-80-go-une-bombe-pro...

<= September 8, 2008

http://www.hardware.fr/articles/731-6/supertalent-intel-performances-var...

<= September 8, 2009

In fact it doesn't take 3 month to fix this problem but ... 8 month ! ;-)

March 23, 2014 | 12:28 PM - Posted by Allyn Malventano

Interesting. We hadn't seen that piece, and neither had Intel apparently, as in the communications about their first firmware update, they credited us with its discovery. 

April 5, 2014 | 07:03 PM - Posted by Jimmy Jim (not verified)

Excellent article Allyn!
