An Inside Look at Intel's NVM Solutions Group - SSD Validation Testing

Subject: Editorial, Storage
Manufacturer: Intel

More on Design

Diving a bit deeper into how Intel's design contributes to the reliability of its SSDs, we'll look at a few additional factors, starting with the reliability of the data stored on and retrieved from the NAND dies themselves:

In addition to the standard quality control methods in place for NAND production, Intel has also designed in mechanisms that assist in graceful recovery from errors when they do occur. All storage methods have error rates within the raw stored data, but the trick is using error correction effectively to ensure the data is correctly reassembled while being read. The additional methods Intel builds into its higher-end devices give a better chance of a successful read when there are too many errors present on the flash to correct with the on-die ECC data. These methods include adjusting the threshold voltages for the various logic states saved into the flash cells and re-attempting the read. This is the flash memory equivalent of adjusting tape speed on a data cassette, head fly height on a hard disk, or laser power and focus on an optical drive. It provides the ability to compensate for things like flash cell charges slowly drifting over time - an effect that is further amplified as the cell oxide insulation is degraded by repeated erasures (i.e. writes). Here's a visual:

As you might imagine, if the voltages across a chunk of flash have shifted over time, and you can't compensate to correct for it, you'll end up with potentially uncorrectable sections of data. Shifting the thresholds on-the-fly gives you a much better chance of reading the data. This is in essence a form of automatic forensic data recovery taking place within the SSD at the time of a read error. Intel's technology can even do successive retries with varying threshold levels as a means of recovering data that might have been lost without such technology at play. Here's a taste of the end result, as applied to a chart using data from the High Endurance Technology (HET) flash used starting with the SSD 710:

As you can see above, this compensation not only extends the usable lifetime of the flash, it also gives you a large margin earlier in life, which translates to greater reliability and retention of the stored data.
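To make the retry idea concrete, here is a minimal sketch of a read that steps the sensing threshold downward until the error count falls within what ECC could fix. It is an illustration under stated assumptions, not Intel's firmware: the single-level-cell model, the millivolt values, the offsets, and the names `read_bits` and `read_with_retry` are all hypothetical, and comparing against the known `expected` bits merely stands in for a real ECC decoder reporting success or failure.

```python
from typing import List, Optional, Sequence, Tuple


def read_bits(voltages_mv: Sequence[int], threshold_mv: int) -> List[int]:
    """Simplified SLC sensing: a cell charged above the threshold reads 0, otherwise 1."""
    return [0 if v > threshold_mv else 1 for v in voltages_mv]


def count_errors(read: Sequence[int], expected: Sequence[int]) -> int:
    """Number of bit positions that differ (stand-in for an ECC decoder's error count)."""
    return sum(r != e for r, e in zip(read, expected))


def read_with_retry(
    voltages_mv: Sequence[int],
    expected: Sequence[int],
    nominal_mv: int = 2000,                       # assumed nominal threshold
    offsets_mv: Sequence[int] = (0, -200, -400, -600),  # assumed retry steps
    ecc_limit: int = 1,                           # assumed correctable bits per read
) -> Optional[Tuple[int, int]]:
    """Return (offset, errors) for the first attempt ECC could correct, else None."""
    for offset in offsets_mv:
        bits = read_bits(voltages_mv, nominal_mv + offset)
        errors = count_errors(bits, expected)
        if errors <= ecc_limit:
            return offset, errors
    return None  # every retry still had too many errors


# Programmed cells (expected 0) have leaked charge and drifted toward the threshold.
expected = [0, 1, 0, 0, 1, 0, 1, 0]
drifted_mv = [1900, 1000, 1700, 2400, 900, 1800, 1100, 1650]
result = read_with_retry(drifted_mv, expected)
print(result)
```

With these made-up numbers the nominal read has four bit errors and the first retry still has three, but shifting the threshold down by 400 mV recovers the data cleanly.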

Now that we know how data is more reliably stored on and read from the NAND dies themselves, let's touch on the rest of the path:

As you can see above, there are more links in the chain from host to flash. Data is protected by various parity, encoding, and CRC schemes along the way. SATA uses CRC, along with 8b/10b encoding at the physical layer, to guard against undetected data corruption. Once inside the SSD, ECC-protected buffers ensure data in flight remains protected until it reaches the flash. It's not clear in the image, but the buswork between the nested protection blocks is itself protected by additional parity lines (i.e. an 8-bit data bus would actually have >8 physical lines). These principles apply to most SSDs at some level, with the details varying by manufacturer and controller design.
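As a rough illustration of the two ideas in that paragraph, the sketch below adds one even-parity line to an 8-bit bus (9 lines total) and appends a CRC to a payload. The function names are hypothetical, a single parity bit is just the simplest case of ">8 physical lines", and Python's `zlib.crc32` stands in for a link-layer CRC; it is not the exact SATA computation.

```python
import zlib


def parity_bit(byte: int) -> int:
    """Even parity: 1 when the data byte has an odd number of set bits."""
    return bin(byte & 0xFF).count("1") % 2


def to_bus_word(byte: int) -> int:
    """Pack 8 data bits plus 1 parity bit into a 9-bit word (9 physical lines)."""
    return ((byte & 0xFF) << 1) | parity_bit(byte)


def bus_word_ok(word: int) -> bool:
    """A valid word always has an even number of set bits across all 9 lines."""
    return bin(word & 0x1FF).count("1") % 2 == 0


def frame_with_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so the receiver can detect corruption in transit."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def frame_ok(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the received one."""
    payload, received = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == received


word = to_bus_word(0b10110010)
flipped_word = word ^ 0b100            # one line glitches in flight

frame = frame_with_crc(b"sector data")
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]   # one payload bit flips

print(bus_word_ok(word), bus_word_ok(flipped_word))
print(frame_ok(frame), frame_ok(corrupted))
```

Note that parity and CRC only detect a problem; correcting it takes a retransmission or an ECC scheme like the one protecting the buffers and the flash.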

With the NAND storage and data transfer conduits covered, let's zoom out a bit further to the things that keep the SSD's logical storage available over time:

You may already know that hard disk drives reserve spare sectors, which are swapped in seamlessly as the drive detects sectors that are no longer reliable. These sectors are invisible to the host system, and remain so even as they are put into use. The HDD simply maintains a map of bad sectors and references the associated spare when one is addressed. The alternative would be a volume that effectively 'shrinks' as blocks fail, which actually used to happen years ago. 'Bad blocks' used to be discovered during a format (or later during use) and mapped out at the partition level, never to be used again. That is no longer an acceptable practice with modern storage devices.
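The remapping described above can be sketched as a small lookup table. This is a toy model rather than any drive's actual firmware: the class name `RemappingDisk`, its methods, and the layout of spares placed right after the visible sectors are assumptions made for illustration.

```python
class RemappingDisk:
    """Toy model of a drive that hides failed sectors behind a spare pool."""

    def __init__(self, visible_sectors: int, spare_sectors: int) -> None:
        self.visible_sectors = visible_sectors
        # Assumed layout: spares sit just past the host-visible range.
        self._spares = list(range(visible_sectors, visible_sectors + spare_sectors))
        self._remap = {}  # logical sector -> physical spare sector

    def _check(self, logical: int) -> None:
        if not 0 <= logical < self.visible_sectors:
            raise IndexError("sector outside the host-visible range")

    def capacity(self) -> int:
        """What the host sees; never changes as sectors are retired."""
        return self.visible_sectors

    def spares_left(self) -> int:
        return len(self._spares)

    def retire(self, logical: int) -> None:
        """Mark a sector unreliable and point it at the next free spare."""
        self._check(logical)
        if not self._spares:
            raise RuntimeError("spare pool exhausted")
        self._remap[logical] = self._spares.pop(0)

    def physical(self, logical: int) -> int:
        """Translate a host address, following the remap table if present."""
        self._check(logical)
        return self._remap.get(logical, logical)


disk = RemappingDisk(visible_sectors=100, spare_sectors=4)
disk.retire(7)
print(disk.physical(7), disk.physical(8), disk.capacity(), disk.spares_left())
```

The host keeps addressing sector 7 as before; only the drive knows that the data now lives in a spare, and the reported capacity stays at 100 sectors.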

SSDs therefore do the same thing as hard drives, artificially reducing the available flash to create a 'spare area' as a reserve, to be used only as the controller decommissions blocks of flash that have exceeded a bit-error threshold (i.e. are considered bad). Intel calls this the 'spare tire' as seen in the above slide. This is also the reason you see the difference in stated capacities of various SSDs. In the above example, 256GB of flash is present, but 16GB of it is reserved for other purposes, 8GB of that being spare area. These ratios vary from capacity to capacity (e.g. the 480GB model can likely support a larger portion of that reserved area assigned to spare blocks). The availability of these spare blocks means that the SSD will continue to be presented to the OS as a constant size.
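The arithmetic behind that example is simple enough to check in a few lines. The helper below is a hypothetical sketch (the function name and dictionary keys are made up for illustration) that takes the figures from the slide and derives the host-visible capacity and the reserved percentages.

```python
def capacity_breakdown(raw_gb: float, reserved_gb: float, spare_gb: float) -> dict:
    """Split raw flash into host-visible, spare, and other reserved space."""
    if not 0 <= spare_gb <= reserved_gb <= raw_gb:
        raise ValueError("expected 0 <= spare <= reserved <= raw")
    return {
        "user_gb": raw_gb - reserved_gb,               # what the OS is shown
        "other_reserved_gb": reserved_gb - spare_gb,   # reserved, but not spare blocks
        "reserved_pct": 100 * reserved_gb / raw_gb,
        "spare_pct": 100 * spare_gb / raw_gb,
    }


# Figures from the example above: 256GB raw, 16GB reserved, 8GB of that spare.
breakdown = capacity_breakdown(raw_gb=256, reserved_gb=16, spare_gb=8)
print(breakdown)
```

For the 256GB example this works out to 240GB presented to the host, with 6.25% of the raw flash reserved and half of that reserve (3.125% of the raw flash) held as spare blocks.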

March 21, 2014 | 05:16 AM - Posted by collie (not verified)

great article, super interesting read. However one this is irking me. The line is "The proof is in the EATING of the pudding" I know this is a stupid thing to bitch about but that one just gets on my tits, y'know?

March 21, 2014 | 09:17 AM - Posted by Allyn Malventano

Yes, but in the case of banana pudding, it makes you temporarily radioactive :)

March 21, 2014 | 09:03 AM - Posted by truk007

Wow! Impressive article. I love behind the scenes articles such as these. So much is involved in quality products.

March 21, 2014 | 11:12 AM - Posted by castlefox (not verified)

Allyn Malventano,

Great article, but I am wondering.... How much would that wafer be roughly worth if that stiff arm worked?

March 23, 2014 | 09:30 AM - Posted by Allyn Malventano

All depends on the yield for that particular wafer. A bad run and it wouldn't be worth much at all. Also, to know the true result of my 'stuff arm' success, check out the podcast some time :)

March 22, 2014 | 02:24 AM - Posted by Anonymous (not verified)

Hi

The "X25-M performance degradation bug discovered and reported by PC Perspective" was not discovered by PC Perspective ;-)

February 13, 2009

http://www.pcworld.fr/stockage/tests,ssd-intel-x25-m-80-go-une-bombe-pro...

<= September 8, 2008

http://www.hardware.fr/articles/731-6/supertalent-intel-performances-var...

<= September 8, 2009

In fact it doesn't take 3 month to fix this problem but ... 8 month ! ;-)

March 23, 2014 | 09:28 AM - Posted by Allyn Malventano

Interesting. We hadn't seen that piece, and neither had Intel apparently, as in the communications about their first firmware update, they credited us with its discovery. 

April 5, 2014 | 04:03 PM - Posted by Jimmy Jim (not verified)

Excellent article Allyn!
