NAND Flash Memory - A Future Not So Bleak After All
A paper titled “The Bleak Future of NAND Flash Memory” was recently published jointly by the University of California and Microsoft Research. It has been picked up by many media outlets, all of which seem to be beating the same morbid drum, spinning tales of a seemingly apocalyptic end to the reign of flash-based storage devices. While I agree with some of what the authors have to say, I have reservations about the methods upon which the paper is based.
TLC and beyond?
The paper kicks off by declaring steep increases in latency and drops in lifetime associated with increases in bits-per-cell. While this is true, flash memory manufacturers are not making large pushes to increase bits-per-cell beyond the standard MLC (2 bits per cell) tech. Sure, some have dabbled in 3-bit MLC, also called Triple Level Cell (TLC), which is a bit of a misnomer since storing three bits in a cell actually requires eight voltage level bands, not three as the name implies. Moving from SLC to MLC doubles density, but the returns diminish sharply after that – MLC to TLC only increases capacity by another 1.5x, yet sees a 2-4x reduction in performance and endurance. In light of this, there is little demand for TLC flash, and where there is, it’s clear from the use cases that it is not meant for anything beyond light usage. There's nothing wrong with the paper going down this road, but the reality is that increasing bits per cell is not the envelope being pushed by the flash memory industry.
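To see why “Triple Level Cell” undersells the difficulty: the number of voltage levels a cell must distinguish grows exponentially with bits per cell, while capacity grows only linearly. A quick, purely illustrative sketch of the arithmetic:

```python
# Levels-per-cell vs. density: storing n bits in one cell requires
# 2**n distinguishable voltage bands, which is why "Triple Level Cell"
# is a misnomer -- 3 bits need 8 levels, not 3.

def voltage_levels(bits_per_cell: int) -> int:
    """Distinct voltage bands needed to store n bits in a single cell."""
    return 2 ** bits_per_cell

for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3)]:
    print(f"{name}: {bits} bit(s)/cell -> {voltage_levels(bits)} voltage levels")

# Density gain is linear in bits, so returns diminish:
print("SLC -> MLC capacity gain:", 2 / 1)  # 2.0x
print("MLC -> TLC capacity gain:", 3 / 2)  # 1.5x
```

The exponential growth in level count is what drives the latency and endurance penalties: the voltage bands get narrower with each added bit, so programming and sensing must be more precise.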
Wait a second – where is 25nm MLC?
Looking at the above, we see a glaring omission – 25nm MLC flash, which has been around for close to two years now and constitutes the majority of flash memory parts currently shipping. SLC was also omitted, but I can see the reason for this – it’s hard to get your hands on 25nm SLC these days. Why? Because MLC technology has been improved to the point where ‘enterprise MLC’ (eMLC) is rapidly replacing SLC, despite its supposedly lower reliability and endurance. The reasons for this are simple, and they are completely sidestepped or otherwise overlooked by the paper:
- SSD controllers employ write combination and wear leveling techniques.
- Some controllers even compress data on-the-fly to further reduce writes and free up over-provisioned space.
- Controller-level error correction (ECC) has improved dramatically with each process shrink.
- SSD controllers can be programmed to compensate for the drift of data stored in a cell (eMLC).
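The first bullet is worth a sketch. Below is a minimal, purely illustrative model of wear leveling – not any vendor’s actual algorithm, and the block count is made up – showing how steering each write to the least-worn block keeps erase counts even, so no single block burns through its rated cycles early:

```python
# Toy wear-leveling sketch (illustrative only, not a real controller's
# algorithm): every write goes to the block with the fewest erases,
# so wear is spread evenly across the whole flash array.

class WearLeveler:
    def __init__(self, num_blocks: int):
        self.erase_counts = [0] * num_blocks

    def pick_block(self) -> int:
        """Choose the least-worn block for the next write."""
        return min(range(len(self.erase_counts)),
                   key=self.erase_counts.__getitem__)

    def write(self) -> int:
        block = self.pick_block()
        self.erase_counts[block] += 1  # each rewrite costs one erase
        return block

wl = WearLeveler(num_blocks=4)
for _ in range(100):
    wl.write()
print(wl.erase_counts)  # 100 writes spread evenly: [25, 25, 25, 25]
```

A real controller layers remapping tables, write combining, and hot/cold data separation on top of this idea, but the principle is the same: the drive, not the host, decides which physical cells absorb each write.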
The paper continues with a hard push on its ‘TLC is bad’ mantra, even briefly mentioning four-bit-per-cell flash.
Wait, what's a Constant Die Count?!?!
The paper then moves on to place a broad, artificial limit on the methods of testing and evaluating future iterations of flash memory technology. It assumes a prototypical SSD with a ‘Constant Die Count’, presuming that all flash controller architectures moving forward will be forced to use a fixed number of flash memory dies. This is, simply put, an absurd assumption – tantamount to limiting yourself to a single core and then writing a paper about ‘The Bleak Future of CPUs’. To understand what’s going on here, let’s recap the current hurdles faced by the flash memory industry:
- As the process shrinks, program (write) time and latency increase.
- As the process shrinks, write cycles (endurance) decrease.
…and now for the ways around these issues:
- Increasing parallelism (die count) effectively counters latency, especially when Native Command Queuing (NCQ) is properly employed to keep that parallelism effective as workloads increase.
- Increasing parallelism increases throughput (simple math).
- Increasing die count increases the available capacity, which increases endurance given a constant write workload (the writes are distributed over a greater area, therefore each cell is erased less often).
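The endurance arithmetic in that last bullet can be sketched with hypothetical numbers (die count, per-die capacity, P/E rating, and daily write volume below are all made up for illustration): under a constant host workload with good wear leveling, doubling the die count doubles the capacity the writes are spread over, which doubles the drive’s lifetime.

```python
# Back-of-envelope endurance math for the bullets above.
# All figures are hypothetical; assumes perfect wear leveling,
# i.e. writes are spread evenly over the full capacity.

def drive_lifetime_days(dies: int, die_capacity_gb: float,
                        pe_cycles: int, daily_writes_gb: float) -> float:
    """Days until the rated program/erase cycles are exhausted."""
    total_capacity_gb = dies * die_capacity_gb
    total_write_budget_gb = total_capacity_gb * pe_cycles
    return total_write_budget_gb / daily_writes_gb

# 8 GB per die, 3000 P/E cycles, 50 GB of host writes per day:
print(drive_lifetime_days(8, 8, 3000, 50))   # 3840.0 days
print(drive_lifetime_days(16, 8, 3000, 50))  # 7680.0 days -- doubled
```

The same formula shows why a die shrink that doubles per-die capacity can offset a sizable drop in rated P/E cycles: lifetime scales with capacity times cycles, not cycles alone.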
Given the above, fixing the die count kneecaps all of these no-brainer architectural improvements. This rings even truer when you consider how the cost of each chip (full of dies) will drop over the coming years. Don’t take my word for it – this chart is from the paper itself:
Each die shrink doubles capacity, so even under the Constant Die Count restriction, longevity is effectively maintained. These die shrinks have also brought other improvements to flash architecture, both increasing throughput and decreasing latency. Another thing sidestepped by the paper is the difference in quality of flash memory. Quality varies widely among manufacturers, as highlighted in a recent presentation at the Flash Memory Summit, where several different flash memory types were tested. While some flash actually performed below spec, others performed well above it.
2x nm MLC flash outperforming its rated spec.
If you flip through that presentation yourself, you will see an ironic twist: many of the larger feature size parts (3x nm / 5x nm) performed below their specification. This is because as manufacturers have shrunk the dies, they have simultaneously improved the process. Further, a given process node is refined throughout its life cycle. For example, Intel’s 25nm flash did not appear in consumer and, later, enterprise devices until months after its manufacture began. This is because while CPU production bases its yield on a go / no-go test (a part either works or it doesn’t), flash memory has an added metric of quality. As a plant fine-tunes its production, not only does it get a higher yield, it gets higher-endurance parts as well. Where CPUs are binned according to their maximum speed, flash memory is binned according to both its speed and its relative quality (endurance). IMFT announced 20nm flash last year, but it’s not in shipping products because the quality and endurance are not yet mature enough for mass production and use.
The point I want all of you to take home is that, just as with CPUs, RAM, or any other industry involving wafers and dies, manufacturers will adapt to and overcome the hurdles they meet. There is always another way, and when the need arises, manufacturers will figure it out. The main drivers of this paper were TLC (and higher bit counts) and a Constant Die Count controller architecture. Nobody besides the authors really takes TLC seriously, especially for enterprise use, and no SSD manufacturer worth their salt is going to keep die counts fixed. Both of these fallacies effectively negate the entire premise of the paper.
To be clear – I am not blind to the issues flash memory faces moving forward. That said, I can’t understand the motives of this paper, or why its authors overlooked important points and skewed their results so severely; to me it is just bad science. If you want to make a point, don’t overreach to the point of undermining your message.