Intel SSD 660p 1TB SSD Review - QLC Goes Mainstream

Subject: Storage
Manufacturer: Intel
Tagged: ssd, SMI, QLC, Intel, 660p, 512GB, 3d nand, 2TB, 1TB

PC Perspective Custom SSD Test Suite Introduction

Back in late 2016, we implemented a radically new test methodology. I'd grown tired of making excuses for benchmarks not meshing well with some SSD controllers, and that problem was amplified significantly by recent SLC+TLC hybrid SSDs that can be very picky about their workloads and how they are applied. The complexity of these caching methods has effectively flipped the SSD testing ecosystem on its head. The vast majority of benchmarking software and test methodologies out there were developed around non-hybrid SLC, MLC, or TLC SSDs. All of those types were very consistent once a given workload was applied to them for long enough to reach a steady state condition. Once an SSD was properly prepared for testing, it would give you the same results all day long. Not so for these new hybrids. The dynamic nature of the various caching mechanisms at play wreaks havoc on modern tests. Even trace playback tests such as PCMark falter, as the playback of traces is typically done with idle gaps truncated to a smaller figure in the interest of accelerating the test. Caching SSDs rely on those same idle time gaps to flush their cache to higher capacity areas of their NAND. This mismatch has resulted in products like the Intel SSD 600p, which bombed nearly all of the ‘legacy’ benchmarks yet did just fine once tested with a more realistic, spaced-out workload.

To solve this, I needed a way to issue IO's to the SSD the same way that real-world scenarios do, and it needed to be in such a way that did not saturate the cache of hybrid SSDs. The answer, as it turned out, was staring me in the face.

Latency Percentile made its debut in October of 2015 (ironically, with the 950 PRO review), and those results have proven to be a gold mine that continues to yield nuggets as we mine the data even further. Weighing the results allowed us to better visualize and demonstrate stutter performance even when those stutters were small enough to be lost in more common tests that employ 1-second averages. Merged with a steady pacing of the IO stream, it can provide true Quality of Service comparisons between competing enterprise SSDs, as well as high-resolution industry-standard QoS of saturated workloads. Sub-second IO burst throughput rates of simultaneous mixed workloads can be determined by additional number crunching. It is this last part that is the key to the new test methodology.
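As a concrete illustration of that number crunching, here is a minimal sketch (in Python, and not the suite's actual code) showing how the per-IO latencies from a single sub-second burst can be reduced to percentiles and an instantaneous throughput figure via Little's Law (IOPS = queue depth / mean latency). The sample values and 4KB IO size are arbitrary.

    # Illustrative sketch only: reduce per-IO completion latencies from one
    # sub-second burst into percentiles, then derive instantaneous IOPS and
    # throughput via Little's Law (IOPS = queue depth / mean latency).
    import statistics

    def latency_percentiles(latencies_us, points=(50, 90, 99, 99.9)):
        # Nearest-rank percentiles from a list of per-IO latencies (microseconds).
        ordered = sorted(latencies_us)
        return {p: ordered[max(1, round(p / 100 * len(ordered))) - 1] for p in points}

    def burst_throughput(latencies_us, queue_depth, io_size_bytes=4096):
        # Instantaneous IOPS and MB/s for a burst issued at a fixed queue depth.
        mean_latency_s = statistics.mean(latencies_us) / 1e6
        iops = queue_depth / mean_latency_s
        return iops, iops * io_size_bytes / 1e6

    # A handful of hypothetical QD=1 4KB random read latencies (microseconds):
    sample = [78, 81, 80, 79, 85, 120, 79, 80]
    print(latency_percentiles(sample))             # {50: 80, 90: 85, 99: 120, 99.9: 120}
    print(burst_throughput(sample, queue_depth=1)) # roughly 11.7k IOPS / ~48 MB/s

The same relation holds at higher queue depths, which is how a single burst can yield both a latency distribution and a throughput data point.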

The primary goal of this new test suite is to get the most accurate sampling of real-world SSD performance possible. This meant evaluating across more dimensions than any modern benchmark is capable of. Several thousand sample points are obtained, spanning various read/write mixes, queue depths, and even varying amounts of additional data stored on the SSD. To better quantify real-world performance of SSDs employing an SLC cache, many of the samples are obtained with a new method of intermittently bursting IO requests. Each of those thousands of samples is accompanied by per-IO latency distribution data, and a Latency Percentile is calculated (for those counting, we’re up to millions of data points now). The Latency Percentiles are in turn used to derive the true instantaneous throughput and/or IOPS for each respective data point. The bursts are repeated multiple times per sample, but each completes in less than a second, so even the per-second logging employed by some of the finer review sites out there just won’t cut it.

Would you like some data with your data? Believe it or not, this is a portion of an intermediate calculation step - the Latency Percentile data has already been significantly reduced by this stage.

Each of the many additional dimensions of data obtained by the suite is tempered by a weighting system. Analyzing trace captures of live systems revealed *very* low Queue Depth (QD) under even the most demanding power-user scenarios, which means some of these more realistic values are not going to turn in the same high queue depth ‘max’ figures seen in saturation testing. I’ve looked all over, and nothing outside of benchmarks maxes out the queue. Ever. The vast majority of applications never exceed QD=1, and most are not even capable of multi-threaded disk IO. Games typically allocate a single thread for background level loads. For the vast majority of scenarios, the only way to exceed QD=1 is to have multiple applications hitting the disk at the same time, but even then it is less likely that those multiple processes will be completely saturating a read or write thread simultaneously, meaning the SSD is *still* not exceeding QD=1 most of the time. I pushed a slower SATA SSD relatively hard, launching multiple apps simultaneously, trying downloads while launching large games, etc. IO trace captures performed during these operations revealed >98% of all disk IO falling within QD=4, with the majority at QD=1. Results from the new suite will contain a section showing a simple set of results that should very closely match the true real-world performance of the tested devices.
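To make that weighting idea concrete, here is a hypothetical sketch of folding per-queue-depth results into a single figure. The weights below are invented for illustration (heavily favoring QD=1 and tapering off by QD=4, in the spirit of the trace observations above) and are not the suite's actual values.

    # Hypothetical QD weighting sketch - weights are illustrative only.
    QD_WEIGHTS = {1: 0.70, 2: 0.20, 4: 0.08, 8: 0.02}

    def weighted_result(results_by_qd):
        # results_by_qd maps queue depth -> measured MB/s (or IOPS) at that depth.
        used = {qd: w for qd, w in QD_WEIGHTS.items() if qd in results_by_qd}
        return sum(results_by_qd[qd] * w for qd, w in used.items()) / sum(used.values())

    # A drive that shines at QD=8 but stumbles at QD=1 is scored accordingly:
    print(weighted_result({1: 45, 2: 80, 4: 140, 8: 250}))   # ~64 MB/s, not 250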

While the above pertains to random accesses, bulk file copies are a different story. To increase throughput, file copy routines typically employ some form of threaded buffering, but it’s not the type of buffering you might think. I’ve observed copy operations running at QD=8, or in some cases QD=16, to a slower destination drive. The catch is that instead of constantly adding requests to hold the queue at 8 or 16 the way a saturation benchmark would, the copy routine fills the queue, lets it empty completely, and only then fills it again. The resulting speeds are therefore not what you would see at a steady QD=8, but rather a mixture of all of the queue steps from one to eight.
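A toy model of that fill-then-drain behavior makes the point. Assume the drive's throughput at each queue depth is known (the numbers below are made up): as the queue drains from eight down to one, each successive completion happens at a lower effective depth, so the blended result lands well below the steady QD=8 figure.

    # Toy model of a fill-then-drain copy queue. Per-QD throughputs are placeholders.
    def fill_then_drain_throughput(mbps_at_qd, max_qd=8, io_mb=1.0):
        total_time = 0.0
        for qd in range(max_qd, 0, -1):           # queue drains: 8, 7, ..., 1
            total_time += io_mb / mbps_at_qd[qd]  # one IO completes at this depth
        return (max_qd * io_mb) / total_time      # blended MB/s over the whole cycle

    mbps = {1: 180, 2: 260, 3: 320, 4: 370, 5: 400, 6: 420, 7: 435, 8: 450}
    print(fill_then_drain_throughput(mbps))       # ~325 MB/s, well below the QD=8 rate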

Conditioning

Some manufacturers achieve unrealistic ‘max IOPS’ figures by running tests that place a small file on an otherwise empty drive, essentially testing in what is referred to as fresh out of box (FOB) condition. This is entirely unrealistic, as even the relatively small number of files placed during an OS install is enough to drop performance considerably from the high figures seen with a FOB test.

On the flip side, when it comes to 4KB random tests, I disagree with tests that apply a random workload across the full span of the SSD. This is an enterprise-only workload that will never be seen in any sort of realistic client scenario. Even the heaviest power users are not going to hit every square inch of an SSD with random writes, and if they are, they should be investing in a datacenter SSD that is purpose-built for such a workload.

Calculation step showing full sweep of data taken at multiple amounts of fill.

So what’s the fairest preconditioning and testing scenario? I’ve spent the past several months working on that, and the conclusion I came to ended up matching Intel’s recommended client SSD conditioning pass, which is to completely fill the SSD sequentially, with the exception of an 8GB portion of the SSD meant solely for random access conditioning and tests. I add a bit of realism here by leaving ~16GB of space unallocated (even those with a full SSD will have *some* free space, after all). The randomly conditioned section only ever sees random, and the sequential section only ever sees sequential. This parallels the majority of real-world access. Registry hives, file tables, and other such areas typically see small random writes and small random reads. It’s fair to say that a given OS install ends up with ~8GB of such data. There are corner cases where files were randomly written and later sequentially read. Bittorrent is one example, but since those files are only laid down randomly on their first pass, background garbage collection should clean those up so that read performance will gradually shift towards sequential over time. Further, those writes are not as random as the more difficult workloads selected for our testing. I don't just fill the whole thing up right away though - I pause a few times along the way and resample *everything*, as you can see above.
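Laid out as pseudo-steps, the conditioning pass for a hypothetical 1TB drive looks something like the sketch below. The helper functions are stubs standing in for whatever tool actually issues the IO, and the pause points are illustrative rather than the suite's exact values.

    # Sketch of the conditioning plan described above (illustrative values only).
    DRIVE_GB       = 1000
    RANDOM_ZONE_GB = 8      # only ever sees random IO (conditioning + random tests)
    UNALLOCATED_GB = 16     # left free; even a "full" drive has some free space
    SEQ_ZONE_GB    = DRIVE_GB - RANDOM_ZONE_GB - UNALLOCATED_GB

    FILL_POINTS = (0.25, 0.50, 1.00)    # pause here and resample *everything*

    def run_random_conditioning(zone_gb):
        print(f"random conditioning of the {zone_gb}GB random-only region")

    def run_sequential_fill(start_gb, end_gb):
        print(f"sequential fill from {start_gb}GB to {end_gb}GB")

    def sample_all(fill_fraction):
        print(f"full measurement sweep at {fill_fraction:.0%} of the sequential zone filled")

    run_random_conditioning(RANDOM_ZONE_GB)
    filled = 0
    for point in FILL_POINTS:
        target = int(SEQ_ZONE_GB * point)
        run_sequential_fill(filled, target)
        filled = target
        sample_all(point)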

Comparison of Saturated vs. Burst workloads applied to the Intel 600p. Note the write speeds match the rated speed of 560 MB/s when employing the Burst workload.

SSDs employing relatively slower TLC flash coupled with a faster SLC cache present problems for testing. Prolonged saturation tests that attempt to push the drive at full speeds for more than a few seconds will quickly fill the cache and result in some odd behavior depending on the cache implementation. Some SSDs pass all writes directly to the SLC even if that cache is full, resulting in a stuttery game of musical chairs as the controller scrambles, flushing SLC to TLC while still trying to accept additional writes from the host system. More refined implementations can put the cache on hold once full and simply shift incoming writes directly to the TLC. Some more complicated methods throw all of that away and dynamically change the modes of empty flash blocks or pages to whichever mode they deem appropriate. This method looks good on paper, but we’ve frequently seen it falter under heavier writes, where SLC areas must be cleared so those blocks can be flipped over to the higher capacity (yet slower) TLC mode. The new suite and Burst workloads give these SSDs adequate idle time to empty their cache, just as they would have in a typical system. 
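The pacing concept itself is simple. Here is a minimal sketch of burst-paced writes with idle gaps; the burst size, idle time, and target file are arbitrary placeholders, not the suite's actual parameters.

    # Minimal burst-pacing sketch: short write bursts separated by idle time so a
    # cached drive can flush between them. All parameters here are arbitrary.
    import os, time

    def paced_write_bursts(path, bursts=10, ios_per_burst=256, io_size=128 * 1024,
                           idle_seconds=5.0):
        buf = os.urandom(io_size)
        with open(path, "wb", buffering=0) as f:
            for burst in range(bursts):
                start = time.perf_counter()
                for _ in range(ios_per_burst):
                    f.write(buf)
                os.fsync(f.fileno())                 # make sure the burst hit the drive
                elapsed = time.perf_counter() - start
                mb = ios_per_burst * io_size / 1e6
                print(f"burst {burst}: {mb / elapsed:.0f} MB/s")
                time.sleep(idle_seconds)             # idle gap: the cache can empty out

    # paced_write_bursts("testfile.bin")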

Apologies for the wall of text. Now onto the show!

August 7, 2018 | 12:29 PM - Posted by Mobile_Dom

holy shitballs this is pretty impressive.

I can't wait to see this with Samsung controllers and NAND. Though, at 20c per gig, getting one of these on sale will be an insane steal.

August 7, 2018 | 02:55 PM - Posted by Allyn Malventano

Totally with you on price. Intel has undercut the market a few times in the past and I’m happy to see them doing it again. I’d also like to see Samsung come down to this same price point.

February 12, 2019 | 07:44 AM - Posted by Fred (not verified)

Thought that too, when it came out. But now it's down to around 10c per gig, at least in Germany, which it should have launched at. Now I'm definitely considering getting one, but I might wait for a sale since I'm stingy.

August 7, 2018 | 12:30 PM - Posted by TropicMike (not verified)

You forgot an edit - Toshiba:

PC Perspective Compensation: Neither PC Perspective nor any of its staff was paid or compensated in any way by Toshiba for this review.

August 7, 2018 | 01:38 PM - Posted by protoCJ

Well, I'd be surprised if Toshiba paid for this review.

August 7, 2018 | 02:48 PM - Posted by Allyn Malventano

Fixed. Thanks for the catch guys!

August 11, 2018 | 12:47 PM - Posted by albert89 (not verified)

PC per has a long history of shilling for Intel and Nvidia at the cost of AMD. As far as I can tell they have no reason to change. Their motto is fake tech reviews and to hell what anyone thinks.

August 13, 2018 | 04:17 PM - Posted by Allyn Malventano

Yeah, such a long history of that (PCPer.com was previously AMDmb.com / Athlonmb.com). Also funny how our results line up with other reviews. Must be some grand conspiracy against AMD. /sarcasm

August 14, 2018 | 07:13 AM - Posted by Kareha

This is why I wish Ryan would turn verified comments back on so asshats like the previous one don't post. I don't understand why it was turned off in the first place; it made the comment sections much more bearable and pleasant to read. Now, not so much.

August 7, 2018 | 01:43 PM - Posted by Paul A. Mitchell (not verified)

Allyn, Going way back to a conversation we had many months ago (years?), given the low price per GB, is there any performance to be gained by joining these QLC devices in a RAID-0 array? The main reason why I ask is the "additive" effect of multiple SLC-mode caches that obtains with a RAID-0 array. I'm using this concept presently with 4 x Samsung 750 EVO SSDs in RAID-0 (each cache=256MB), and the "feel" is very snappy when C: is the primary NTFS partition on that RAID-0 array. How about a VROC test and/or trying these in the ASRock Ultra Quad M.2 AIC? Thanks, and keep up the good work!

August 7, 2018 | 02:53 PM - Posted by Allyn Malventano

Yeah RAID will help as it does with most SSDs. For SSDs with dynamic caches, that means more available cache for a given amount of data stored, and a better chance that the cache will be empty since the given incoming write load is spread across more devices.

August 8, 2018 | 12:21 PM - Posted by Paul A. Mitchell (not verified)

Many thanks for the confirmation. I don't have any better "measurement" tools to use, other than the subjective "feel" of doing routine interaction with Windows. But, here's something that fully supports your observation: the "feel" I am experiencing is snappier on a RAID-0 hosted by a RocketRAID 2720SGL in an aging PCIe 1.0 motherboard, as compared to the "feel" I am sensing on a RAID-0 hosted by the same controller in a newer PCIe 2.0 motherboard. The only significant difference is the presence of DRAM cache in all SSDs in the RAID-0 on the PCIe 1.0 motherboard, and the SSDs on the newer PCIe 2.0 motherboard have no DRAM caches. I would have expected a different result, because each PCIe lane in the newer chipset has twice the raw bandwidth of each PCIe lane in the older chipset. With 4 x SSDs in both RAID-0 arrays, the slower chipset tops out just under 1,000 MB/second, whereas the faster chipset tops out just under 2,000 MB/second.

August 8, 2018 | 12:31 PM - Posted by Paul A. Mitchell (not verified)

p.s. Samsung 860 Pro SSDs are reported to have 512MB LPDDR4 cache in both the 256GB and 512GB versions:

https://s3.ap-northeast-2.amazonaws.com/global.semi.static/Samsung_SSD_8...

As such, a RAID-0 array with 4 such members has a cumulative DRAM cache of 512 x 4 = 2,048MB (~2GB LPDDR4).

August 9, 2018 | 11:10 PM - Posted by Allyn Malventano

DRAM caches on SSDs very rarely cache any user data - it’s for the FTL.

August 12, 2018 | 01:40 PM - Posted by Paul A. Mitchell (not verified)

Thanks, Allyn. FTL = Flash Translation Layer
https://www.youtube.com/watch?v=bu4saRek7QM

August 7, 2018 | 03:51 PM - Posted by Dark_wizzie

So the tests are done with a practically full drive, right? Written sequentially except for the last 8GB, which is written to randomly. On a normal drive, even when My Computer says the drive is full, there is still a little bit of space left over, so you leave ~18GB of space free. So is this test simulating what it's like to have a full or close-to-full drive from the user's perspective?

Anandtech's tests made a big deal about performance changing from empty versus full. Anandtech didn't figure out when that performance drops (if it's a cliff or a gradual decline), but it almost makes the reader feel like you might want to buy double the capacity you normally need just to be safe. It's probably not that bad, but it feels like that emotionally.

August 7, 2018 | 04:22 PM - Posted by Allyn Malventano

Performance gains due to the drive being empty typically level out once you hit 10-20% or so (lower if you’ve done a bunch of random activity like a Windows install, etc.). My suite does a full pass of all measurements at three capacity points and then applies a weighted average to reach the final result. The average weighs half-full and mostly-full performance more heavily than mostly-empty performance. The results you see in my reviews are in line with what you could expect from actual use of the drive.

August 7, 2018 | 05:07 PM - Posted by Power (not verified)

"Heavy sustained workloads may saturate the cache and result in low QLC write speeds."

Looks like up to a third of good HDD level, right? Scary.

August 9, 2018 | 11:11 PM - Posted by Allyn Malventano

A third sequentially. Random on HDD is still utter crap. Also, it’s extremely hard to hit this state in actual use. I was trying. 

August 7, 2018 | 07:32 PM - Posted by asdf1 (not verified)

hey Allyn, is there a way to include a few more tests? One that examines QLC sequential write performance once the SLC buffer fills up, and another similar to Anand's sequential fragmentation testing of sequential read/write performance.

August 9, 2018 | 11:17 PM - Posted by Allyn Malventano

The sustained write performance appears in two tests - saturated vs. burst (where I show it at variable QD - something nobody else does), and on the cache test, where you can see occasional dips to SLC-> QLC folding speed. Aside from a few hiccups it did very well and was able to maintain SLC speed during the majority of a bunch of saturated writes in a row. If you need more than that out of your SSD and the possibility of a slow down is unacceptable, then QLC is not for you and you’ll need to step up to a faster part.

August 7, 2018 | 07:34 PM - Posted by asdf1 (not verified)

oh and FFS PLEASE PLEASE remove google recaptcha, it's a waste of time; it took me TEN minutes to solve just to make 1 post

August 8, 2018 | 09:13 AM - Posted by Anonymous2 (not verified)

And you wasted it on that?

August 8, 2018 | 09:34 PM - Posted by ReCrapThisGoogleYouSuck (not verified)

Google Recaptcha and street signs! All those damn street signs and no proper explanation of just what Google considers a street sign. If you get too good at solving the ReCrapAtYa the AI thinks you are an automated bot!

Google's ReCrapAtYa AI has gone over to the HAL9000 side and is evil to the power of 1 followed by 100 zeros! Just like Google's search AI that forces you to accept its twisted judgment of just what it thinks you are looking for, which is not actually what you were looking for. Google's search engine has become the greatest time thief in the history of research.

Google's Recaptcha AI is the damn Bot and Google search now returns mostly useless spam results. Google is a threat to civilization!

August 9, 2018 | 11:13 PM - Posted by Allyn Malventano

Sorry. Without that we spend more time culling spam posts than we do writing articles. 

August 11, 2018 | 08:01 AM - Posted by EddieObscurant (not verified)

Nice review, Allyn. The DRAM on the 660p is 256MB and not 1GB. http://www.nanya.com/en/Product/3969/NT5CC128M16IP-DI#
You can also confirm it with the other reviews of the 660p.

Why do you think Intel chose that size instead of the classic 1MB of DRAM per 1GB of NAND?

Do you think it hampered performance?

August 15, 2018 | 12:10 PM - Posted by Anonymoo (not verified)

Dumb question time:
is it possible to make the entire drive work in SLC mode? With the size of the drives these days I could sacrifice the space for the speed and reliability.

August 15, 2018 | 02:23 PM - Posted by Allyn Malventano

So long as you only partition/use 1/4 of the available capacity, the majority of the media should remain in SLC-only mode.

August 16, 2018 | 04:52 PM - Posted by Anonymoo (not verified)

I wonder if there is a way to force it at the firmware level. Might be a good selling feature. I am sure i am not the only overcautious nerd who would value a modern 'SLC' drive.

August 16, 2018 | 08:36 PM - Posted by pgj1 (not verified)

I didn't see any mention of which NVMe drivers were used during this review. Not sure if the Windows drivers are much different than Intel's own drivers.

August 18, 2018 | 12:32 PM - Posted by JokesOnYou77

@Allyn, you mentioned in the podcast that you weren't able to saturate the writes with a copy. Rather than doing a copy have you considered creating data in RAM and then writing that? For example, create a huge numpy float and write it as binary to disk. Or a simple C program that just writes random noise to disk in a while 1 loop. Maybe even just pipe /dev/urandom to a file in several different terminals at once.
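For illustration, a minimal Python sketch of that idea (generate a buffer in RAM once, then write it in a loop so the data source is never the bottleneck) could look like the following; the path and sizes are arbitrary, and this is not how the review's testing was done.

    # Sketch of the suggestion above: random data generated once in RAM, written in
    # a loop so the source is not the limiter. Illustrative only.
    import os

    def hammer_writes(path, buffer_mb=64, total_gb=32):
        buf = os.urandom(buffer_mb * 1024 * 1024)    # generated once, reused
        written, target = 0, total_gb * 1024 ** 3
        with open(path, "wb", buffering=0) as f:
            while written < target:
                f.write(buf)
                written += len(buf)
            os.fsync(f.fileno())                     # flush everything to the drive

    # hammer_writes("/mnt/testdrive/fill.bin")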

August 27, 2018 | 09:14 AM - Posted by Nick (not verified)

Hello, Allyn!
Did you use IPEAK to create custom trace-based test suite?

September 6, 2018 | 04:49 PM - Posted by Allyn Malventano

IPEAK and similar developer tools were used to capture traces, but our suite's playback workloads are based on analysis of those results, not directly playing back the streams. We do this so that we can properly condition and evaluate SSDs of varying capacities, etc.

September 10, 2018 | 05:50 PM - Posted by Cerebralassassin

May I ask when these 660p NVMe SSDs will be readily available in the marketplace? I see the 512GB model at Newegg.com, but not that SKU or any other at Newegg.ca or Amazon.ca or anywhere... :( I would like to buy the 1TB model personally.

January 29, 2019 | 03:44 PM - Posted by cw (not verified)

Don't buy from the evil non-tax-paying Intel corporation. Crucial has a new 1TB QLC NVMe SSD, write endurance 200TB, 1GB DRAM cache, at newegg.ca (CA$192, US$145):

https://www.newegg.ca/Product/Product.aspx?Item=N82E16820156199&Descript...

November 11, 2018 | 10:41 PM - Posted by CrazyTasty (not verified)

First of all, thanks for all of your ridiculously in-depth storage reviews. PC Perspective is my first, and usually only, stop when looking to purchase new storage.

Second, I believe there is a typo on the "Conclusion" page. You listed the 2TB endurance as "200TBW" instead of the "400TBW" Intel specs it as on ARK.

Happy Veterans Day from a fellow vet. Thank you for your service!

December 29, 2018 | 04:30 PM - Posted by ClearStale (not verified)

All three capacities have 256MB of DRAM, not 1GB. This was already pointed out by a previous reader.

Also, the 660p uses a static SLC cache that is 6GB, 12GB, or 24GB, along with a dynamic SLC pool.

It's possible this drive is using Host Memory Buffer or compressing the LBA map.
