IBM knows how to go big or go home, and their Almaden, California research lab’s current storage project exemplifies that nicely. With a data repository that dwarfs anything available today, IBM is designing a 120 Petabyte storage array. Composed of 200,000 hard drives, the new system is expected to house approximately 1 trillion files, or 24 billion 5MB MP3 files. To put that in perspective, Apple had sold 10 billion songs as of February 24, 2010; you could store every song sold since the iTunes Store’s inception twice and still have room to spare!
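For anyone who wants to check the back-of-envelope math, here is a quick sketch (assuming decimal units, i.e. 1 PB = 10^15 bytes and 1 MB = 10^6 bytes):

```python
# Back-of-envelope check of the MP3 claim (decimal units assumed).
PETABYTE = 10 ** 15   # bytes
MEGABYTE = 10 ** 6    # bytes

capacity = 120 * PETABYTE
mp3_size = 5 * MEGABYTE

print(f"{capacity // mp3_size:,} five-megabyte MP3s")  # 24,000,000,000 -- 24 billion
```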

More specifically, the Almaden engineers have developed new hardware and software techniques to combine all 200,000 hard drives into horizontal drawers that are then stacked into rack mounts. To cram as many disks as possible into each vertical rack, IBM made the drawers “significantly wider than usual” and cools the densely packed disks with circulating water. On the software side, IBM has refined their disk parity and mirroring algorithms so that the computer can keep working at near-full speed even when a drive fails. If a single disk dies, the system pulls copies of its data from other drives and writes them to a replacement disk, allowing the supercomputer to keep processing data. The algorithms also control the speed of the rebuild and can adapt if multiple drives begin failing at once.
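IBM hasn’t published the details of those algorithms, but the general idea of replica-based rebuild with an adaptive rebuild rate looks something like the sketch below. Everything here (the replica map, the drive names, the rate thresholds) is a hypothetical illustration, not IBM’s actual implementation:

```python
# Hypothetical sketch of replica-based rebuild with an adaptive rebuild rate.
# The replica layout and the rate thresholds are illustrative assumptions.

REPLICAS = {                      # block -> drives holding a copy of it
    "block-A": ["drive-1", "drive-7", "drive-42"],
    "block-B": ["drive-2", "drive-7", "drive-99"],
}

failed_drives = set()

def rebuild_rate(num_failed: int) -> str:
    """Rebuild slowly while redundancy is healthy (so foreground I/O keeps
    near-full speed) and ramp up as more drives fail."""
    if num_failed <= 1:
        return "low"
    if num_failed == 2:
        return "medium"
    return "high"

def rebuild(failed: str, replacement: str) -> None:
    failed_drives.add(failed)
    rate = rebuild_rate(len(failed_drives))
    for block, holders in REPLICAS.items():
        if failed in holders:
            # Pull the block from a surviving replica and copy it onto the
            # replacement disk while normal processing continues.
            source = next(d for d in holders if d not in failed_drives)
            print(f"copy {block}: {source} -> {replacement} ({rate} rate)")
            holders[holders.index(failed)] = replacement
    failed_drives.discard(failed)

rebuild("drive-7", "drive-100")
```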

In addition to physically spreading data across the drives, IBM is also using a new file system to keep track of all the files in the array. Known as the General Parallel File System (GPFS), it stripes each file across multiple disks so that many parts of a file can be written and read simultaneously, resulting in massive speed increases when reading. The file system also uses a new method of indexing that lets it keep track of billions of files without scanning through every one. GPFS has already blown past the previous indexing record of one billion files in three hours, indexing 10 billion files in just 43 minutes.
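To see why striping helps, here is a minimal sketch of the idea; the chunk size, disk count, and thread-pool stand-in are my own illustrative choices, not how GPFS is actually configured:

```python
# Minimal illustration of file striping: split data into fixed-size chunks
# and write them round-robin across several "disks" in parallel.
from concurrent.futures import ThreadPoolExecutor

NUM_DISKS = 4
CHUNK_SIZE = 1024 * 1024          # 1 MB stripes (illustrative)

def write_chunk(disk_id: int, stripe: int, data: bytes) -> None:
    # Stand-in for a real per-disk write; each disk works independently.
    print(f"disk {disk_id}: wrote {len(data)} bytes for stripe {stripe}")

def striped_write(data: bytes) -> None:
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=NUM_DISKS) as pool:
        for stripe, chunk in enumerate(chunks):
            pool.submit(write_chunk, stripe % NUM_DISKS, stripe, chunk)  # round-robin

striped_write(b"x" * (5 * 1024 * 1024))   # a pretend 5 MB "MP3"
```

Reads work the same way in reverse: each stripe can be fetched from its own disk at the same time, so throughput grows with the number of disks involved.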

Bruce Hillsberg, IBM’s director of storage research, told Technology Review that these algorithms result in a storage system that should not lose any data for a million years, without compromising performance. Hillsberg further indicated that while this 120 Petabyte storage array is on the “lunatic fringe” today, storage is becoming more and more important for cloud computing, and just keeping track of the file names, types, and attributes will consume approximately 2 Petabytes of the array’s capacity.
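As a rough sanity check on that metadata figure (the one-trillion-file count and the per-file interpretation below are my own assumptions based on the numbers above):

```python
# Rough check: metadata per file implied by ~2 PB of metadata for ~1 trillion files.
PETABYTE = 10 ** 15
files = 10 ** 12                           # ~1 trillion files, per the article
metadata = 2 * PETABYTE
print(metadata / files, "bytes per file")  # 2000.0 -> roughly 2 KB per file
```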

The array is currently being built for a yet-to-be-announced client and will likely be used for High Performance Computing (HPC) projects that store massive amounts of modeling and simulation data. Projects that could benefit from this kind of capacity include global weather modeling, seismic imaging, the Large Hadron Collider (LHC), and molecular simulations.

Storage research moves at an amazing pace and seems to advance constantly despite pesky details like heat, fault tolerance, areal density walls, and storage media. While a 120 Petabyte array built from 200,000 hard drives is out of reach for just about anyone without federal funding or a Fortune 500 company’s expense account, the technology itself is definitely interesting, and its advancements will eventually trickle down to consumer drives.

Image Copyright comedy_nose via Flickr Creative Commons