Storage Magazine has announced its list of Best Storage Products from 2006. Storage hardware is definitely getting smarter. Two similar features that spread through vendor products in 2006 are deduplication and single-instance storage (SIS). Both technologies try to identify data redundancies and avoid storing multiple copies of the same data.
Compliance requirements are significantly driving up the amount of electronic data being stored, particularly emails and instant messages. Consider an email message with a 3 MB attachment sent to 15 staff members: archiving all 15 copies would store 45 MB where 3 MB would do. Email archive software typically uses a single-instance storage (SIS) algorithm, so that only one copy is stored and each archived message holds a reference, or pointer, to the shared copy.
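The email example can be sketched in a few lines of Python. This is a toy model, not any vendor's implementation: the `SingleInstanceStore` class, its method names, and the choice of SHA-256 to key the content are all assumptions made for illustration.

```python
import hashlib


class SingleInstanceStore:
    """Toy single-instance store: one stored copy per distinct content."""

    def __init__(self):
        self.blobs = {}       # digest -> bytes, the single shared copy
        self.references = {}  # (mailbox, name) -> digest, i.e. the pointer

    def archive(self, mailbox, name, content):
        digest = hashlib.sha256(content).hexdigest()
        # Keep the bytes only the first time this content is seen.
        self.blobs.setdefault(digest, content)
        self.references[(mailbox, name)] = digest

    def read(self, mailbox, name):
        # Follow the pointer to the shared copy.
        return self.blobs[self.references[(mailbox, name)]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


attachment = b"x" * (3 * 1024 * 1024)  # stand-in for the 3 MB attachment
store = SingleInstanceStore()
for i in range(15):
    store.archive(f"staff{i}", "report.pdf", attachment)

print(len(store.references), "references,", store.stored_bytes(), "bytes stored")
```

All 15 mailboxes can still read the attachment, but the store holds the 3 MB payload only once.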
Deduplication technology looks beyond file-level storage and tries to identify redundancies at the sub-file or block level. As with SIS, when replicated data is identified, one copy is retained and referenced from multiple places.
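To show why working below the file level matters, here is a minimal sketch of block-level deduplication. It assumes fixed-size 4096-byte blocks and SHA-256 signatures; real products may use variable-size chunks and other hashes, and the names `dedup_files` and `restore` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size


def split_blocks(data, size=BLOCK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedup_files(files):
    """Return (block_store, recipes) for a dict of name -> bytes."""
    block_store = {}  # digest -> block bytes, one copy per distinct block
    recipes = {}      # name -> ordered list of digests (the references)
    for name, data in files.items():
        recipe = []
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            block_store.setdefault(digest, block)
            recipe.append(digest)
        recipes[name] = recipe
    return block_store, recipes


def restore(name, block_store, recipes):
    # Rebuild a file by following its references in order.
    return b"".join(block_store[d] for d in recipes[name])


# Two 4-block files that differ only in their last block.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))
edited = original[:-BLOCK_SIZE] + b"\xff" * BLOCK_SIZE
files = {"v1.bin": original, "v2.bin": edited}

block_store, recipes = dedup_files(files)
print(len(block_store), "unique blocks stored for 8 logical blocks")
```

File-level SIS would keep both files in full because they are not identical. At the block level, the three shared blocks are stored once and only the differing block adds to the total.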
Deduplication is particularly useful for backups because only the data blocks that are not already stored need to be backed up. The deduplication process can be broken down into three steps:
- Analysis: An image of the backup data is analyzed. The analysis may use hints such as metadata and file and path names.
- Redundancy Identification: Based on the analysis, the process identifies which pieces of data are redundant. Rather than comparing bit by bit, it typically applies a hash algorithm to each block of data, producing a signature that is, for practical purposes, unique to that block's contents. Blocks are treated as redundant when their hash signatures match. Common hash algorithms include SHA-1, MD5, or some proprietary method. Bit-by-bit comparisons are sometimes done too; they give the most exact identification, but the method is very I/O and compute intensive.
- Redundancy Elimination: Reference pointers are created in place of the redundant data. Depending on the product vendor, either forward or reverse referencing is used. Reverse referencing points each later duplicate back to the first occurrence of the data. Forward referencing writes out the current data and updates the previous occurrence to be a pointer to the new area.
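
The identification and elimination steps can be sketched together. The code below is an illustrative model only: the `deduplicate` and `read_block` functions, the in-memory `layout` list, and the `verify` and `forward` flags are assumptions, not any product's design. It uses SHA-1 signatures, optionally confirms a match with a byte comparison, and then creates either reverse or forward references.

```python
import hashlib


def deduplicate(blocks, forward=False, verify=True):
    layout = []  # one entry per input block: ("data", bytes) or ("ref", index)
    holder = {}  # digest -> index of the entry that currently holds the data
    for i, block in enumerate(blocks):
        digest = hashlib.sha1(block).digest()
        j = holder.get(digest)
        if j is None:
            # First time this signature is seen: keep the data.
            layout.append(("data", block))
            holder[digest] = i
        elif verify and layout[j][1] != block:
            # Same signature but different bytes (a collision): keep it unshared.
            layout.append(("data", block))
        elif forward:
            # Forward reference: the new copy keeps the data and the
            # previous occurrence becomes a pointer to it.
            layout.append(("data", block))
            layout[j] = ("ref", i)
            holder[digest] = i
        else:
            # Reverse reference: point back at the first occurrence.
            layout.append(("ref", j))
    return layout


def read_block(layout, index):
    kind, value = layout[index]
    while kind == "ref":  # follow pointers until the data is reached
        kind, value = layout[value]
    return value


blocks = [b"A" * 8, b"B" * 8, b"A" * 8, b"A" * 8]
reverse = deduplicate(blocks)
forward = deduplicate(blocks, forward=True)
print([kind for kind, _ in reverse])
print([kind for kind, _ in forward])
```

With reverse referencing the data stays where it was first written and later duplicates point back to it. With forward referencing the data moves to the newest occurrence, so in this sketch older entries can form a chain of pointers that `read_block` follows.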
Another consideration is whether deduplication is performed in-band or out-of-band. In-band means that deduplication is performed as the data is collected and written. Out-of-band means that deduplication is a secondary, asynchronous process that runs later.
In-band processing can slow down the backup because each piece of data must be deduplicated before it is written. Out-of-band processing first dumps a complete backup to disk and then applies deduplication.
Out-of-band processing has the advantage that deduplication can be run across parallel processes. Its disadvantages are that the data may need to be read more times during the deduplication process and that, if multiple processes are used, there could be disk contention problems. Out-of-band also requires that the original data image be captured in full and then shrunk during deduplication, which means that more disk space needs to be available.
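The disk space trade-off can be illustrated with a small model. This sketch makes several assumptions: the "disk" is just a Python dict, signatures are SHA-256, and the out-of-band peak counts only the staged image, so it is a lower bound. It does not model timing, parallelism, or disk contention, and the function names are hypothetical.

```python
import hashlib


def signature(block):
    return hashlib.sha256(block).digest()


def in_band(stream):
    """Deduplicate each block before writing; return (store, peak_bytes)."""
    store, used, peak = {}, 0, 0
    for block in stream:
        sig = signature(block)
        if sig not in store:
            store[sig] = block
            used += len(block)
        peak = max(peak, used)
    return store, peak


def out_of_band(stream):
    """Stage the full image first, then deduplicate it in a later pass."""
    # Phase 1: land the complete, undeduplicated backup image on disk.
    staged = list(stream)
    peak = sum(len(block) for block in staged)
    # Phase 2: re-read the staged image and shrink it.
    store = {}
    for block in staged:
        store.setdefault(signature(block), block)
    return store, peak


backup = [b"A" * 1024] * 6 + [b"B" * 1024] * 2
store_in, peak_in = in_band(backup)
store_out, peak_out = out_of_band(backup)
print("in-band peak:", peak_in, "bytes; out-of-band peak:", peak_out, "bytes")
```

Both approaches end with the same two unique blocks, but the out-of-band run needs room for all eight blocks before it can shrink the image.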