Data in digital form offers many advantages. A large amount of data can be stored in a very small space. It can be processed and converted using IT systems. You can look up a particular piece of information just by pressing a few keys, no matter how large the data is.
For all the convenience digital data offers, it also suffers from a drawback: duplication. A large amount of storage may be occupied by nothing but duplicate files and records. This is especially evident in data-intensive applications without any de-duplication strategy in place. Where data is poorly managed, there can be many instances of the same file, resulting in storage shortages.
Automated biometric identification
Identification on the basis of a person's biometric identifiers, such as fingerprint, iris pattern and voice, is called biometric identification. When it is performed by automated systems, it becomes automated biometric identification. In this case, human intervention may not be required and identification can be performed solely by the biometric system. The process of biometric identification is essentially the process of searching pre-established identification data for a match to an unclaimed identity on the basis of its biometric identifiers.
Biometrics is already shaping up as the future of identification and authentication, and its coming applications will generate large volumes of biometric data. Automated biometric identification is expected to play a key role in those applications. Because this data is crucial for identification and sensitive in nature, duplication in biometric data is intolerable.
Why does duplication occur?
Data duplication is a very common occurrence in today's digitized life. A simple drag and drop can create a duplicate copy of your personal data. At a personal level this may not affect us much; in large data-centric business applications, however, data duplication can be a real concern.
Digitization is on the rise, and that is a good thing, as digital data can be handled more efficiently than data stored in other forms (such as paper-based documents). Despite being easy to process, digital data suffers from the problem of duplicates. When backed up, processed or transferred in large volumes, digital data can be prone to duplication. De-duplicating a large volume of data is an intimidating but important task.
What is biometric de-duplication?
Data de-duplication is a process that eliminates redundant copies of data and reduces the overhead of storing information. Thus, this technology is aimed at optimizing storage capacity. Biometric de-duplication, however, is more concerned with removing duplicates because of their criticality in identification and authentication. Finding and eliminating duplicate biometric records in a large collection of biometric data is called biometric de-duplication.
Regardless of the method employed, data de-duplication ensures that only one unique piece of information is stored on the media. In this regard, an important point of this technology is the level of detail. De-duplication can be done at the file, block, and byte level. Each method has its advantages and disadvantages.
Why does duplication occur in biometric data in the first place?
While performing data-related operations such as transfer, migration, backup and editing (especially when dealing with large volumes of data), duplication may occur. Sometimes duplication in biometric data occurs due to errors during enrollment. In large-scale biometric data collection drives, e.g. biometric voter ID or national ID campaigns, biometric data is collected at different locations and transferred to the main database. In such large-scale drives, the possibility of data duplication rises.
Why is biometric data de-duplication required?
Duplicates in biometric data may reduce reliability, make the system inefficient and lead to errors in identification. When there are duplicates in the stored data, the system has to work longer to index and compare records because it has to go through the duplicate records as well. The time taken to answer a query also increases, yet the reliability of identification remains questionable despite the extra work. Duplicates in biometric data also consume storage that could otherwise hold clean data. All of this takes a significant toll on system resources and cost.
Storage de-duplication technology
Storage based de-duplication eliminates redundant files and reduces the amount of space required to store data. It results in many benefits which are discussed later.
One of the most common methods for data de-duplication is comparing and detecting duplicates in chunks of data. To enable this compare-and-detect methodology, chunks of data are assigned an identifier using a cryptographic hash function.
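As an illustrative sketch (not any particular product's implementation), the compare-and-detect idea can be expressed in a few lines of Python, assuming SHA-256 as the hash function and a dictionary standing in for the storage backend:

```python
import hashlib

def chunk_id(chunk: bytes) -> str:
    """Return a hex identifier for a chunk using SHA-256 (an assumed choice)."""
    return hashlib.sha256(chunk).hexdigest()

store = {}  # chunk id -> chunk data (stand-in for the storage backend)

def save_chunk(chunk: bytes) -> str:
    cid = chunk_id(chunk)
    if cid not in store:  # only unique chunks are physically stored
        store[cid] = chunk
    return cid            # callers keep the id as a pointer to the data

ids = [save_chunk(b"hello"), save_chunk(b"world"), save_chunk(b"hello")]
# three logical chunks were saved, but only two physical copies are kept
```

Identical chunks hash to the same identifier, so the store never holds a second copy; callers hold only the cheap identifier.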
File level de-duplication
File level de-duplication compares a file with the files already saved. If the file is unique, it is saved; if such a file already exists on the device, only a pointer (link) to the existing file is saved. Thus only one copy of the file is ever stored, and subsequent copies are linked to the original. The advantages of this method are simplicity, speed and almost no performance degradation.
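A minimal sketch of this scheme, assuming SHA-256 digests as file identities; the `dedupe_files` helper and its `link ->` pointer notation are illustrative, not a real storage API:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Hash a whole file in 64 KB pieces to avoid loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def dedupe_files(paths):
    """Map each file to 'stored' or to a pointer at an identical earlier file."""
    index = {}   # digest -> first path seen with that content
    result = {}
    for p in map(Path, paths):
        d = file_digest(p)
        if d in index:
            result[p.name] = f"link -> {index[d].name}"  # pointer, not a copy
        else:
            index[d] = p
            result[p.name] = "stored"
    return result
```

Only whole-file matches are detected here; a file that differs by a single byte is stored in full, which is exactly the limitation block-level de-duplication addresses.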
Block level de-duplication
Block-level de-duplication is the most common de-duplication method. It analyses a piece of data (a file) and stores only one copy of each unique block. A block is a logical unit, so it can have a varying size (length). All data fragments are processed using a hash algorithm, such as MD5 or SHA-1, which creates an identifier (signature) for each unique block and stores it in the de-duplication database.
If the file changes during its life cycle, only the modified blocks are written to storage, not the entire file, even if only a few bytes have changed.
There are two types of block de-duplication: with fixed and with variable block lengths. Variable-length de-duplication breaks files into blocks of different sizes, resulting in a higher reduction rate for data storage than fixed-length blocks. Its disadvantages include lower speed and the creation of a large amount of metadata.
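The fixed-length variant can be sketched as follows. The 4-byte block size and the dictionary "database" are purely illustrative (real systems use block sizes in the kilobyte range); the point is that editing a few bytes re-stores only the affected block:

```python
import hashlib

BLOCK = 4  # tiny fixed block size for illustration only

def store_blocks(data: bytes, db: dict) -> list:
    """Store unique blocks in db (signature -> block); return the file's recipe."""
    recipe = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        sig = hashlib.sha256(block).hexdigest()
        db.setdefault(sig, block)  # store the block only if it is new
        recipe.append(sig)         # the recipe reconstructs the file from blocks
    return recipe

db = {}
v1 = store_blocks(b"AAAABBBBCCCC", db)
v2 = store_blocks(b"AAAABxBBCCCC", db)  # one byte changed in the middle block
# only the modified middle block is stored again; first and last are shared
```

After both versions are stored, the database holds four unique blocks rather than six, and the two recipes share their first and last entries.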
De-duplication at the byte level
De-duplication at the byte level works on the same principles as de-duplication at the block level, except that new and modified files are compared byte by byte instead of block by block. This is the only method that guarantees the complete elimination of duplicate data, but it has very high performance requirements.
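A toy sketch of the byte-level idea: compare two versions byte by byte and keep only the bytes that actually differ (it assumes equal-length inputs for simplicity, which real implementations do not require):

```python
def byte_delta(old: bytes, new: bytes) -> dict:
    """Byte-by-byte comparison; return only positions whose bytes differ."""
    return {i: b for i, (a, b) in enumerate(zip(old, new)) if a != b}

def apply_delta(old: bytes, delta: dict) -> bytes:
    """Reconstruct the new version from the old one plus the stored delta."""
    out = bytearray(old)
    for i, b in delta.items():
        out[i] = b
    return bytes(out)
```

Every duplicated byte is eliminated, but the cost is a full scan of both versions, which is why this approach is so demanding at scale.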
From the above, it can be argued that de-duplication at the block level is the optimal approach: it is much more effective than file-level de-duplication and not as resource-intensive as the byte-based method, although it still requires serious processing power.
De-duplication on the basis of place of execution
In backup, in addition to the methods described above, de-duplication can also differ by place of execution: at the data source (client), on the storage device side (server), or as joint work of client and server.
Client side de-duplication
Client side de-duplication is performed directly at the source, using only its computational resources. After de-duplication, the data is transferred to the storage device. Client side de-duplication is always implemented in software. The disadvantage of this method is the high load on the client's processor and memory; the advantage is the ability to transfer data over networks with low bandwidth.
Server side de-duplication
Server side de-duplication is possible when data is transferred to the server entirely in its raw (original) form, without compression or encoding. De-duplication on the server can be hardware-based or software-based.
Hardware-based de-duplication is performed by a de-duplication appliance, a separate hardware solution that combines the logic of de-duplication and data recovery. The advantage of the hardware method is the ability to offload the entire de-duplication burden from the server to a separate device and make de-duplication itself a completely transparent process.
Software-based de-duplication uses specialized software that performs all the de-duplication work itself. In this case, however, the de-duplication load it places on the server must be taken into account.
Client server de-duplication
With collaborative client-server de-duplication, processing is performed on both the client and the server. Before sending data from the client to the server, the two devices first work out what data is already in storage: the client calculates a hash for each data block and sends the server a sequence of hash keys.
The server compares the received hash keys against its hash table and replies with a list of the keys it does not yet have. Only then does the client send the corresponding data blocks. The effectiveness of this method comes from sharing the processing between client and server and from the low network load, since only unique data is transmitted.
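The two-phase exchange described above can be sketched like this, with a dictionary standing in for the server's hash table (the class and function names are illustrative):

```python
import hashlib

def sig(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

class Server:
    def __init__(self):
        self.table = {}  # signature -> stored block

    def missing(self, signatures):
        """Phase 1: report which signatures are not yet in the hash table."""
        return [s for s in signatures if s not in self.table]

    def receive(self, blocks):
        """Phase 2: accept only the blocks the server asked for."""
        for b in blocks:
            self.table[sig(b)] = b

def client_send(server: Server, blocks) -> list:
    """Client hashes everything first and transmits only the unknown blocks."""
    sigs = [sig(b) for b in blocks]
    wanted = set(server.missing(sigs))
    server.receive([b for b in blocks if sig(b) in wanted])
    return sigs  # recipe the client keeps to reconstruct the data later
```

On a second backup of mostly unchanged data, phase 1 filters out nearly everything, so only the small list of hashes plus the genuinely new blocks cross the network.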
Rolling hash and Rabin fingerprint de-duplication
Computing a hash value over a few consecutive bytes at every byte position in the data stream solves the problem of determining chunk boundaries: a boundary is declared wherever the hash value matches a predefined pattern. A rolling hash is a hash function that uses a sliding window over the input and provides a hash value at each position.
Named after the Israeli mathematician and computer scientist Michael O. Rabin, the Rabin fingerprint is a hashing function used in data de-duplication to fingerprint data chunks.
A fingerprint is a hash value specific to a particular chunk. Since hash functions are collision resistant, two data blocks with the same fingerprint are (with very high probability) identical, so only one of them needs to be stored, avoiding wasted storage resources. If two data blocks differ, their fingerprints will differ, and both will be stored.
The Rabin fingerprint is used to find the boundaries of a data chunk, which in turn identifies the chunk. With a sliding window, the fingerprint of each window can be computed quickly from the previous fingerprint.
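A simplified content-defined chunker, using a Rabin-Karp-style polynomial rolling hash as a stand-in for a true Rabin fingerprint; the window size, base, modulus and boundary mask below are illustrative choices (the mask targets an average chunk of roughly 256 bytes, far smaller than production systems use):

```python
WINDOW, BASE, MOD, MASK = 16, 257, (1 << 61) - 1, (1 << 8) - 1

def chunk_boundaries(data: bytes):
    """Yield cut points where the windowed rolling hash matches the pattern."""
    h = 0
    pow_top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:                   # slide: subtract the oldest byte
            h = (h - data[i - WINDOW] * pow_top) % MOD
        h = (h * BASE + byte) % MOD       # shift in the new byte
        if i >= WINDOW - 1 and (h & MASK) == 0:
            yield i + 1                   # declare a boundary after this byte
```

Because each boundary depends only on the bytes inside its window, inserting data near the start of a stream shifts the later boundaries but does not destroy them, so unchanged chunks keep their fingerprints and are recognized as duplicates.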
De-duplication vs. compression
De-duplication, as the name signifies, is a strategy to eliminate duplicates from data. A typical example would be a document distributed to all employees of a large organization. If the document is 10MB in size and there are 1,000 employees, it will cost the organization nearly ten gigabytes of storage, where only 10MB would suffice if a single instance of the document were stored and shared.
This is just a small example. Take the same scenario to a cloud storage service on which millions of people store their data. Many of those files may be media (music albums, movies, ebooks, etc.) that are identical across users. If just one instance of each such file is stored and a reference served to all users, the reduction in storage requirements is huge.
| Scenario | Content | Usual space savings |
| --- | --- | --- |
| User documents | Office documents, photos, music, videos, etc. | 30-50% |
| Deployment shares | Binary software files, CAB files, symbols, etc. | 70-80% |
| Virtualization libraries | ISO images, virtual hard disk files, etc. | 80-95% |
| File shares | All of the above | 50-60% |
Like de-duplication, data compression is also a strategy to reduce the size of data; however, both are entirely different approaches. While de-duplication focuses on eliminating redundant files and keeping only one instance, data compression aims to reduce the size of a file by removing redundant data within the file.
Data compression is accomplished by compression utilities, which make use of compression algorithms. How much a file can be compressed depends on the type of the file: for example, a 5MB document may be compressed to occupy just 2MB on disk, while a 5MB audio file may only be compressed to 4.5MB.
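This dependence on content is easy to demonstrate with Python's built-in zlib: highly repetitive text shrinks dramatically, while random (or already-compressed) data barely shrinks at all:

```python
import os
import zlib

def ratio(data: bytes) -> float:
    """Compressed size as a fraction of the original size."""
    return len(zlib.compress(data)) / len(data)

# repetitive, text-like content vs. incompressible random bytes of equal length
text_like = b"the quick brown fox jumps over the lazy dog " * 500
random_like = os.urandom(len(text_like))
# text_like compresses to a tiny fraction of its size;
# random_like stays at (or slightly above) its original size
```

Media files such as MP3 or JPEG behave like the random case because they are already compressed, which is why a 5MB audio file gains so little from further compression.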
For best data storage practice, compression should follow de-duplication, not the other way around. Compressing redundant data brings no real benefit, as the duplicate records are compressed too and the anticipated reduction in storage requirements may not be achieved.
Benefits of de-duplication
In today’s data driven world, de-duplication not only helps organizations save on allocating new resources (e.g. new storage media, hard-drives), it can also make their data transfers less time consuming, and bandwidth efficient, resulting in further reduction in cost. Organizations that deal with huge volumes of data (e.g. cloud storage services and platforms) can particularly benefit from de-duplication.
- Removal of redundant files/records results in efficient use of available storage capacity. De-duplication can postpone the need to allocate new storage servers, which leads to cost savings.
- Less time is required for data-related operations such as recovery and transfer.
- Lower storage requirements mean less money spent on storage media.
- Lower bandwidth consumption when transferring data over a network (e.g. the internet).
- The absence of redundancies leads to clean, clutter-free data. De-duplicated data is easier to maintain.
Risks of data de-duplication
The main problem associated with de-duplication is a hash conflict (collision), which occurs when two different data blocks generate the same hash key. A conflict corrupts the database and causes a failure when restoring a backup. The larger the database and the higher the frequency of changes, the greater the possibility of conflicts. The solution is to increase the hash space: the longer the hash keys, the less likely a conflict. Systems currently use 160-bit keys generated by the SHA-1 algorithm, which gives 2^160 ≈ 1.46 × 10^48 unique hash keys.
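The intuition that a larger hash space means fewer conflicts can be quantified with the birthday bound: among n random k-bit hashes, the collision probability is roughly n² / 2^(k+1). A quick sketch (the block counts here are illustrative):

```python
def collision_probability(n: float, bits: int) -> float:
    """Birthday-bound approximation of a collision among n random hashes."""
    return (n * n) / (2 ** (bits + 1))

# Even a trillion blocks hashed to 160 bits leaves a negligible collision risk,
# while a 32-bit key would be hopeless at that scale (the estimate saturates
# far above 1, i.e. a collision is essentially certain).
p160 = collision_probability(1e12, 160)
p32 = collision_probability(1e12, 32)
```

This is why moving from short checksums to 160-bit (or longer) cryptographic hashes makes accidental conflicts a practical non-issue, even for very large de-duplication databases.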
As the world turns more and more digital, data de-duplication has become more relevant than ever. The need for biometric data de-duplication is also set to increase as biometric-controlled smart devices and IoT devices proliferate. Combined with today's information environment, these devices will generate huge amounts of data every minute, and without de-duplication technology in place, this data can negatively impact storage.
Data duplication can take a significant toll on resources, especially when dealing with large volumes of data. In data-intensive services such as data centers and cloud storage, where huge volumes of data are stored, processed and transferred, duplication can have a huge impact on storage resources.
De-duplication of biometric data is also crucial, as this data is used to establish user identity and leveraged for identification and authentication. Now that biometrics is seen as the future of identification and authentication, data de-duplication will be vital in future biometric applications. Duplicates in biometric data can result in undesired and unpredictable consequences.