Back to the future -- how the Backup Deduplication Process works

by in Information Management & Governance

As digital data volumes double every year (and have grown even faster during the last 18 months!), it’s no surprise that managing data growth is a top priority for IT departments. Read also "Too much information: Living with Data Explosion fallout". Among many other backup functionalities, data deduplication is still one of the most important and fastest-growing storage optimization techniques to help address this. There are many solutions available in the market that offer various bells and whistles, so just how do you choose the right solution for your organization?

Even though deduplication has been around for a long time, it's worth looking back and refreshing our understanding of how backup deduplication capabilities can help address customers’ challenges.

How data deduplication works

In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, indexing of all data is still retained should that data ever be required. As a result of this deleting and indexing, deduplication is able to reduce the required storage capacity as only the unique data is stored.

The estimated amount of duplicate data that can typically be removed for a particular application or data type can be represented as follows:

| Application                   | Estimated percentage |
|-------------------------------|----------------------|
| PACS                          | 5%                   |
| Web and Microsoft Office data | 30%                  |
| Engineering data directories  | 35%                  |
| Software code archive         | 45%                  |
| Technical publications        | 52%                  |
| Database backup               | 70% or higher        |

In the table above, PACS (picture archiving and communication system, a medical imaging technology used mainly in healthcare organizations) stores X-rays and other medical images, which contain very little duplicate data. At the other end, we typically see databases that contain a lot of redundant data: their structure means that there will be many records with empty fields or the same data in the same fields.

Data deduplication compares chunks of information to detect duplicates and stores each unique data segment only once. To achieve this, a deduplication engine allocates a unique identifier to each chunk of data using a mathematical hash function. Once it has identified two chunks of data as identical, the system will replace the duplicate with a link to the original chunk.
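To make the idea concrete, here is a minimal sketch in Python of a hash-indexed chunk store. It is an illustration only, not how any particular product implements it: the `DedupStore` class, the `backup` and `restore` helpers, and the choice of SHA-256 as the hash function are all assumptions made for this example.

```python
import hashlib


class DedupStore:
    """Toy in-memory store: each unique chunk is kept once, keyed by its hash."""

    def __init__(self):
        self.chunks = {}  # hash digest -> chunk bytes (the unique segments)

    def put(self, chunk: bytes) -> str:
        # SHA-256 is an assumed choice of hash function for this sketch.
        digest = hashlib.sha256(chunk).hexdigest()
        # Keep the chunk only if this identifier has not been seen before.
        self.chunks.setdefault(digest, chunk)
        return digest  # the "link" that stands in for the chunk

    def get(self, digest: str) -> bytes:
        return self.chunks[digest]


def backup(store: DedupStore, chunks: list) -> list:
    """Return a recipe: the ordered list of links needed to rebuild the data."""
    return [store.put(c) for c in chunks]


def restore(store: DedupStore, recipe: list) -> bytes:
    """Follow the links in order to reassemble the original data."""
    return b"".join(store.get(d) for d in recipe)


store = DedupStore()
data = [b"alpha", b"beta", b"alpha", b"alpha", b"gamma"]
recipe = backup(store, data)
print(len(recipe), "chunks referenced,", len(store.chunks), "chunks stored")
```

The recipe still lists every chunk in order, which is the retained index, while the store holds each distinct chunk only once.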

There are two approaches to chunking. Fixed chunking breaks data into blocks of a fixed size. Variable chunking groups the data into blocks based on patterns in the data itself. The advantage of variable chunking is that it can still recognize duplicates when small changes have merely shifted the data from one backup to the next. This technique is the most commonly used today and leads on average to deduplication ratios of 20:1 or higher.
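The difference between the two approaches shows up when a single byte is inserted at the front of the data. The sketch below is a simplified illustration under stated assumptions: the function names, the 16-byte window, the divisor of 64 and the use of CRC32 are all choices made for this example. Production engines typically use a rolling hash and enforce minimum and maximum chunk sizes.

```python
import random
import zlib


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Split data into blocks of a fixed size (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 16, divisor: int = 64) -> list:
    """Cut wherever a hash of the last `window` bytes hits a chosen pattern.

    Boundaries depend only on nearby content, not on absolute offsets.
    The hash is recomputed at every position for clarity, not speed.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) % divisor == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing data after the last boundary
    return chunks


def shared(a: list, b: list) -> int:
    """Number of distinct chunks that two chunk lists have in common."""
    return len(set(a) & set(b))


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
shifted = b"X" + original  # one byte inserted at the front

fixed_shared = shared(fixed_chunks(original), fixed_chunks(shifted))
variable_shared = shared(variable_chunks(original), variable_chunks(shifted))
print("shared chunks, fixed:", fixed_shared, "variable:", variable_shared)
```

With fixed blocks, the inserted byte shifts every block boundary, so the second backup no longer lines up with the first. With content-defined boundaries, only the first chunk is affected, and every later chunk is found again.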

Key elements for data deduplication

Deduplication involves a combination of three elements: the deduplication engine, the deduplication store, and backup agents.

The deduplication engine is where the majority of processing takes place. It manages the logic and processing of the backup stream by calculating segments and hash values, identifying unique and repeated segments, and maintaining the hash lookup table.

The deduplication store is the disk storage location managed by the deduplication engine. It stores the unique (deduplicated) segments, and is often physically coupled with the deduplication engine.

Deduplication-enabled backup agents (for example, media agents, disk agents, and application agents) manage some of the deduplication processes. Agents can be deployed separately from the deduplication engine to offload some of the performance impact. Agents can perform tasks such as segmenting the data, calculating the hash value of segments, and sending new data to the engine and the store. The deduplication agent talks to the deduplication engine to determine which segments are unique.
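The division of labour between agent and engine can be sketched roughly as follows. The `Engine` and `Agent` classes, their method names and the tiny 4-byte segment size are hypothetical and chosen only to keep the example readable. They do not represent the Data Protector API.

```python
import hashlib


class Engine:
    """Hypothetical engine: owns the hash lookup table and the store."""

    def __init__(self):
        self.store = {}  # hash -> segment; stands in for the deduplication store

    def unknown_hashes(self, hashes) -> set:
        # Identify which of the offered segments are not stored yet.
        return {h for h in hashes if h not in self.store}

    def receive(self, segments: dict) -> None:
        self.store.update(segments)


class Agent:
    """Hypothetical agent: segments and hashes the data close to the source."""

    def __init__(self, engine: Engine, segment_size: int = 4):
        self.engine = engine
        self.segment_size = segment_size

    def backup(self, data: bytes) -> int:
        size = self.segment_size
        segments = [data[i:i + size] for i in range(0, len(data), size)]
        hashed = {hashlib.sha256(s).hexdigest(): s for s in segments}
        # Ask the engine which hashes it has never seen, then send only those.
        missing = self.engine.unknown_hashes(hashed.keys())
        new_segments = {h: hashed[h] for h in missing}
        self.engine.receive(new_segments)
        return sum(len(s) for s in new_segments.values())  # bytes actually sent


engine = Engine()
agent = Agent(engine)
first = agent.backup(b"AAAABBBBAAAACCCC")  # three distinct segments
second = agent.backup(b"AAAABBBBDDDD")     # only DDDD is new
print(first, second)
```

In this sketch the agent does the segmenting and hashing, while the engine only answers lookups and stores new segments. That is the offloading effect described above.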

Deduplication can take place at the application source, backup server, or at the target device.

Application source deduplication removes redundant data before it is transmitted to the backup target. Source deduplication reduces storage and bandwidth requirements. However, it can be slower than target deduplication and increase the workload on servers.

Backup server deduplication shifts the deduplication execution onto a separate dedicated server to maximize the performance of the target device and minimize the impact on the application source where the application is running.

Target deduplication removes redundant data from a backup after it has been transmitted to a hardware device. This method can use any backup application the device supports, and the deduplication process is transparent to the backup application. Backup applications can also deploy and manage target deduplication onto a variety of hardware targets such as disk arrays, tape libraries, and network-attached storage devices. Target deduplication reduces the volume of storage required for the backup, but does not reduce the amount of data that must be sent across a LAN or WAN during the backup.

With target deduplication, backup agents are not aware of the deduplication process. In backup server or application source deduplication, backup agents have deduplication technology built in and are deployed onto the backup server or application server as appropriate.

Backed-up data can be transferred in a variety of ways. In a traditional transfer, all backup data is sent. In a deduplicated backup, the backup data stream only contains the unique segments and references to duplicate segments. This reduces the network bandwidth required.

With replicated deduplication, the unique backup data is sent to a replication target, which enables efficient replication over low-bandwidth links.

New Data Protector 11.00 Deduplication Store

More than 10 years ago, Data Protector released its first deduplication functionality. Since then, the data explosion has prompted organizations to seek technological solutions to optimize the utilization of their existing resources while reducing the costs and risks of their data protection strategy.

Data Protector has recently undergone a period of improving stability, simplifying upgrades, and enhancing security. Data Protector 11.00 marks the start of a shift in development focus towards delivering features and functionality.

With the release of Micro Focus Data Protector 11.00, the biggest change, and a significant milestone for us, is the introduction of the new deduplication engine, which will significantly expand the deduplication functionality of Data Protector and scale it up to petabytes (PB).

It complements the current integration capabilities with HPE StoreOnce and Dell EMC Data Domain appliances, providing customers with much more flexibility when choosing their deduplication strategy.

The new software deduplication engine provides customers with:

  • Enterprise features normally only found in backup appliances
  • Easy setup and configuration, following existing workflows
  • Scalability from 100 GB to many hundreds of TB
  • Source and target side deduplication
  • Support for Linux and Windows
  • Support for multiple folders and mount points
  • Hosting on physical or virtual machines, on-premises or in the cloud, in data centers or remote offices

A word of caution

People may look at deduplication with the approach of “That is cool! Let's buy less storage after implementing deduplication”, but it does not work that way. Deduplication is a cumulative process that can take time to yield impressive deduplication ratios. Initially, the amount of storage customers buy has to be sized to reflect the existing backup tape rotation strategy and the expected data change rate within the customer's environment.
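A rough back-of-the-envelope model shows why the ratio builds up over time. It is a deliberately simplified sketch: it assumes every backup is a full backup of constant size, that a fixed fraction of the data changes between backups, and it ignores compression and retention expiry. The 2% change rate is an illustrative assumption, not a measured figure.

```python
def dedup_ratio(num_backups: int, change_rate: float) -> float:
    """Logical data protected divided by physical data stored.

    Simplified model: every backup is a full backup of constant size, and
    only `change_rate` of it is new compared with the previous backup.
    """
    logical = num_backups                           # in units of one full backup
    physical = 1 + (num_backups - 1) * change_rate  # first full plus the changes
    return logical / physical


def space_saved(ratio: float) -> float:
    """Fraction of storage saved at a given ratio, e.g. 20:1 saves 0.95."""
    return 1 - 1 / ratio


for n in (1, 4, 12, 52):
    print(n, "backups ->", round(dedup_ratio(n, change_rate=0.02), 1), ": 1")
```

The first backup yields no savings at all in this model, which is why the initial storage purchase cannot be sized from the ratio expected months later.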

Deduplication has become more and more popular because, as data growth soars, the cost of storing that data also increases, especially for backup data on disk. Deduplication technologies help to reduce the cost of storing multiple backups on disk.

What data deduplication provides to customers is:

  • The ability to store dramatically more data online (on disk-based storage).
  • Improved Recovery Point Objectives (RPOs): older backups can be kept for longer periods of time, so data can be recovered from them to better meet Service Level Agreements (SLAs).
  • The ability to automate the disaster recovery process by performing site-to-site replication at a lower cost. Because deduplication knows what data has changed at a block or byte level, replication becomes more intelligent and transfers only the changed data as opposed to the complete data set. This saves time and replication bandwidth and is one of the most attractive propositions that deduplication offers.

But deduplication shouldn't be the only approach, nor should it replace your investment in other technologies such as physical tapes. A good global backup strategy must support the 3-2-1 rule (three copies of the data, on two different types of media, with one copy off-site). This rule helps you deal with cyberattacks by using different formats, such as tape, in your backup strategy, or even by keeping a copy in a different format and/or offline.

Want to learn more about it? Download a 90-day trial license and access the recorded Tech preview.

We’d love to hear your thoughts on this blog. Comment below.

The Micro Focus IM&G team

Know your data | empower your people | drive your future
