Copyright © 2010 EMC Corporation. Do not Copy - All Rights Reserved.
The objectives for this module are shown here. Please take a moment to read them.
EMC Data Domain - 1
The objectives for this lesson are shown here. Please take a moment to review them.
Shown in the slide is a Data Domain deployment. A Data Domain system is a storage system that deduplicates data on arrival. It has shelves of disks and a controller. It is optimized first for backup and second for archive applications, and it supports most of the industry-leading applications. Data Domain integrates easily with an existing backup or archive environment. This includes not only EMC’s own offering, NetWorker, but also Symantec, CommVault, and so on. Data can be transferred into the Data Domain storage system over either Ethernet or Fibre Channel. Over Ethernet, it can use NAS protocols such as NFS or CIFS, or optimized protocols such as OpenStorage, a custom API used with Symantec NetBackup. Data is deduplicated as it is stored, and it can then be replicated for disaster recovery. Replication sends only the compressed, deduplicated unique data segments that have been filtered out through the write process on the target tier.
A typical backup environment without Data Domain involves writing backup data to tape. To protect against disasters, the tapes must be shipped offsite. This is an expensive and labor-intensive task.
When Data Domain is implemented in a backup environment, data is written to disk instead of tape. Disk provides faster performance than tape and has other characteristics that provide protection. Data Domain deduplicates data, which reduces the size of the data footprint. Instead of physically shipping tapes to remote warehouses, data can be transferred across the network to a remote Data Domain system.
A Data Domain appliance is a controller with its own disk array. The controller handles deduplication and the other processing the system requires. It runs its own Data Domain operating system. Dual-controller configurations are available to provide redundancy.
Shown in the slide are the models in the Data Domain family and details of their specifications.
Refer to the following link for the latest information on Data Domain models: http://www.datadomain.com/images/products/Appliances-Table.jpg
Components under high mechanical or electrical stress are protected by N+1 redundancy. This means that each such component has at least one extra, independent backup component, which is able to take over should a primary component fail. As shown in the picture, extra fans and power supplies are included. RAID 6 protects against dual disk drive failures.
One of the most conventional approaches to deduplication competing with Data Domain is known as post-process deduplication. In this architecture, data is stored to disk before deduplication; after it is stored, it is read back internally, deduplicated, and written again to a different area. Although this approach may sound appealing, seeming as if it would allow for faster backups and use fewer resources, it actually creates problems. First, more disk is needed to store both the raw data temporarily and the deduplicated data. Post-process deduplication also has an impact on speed, because post-process deduplication systems are usually spindle-bound. There are typically three or four times more disks in a post-process configuration than in a Data Domain deployment. An inline approach is also much simpler. If all data is filtered before it is stored to disk, then the system behaves like a regular storage system: it just writes data and it just reads data. There is no separate administration involved in managing multiple pools, some with deduplication and some with regular storage, or in managing the boundary conditions between them. Less administration in a storage system is always better. So, by being simpler and smaller to provision, an inline approach, and especially a CPU-centric inline approach, will always be more attractive.
Within a Data Domain system, there are several levels of logical data abstraction above the physical disk storage. Protocol namespaces, such as virtual tape libraries, EMC Data Domain Boost, and CIFS/NFS shares, act as an external interface to applications. A single Data Domain system may use any combination of these for storing and accessing data. Files and directories for the namespaces are stored in the Data Domain file system. Non-CIFS/NFS data is stored under special directories. The unique segment collection is a collection of deduplicated data. It is here that sub-file objects of about 8 KB are identified and deduplicated, so that identical segments are stored only once. The last layer is the physical disk. Deduplicated data is stored on SATA disk drives and is RAID 6 protected.
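The relationship between files and the unique segment collection can be illustrated with a minimal Python sketch. It assumes fixed-size 8 KB segments and SHA-1 fingerprints purely for illustration; the notes say only that segments are "about 8 KB". The class and method names are hypothetical and are not part of the Data Domain operating system.

```python
import hashlib

SEGMENT_SIZE = 8 * 1024  # assumption: fixed 8 KB segments, for simplicity


class SegmentStore:
    """Toy unique-segment collection: each distinct segment is kept once."""

    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes
        self.files = {}     # file name -> ordered list of fingerprints

    def write_file(self, name, data):
        recipe = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fingerprint = hashlib.sha1(segment).hexdigest()
            # Store the segment only if it has not been seen before.
            self.segments.setdefault(fingerprint, segment)
            recipe.append(fingerprint)
        self.files[name] = recipe

    def read_file(self, name):
        # Rebuild the file by following its list of fingerprints.
        return b"".join(self.segments[fp] for fp in self.files[name])

    def stored_bytes(self):
        return sum(len(segment) for segment in self.segments.values())


store = SegmentStore()
monday = b"A" * SEGMENT_SIZE + b"B" * SEGMENT_SIZE
tuesday = b"A" * SEGMENT_SIZE + b"C" * SEGMENT_SIZE  # one segment changed
store.write_file("monday.bak", monday)
store.write_file("tuesday.bak", tuesday)
```

In this sketch the two files total four segments, but only three are stored, because the unchanged first segment is shared by both files.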
Stream-Informed Segment Layout (SISL) is the way that Data Domain approaches deduplication, and it makes deduplication highly efficient. Instead of being disk-based, SISL uses a CPU-centric method: it reduces the number of times that disks need to be accessed. In order to identify segments quickly, data is stored along with a “fingerprint” that represents each data segment. The Summary Vector is a data structure held in RAM that is used to identify unique segments of data. Almost all segments are identified through the Summary Vector, which saves the system from doing a lookup in the on-disk index. The Data Domain system stores neighboring segments of data together in units called Segment Localities, which are held close together on disk. This way, consecutive data segments can be read in a single disk access.
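One way to picture how an in-memory structure can avoid disk lookups is a Bloom filter. Modeling the Summary Vector this way is an assumption made for illustration; the notes say only that it is a data structure held in RAM. The sketch below uses hypothetical names and sizes. A "no" answer from the filter is definitive, so a segment that has never been seen can be recognized as new without consulting an on-disk index.

```python
import hashlib


class SummaryVector:
    """Toy in-memory bit vector, modeled as a Bloom filter (an assumption)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fingerprint):
        # Derive several bit positions from one fingerprint.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fingerprint):
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint):
        # False means "definitely new"; True means "possibly stored already".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(fingerprint)
        )


vector = SummaryVector()
known = hashlib.sha1(b"segment already on disk").digest()
vector.add(known)
```

A fingerprint that has been added always answers "possibly stored", while an empty vector answers "definitely new" for every fingerprint, which is the case that skips the index lookup.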
This slide shows how data is written to the Data Domain system using the SISL process. First, data is stored in non-volatile RAM. Here it is broken into segments, and a fingerprint is created for each segment. The fingerprint for each segment is compared to the Summary Vector. If there are no matches, the segment and consecutive segments are compared to multiple segments on disk. If the segment is unique, it is stored on disk.
Data is compressed in order to further reduce the capacity needed. This is done during the write process. Compression options are shown on the slide.
Data Domain is designed using the Data Invulnerability Architecture (DIA). DIA provides data integrity and recoverability within the Data Domain system. Since data is deduplicated, a single segment of data may be used across multiple files. If this segment were to become corrupted, multiple files could become corrupt. This makes it crucial to ensure that data is intact. There are four aspects of DIA. End-to-end verification is the process of ensuring that data has been written correctly: after data is written to the system, it is checked against the original data. Fault avoidance and containment ensures that data already on disk is not overwritten or corrupted. This is accomplished using a special file system that does not overwrite old data. Continuous fault detection and healing is a proactive process that continuously watches for failures. RAID 6 and checksums are used to implement this. Snapshots are used to provide file system recoverability, which protects against software and hardware failure.
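The idea behind end-to-end verification, reading data back after the write and comparing it with the original, can be sketched as follows. This is a minimal illustration under stated assumptions, not the DIA implementation: the "disk" is a Python dictionary, SHA-256 stands in for whatever checksum the system uses, and `verified_write` and `CorruptingDisk` are hypothetical names.

```python
import hashlib


def verified_write(disk, key, data):
    """Write data, read it back, and compare checksums before acknowledging."""
    expected = hashlib.sha256(data).hexdigest()
    disk[key] = bytes(data)
    actual = hashlib.sha256(disk[key]).hexdigest()
    if actual != expected:
        raise IOError(f"verification failed for {key!r}")
    return expected


class CorruptingDisk(dict):
    """Hypothetical faulty device that damages the first byte of every write."""

    def __setitem__(self, key, value):
        super().__setitem__(key, b"\x00" + value[1:])


disk = {}
checksum = verified_write(disk, "segment-1", b"backup data")
```

Writing the same data to a `CorruptingDisk` raises an error instead of silently acknowledging a damaged segment.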
A snapshot is a read-only copy of backup data. A snapshot is useful for saving a copy of a directory at a specific point in time, so that it can later be used as a restore point. The snapshot feature creates an image of the Data Domain file system. This protects against both human and system errors.
As an appliance, the Data Domain system automates all routine maintenance tasks. One of the most important automated processes is the file system cleaning operation, which must be scheduled to reclaim physical storage occupied by deleted objects. When application software expires backup or archive images, they are deleted in the sense that they are no longer accessible or available for recovery from the application. However, the images still occupy physical storage. Only a clean operation reclaims the segments used by files that are deleted and are no longer referenced. Cleaning can require a lot of system resources while it is running. Mechanisms are in place to automatically lower the priority assigned to cleaning tasks in favor of more time-critical processing tasks. Cleaning schedules are adjustable. By default, cleaning is scheduled to start every Tuesday at 6:00 a.m.
Cleaning provides the opportunity to reorganize the data to improve the speed and efficiency of deduplication. Data invulnerability requires that data is only ever written into new containers, and this requirement also applies to the cleaning process. Copy-forward segments are segments that, for read efficiency, should be stored adjacent to each other, and so they are copied forward together into a single container. Dead segments are dead because the files that referred to them have all been deleted and the pointers have been removed. Dead segments are not allowed to be rewritten with new data, since this could put valid data at risk of corruption. Instead, valid segments are copied forward into free containers to group the remaining valid segments together. When the data is safe and reorganized, the original containers are returned to the available disk space.
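The copy-forward behavior described above can be sketched in a few lines of Python. This is an illustrative model only: containers are represented as dictionaries, the container capacity is an arbitrary small number, and the `clean` function is a hypothetical name. The point it shows is that live segments are copied into new containers and the old containers are never modified in place.

```python
def clean(containers, live_fingerprints, container_capacity=4):
    """Copy live segments forward into new containers; never overwrite in place.

    containers: list of dicts mapping fingerprint -> segment bytes.
    Returns a new list of containers holding only the live segments.
    """
    new_containers = []
    current = {}
    for container in containers:
        for fingerprint, segment in container.items():
            if fingerprint not in live_fingerprints:
                continue  # dead segment: no file references it any more
            current[fingerprint] = segment
            if len(current) == container_capacity:
                new_containers.append(current)
                current = {}
    if current:
        new_containers.append(current)
    return new_containers


old_containers = [{"a": b"1", "b": b"2"}, {"c": b"3", "d": b"4"}]
still_referenced = {"a", "d"}
cleaned = clean(old_containers, still_referenced, container_capacity=2)
```

Here two half-dead containers are consolidated into one full container, and the two original containers can then be returned to free space as whole units.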
Administrators need to understand how to configure and monitor the reports and logs for error conditions. Data Domain systems provide access to the following types of reports and logs that provide information about error conditions: Autosupports and Alerts can be sent by email. Autosupport sends a daily email to Data Domain Support containing various log files and other system information. This allows Data Domain Support to quickly be informed of any issues that may arise in the Data Domain system. Syslog can be configured to publish logs, alerts, and messages. SNMP can also be configured to send a subset of alerts as traps to third-party SNMP managers.
The autosupport email list is used in two ways: send a daily detailed report on a specified schedule and send a daily alerts summary about non-critical hardware situations and disk space usage numbers that should be addressed soon. The autosupport command can also be used to send the output of a specific command or the contents of a file to the distribution list. By default, Data Domain systems send daily autosupport reports to Data Domain tech-support via email using SMTP. The autosupport report contains system configuration information, alerts summaries, performance statistics and system messages. By default, Data Domain systems are also configured to send daily alerts to the autosupport list that notify Data Domain tech-support about non-critical error messages or warnings about problems on the system that should be fixed as soon as possible. Customers have the option to configure who receives autosupports and alerts and the time they are sent. For how to configure autosupports and alerts, see the “autosupport” and “alerts” command descriptions and options in the DD OS Command Reference Guide.
Alerts are sent with either a Warning or a Critical severity. For example, a Warning alert is sent when a fan fails. When alerted, customer support contacts the owner to arrange a replacement. Warning alerts are sent when a non-critical system problem is detected. This type of problem should be fixed as soon as possible. The warning is sent to the autosupport email list as soon as the problem occurs. Warnings are also included in the Daily Alert Summary and with the Autosupport Summary. Critical alerts are sent when a severe problem occurs that should be fixed immediately. They are sent to the alerts email list as soon as the problem occurs.
The objective for this lesson is shown here. Please take a moment to review it.
Replication is used to protect against disaster. This is accomplished by sending data from one Data Domain to another over the network. In a Data Domain system, only unique data is replicated. This is made possible because of the deduplication process. This saves enormous amounts of bandwidth since only a small portion of data stored will be changed. Since not as much data is transferred, the replication window is reduced. There are three types of replication. These will be discussed on the following slides.
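The effect of replicating only unique data on the replication window can be estimated with simple arithmetic. All of the figures below (backup size, fraction of unique data, and link speed) are hypothetical values chosen for illustration and do not come from Data Domain specifications.

```python
def transfer_hours(num_bytes, link_megabits_per_second):
    """Time to move num_bytes over a link, ignoring protocol overhead."""
    bits = num_bytes * 8
    seconds = bits / (link_megabits_per_second * 1_000_000)
    return seconds / 3600


logical_backup_bytes = 1_000_000_000_000  # hypothetical: 1 TB nightly backup
unique_fraction = 0.02                    # hypothetical: 2% is new, unique data
link_mbps = 100                           # hypothetical: 100 Mb/s WAN link

full_copy_hours = transfer_hours(logical_backup_bytes, link_mbps)
deduplicated_hours = transfer_hours(logical_backup_bytes * unique_fraction, link_mbps)

print(f"full copy: {full_copy_hours:.1f} h, unique data only: {deduplicated_hours:.2f} h")
```

Under these assumed numbers, sending the full backup would take roughly 22 hours, while sending only the unique data would take under half an hour.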
Collection replication is the transfer of all backup data. The entire collection is replicated, along with all backup and recovery functions. Data at the target is accessible immediately. In addition to data, user accounts and passwords are replicated, as are snapshots. Only a one-to-one configuration is allowed.
Directory replication is the transfer of individual directories on the Data Domain system. A Data Domain system can be a source or destination for multiple directories and can also be a source for some directories and a destination for others. Many topologies are supported with directory replication. Normal backup and restore operations are still able to be performed during replication.
Pool replication is a type of directory replication that replicates directories containing VTL tape cartridges. Virtual tape libraries use a structure called storage pools within the Data Domain system. The data that is written to virtual tape can therefore be replicated. Only one VTL license is required for the source.
One way to send data to the Data Domain system is through CIFS or NFS shares. CIFS can be used by Windows clients, while NFS is used by UNIX-based operating systems. A directory within the /backup directory is shared out to the client. When data is sent to the shared directory, it is deduplicated and stored automatically.
OpenStorage (OST) server software, which is a feature of Symantec’s Veritas NetBackup, integrates NetBackup with Data Domain system disk backup devices. It allows NetBackup media servers to communicate with disk devices without emulating tape. To enable OST, a plugin must be installed on the NetBackup media server so that it can integrate with Data Domain. The Data Domain system then creates Logical Storage Units, which are used as NetBackup storage servers.
Using the Data Domain VTL feature, backup applications can connect to and manage a Data Domain system as if it were a tape library. In this configuration, Data Domain creates virtual tapes that act as real SCSI tape drives. Tapes and pools can be replicated to other Data Domain systems for disaster recovery. Tapes can also be locked with retention to prevent premature deletion. The VTL feature can be used simultaneously with the other interfaces.
Data Domain Boost is an option that distributes part of the deduplication process out of the Data Domain system and onto the backup server. This makes the backup network more efficient, makes Data Domain systems 50% faster, and makes the aggregate system more manageable. It works across the entire Data Domain product line. As shown in the diagram on the slide, segmentation, identification, and compression are handled on the backup server instead of on the Data Domain system. This means that only the unique segments are sent over the network.
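The division of work described above can be sketched as a small client and target pair. This is an illustration of the general idea only and does not use the real Data Domain Boost API: `DedupTarget` and `client_backup` are hypothetical names, and fixed 8 KB segments, SHA-1 fingerprints, and zlib compression are assumptions chosen to keep the example short.

```python
import hashlib
import zlib

SEGMENT_SIZE = 8 * 1024  # assumption: fixed 8 KB segments


class DedupTarget:
    """Stand-in for the storage system: keeps compressed unique segments."""

    def __init__(self):
        self.segments = {}  # fingerprint -> compressed segment

    def missing(self, fingerprints):
        # Tell the client which fingerprints are not stored yet.
        return [fp for fp in fingerprints if fp not in self.segments]

    def store(self, fingerprint, compressed_segment):
        self.segments[fingerprint] = compressed_segment


def client_backup(data, target):
    """Segment, fingerprint, and compress on the client; send only new segments."""
    segments = {}
    for offset in range(0, len(data), SEGMENT_SIZE):
        segment = data[offset:offset + SEGMENT_SIZE]
        segments[hashlib.sha1(segment).hexdigest()] = segment
    bytes_sent = 0
    for fingerprint in target.missing(list(segments)):
        payload = zlib.compress(segments[fingerprint])
        target.store(fingerprint, payload)
        bytes_sent += len(payload)
    return bytes_sent


target = DedupTarget()
data = b"".join(bytes([i]) * SEGMENT_SIZE for i in range(4))
first_run_bytes = client_backup(data, target)
second_run_bytes = client_backup(data, target)  # nothing new to send
```

On the second run every fingerprint is already known to the target, so no segment data crosses the network at all.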
The Data Domain Retention Lock licensed software feature enables organizations to protect records in non-writeable and non-erasable formats for a specified length of time, up to 70 years. This means that although the protected data can be read, it cannot be modified or deleted until the retention period has expired. This protects against accidents and user errors as well as malicious activity. For example, a Data Domain system may be used to store email records. A malicious person may attempt to delete some incriminating emails, but would be unable to do so if the retention period has not expired. Retention minimums and maximums can be set globally for the Data Domain system. For example, it can be configured so that all files must have a retention period of at least 5 years. Retention values can be set on a file-by-file basis.
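The retention rules described above (global minimum and maximum, a per-file retention period, and no modification or deletion before expiry) can be modeled with a short sketch. The class name, method names, and the use of plain year numbers as timestamps are all hypothetical simplifications and do not reflect how the licensed feature is actually configured.

```python
class RetentionLockedStore:
    """Toy store that refuses changes to a file until its retention expires."""

    def __init__(self, min_retention, max_retention):
        self.min_retention = min_retention
        self.max_retention = max_retention
        self.files = {}  # name -> (data, expiry time)

    def write(self, name, data, retention, now):
        if not self.min_retention <= retention <= self.max_retention:
            raise ValueError("retention period outside the configured limits")
        if name in self.files and now < self.files[name][1]:
            raise PermissionError(f"{name} is locked and cannot be modified")
        self.files[name] = (data, now + retention)

    def read(self, name):
        # Reading is always allowed, even while the file is locked.
        return self.files[name][0]

    def delete(self, name, now):
        if now < self.files[name][1]:
            raise PermissionError(f"{name} is locked and cannot be deleted")
        del self.files[name]


# Times are expressed in years for readability (an illustrative choice).
archive = RetentionLockedStore(min_retention=5, max_retention=70)
archive.write("mail-2010.pst", b"email records", retention=7, now=2010)
```

With these settings, an attempt to delete `mail-2010.pst` in 2012 is refused, while the same call in 2017 succeeds.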
With the sanitize function, deleted files can be overwritten using DoD/NIST-compliant algorithms and procedures. No complex setup or disruption is needed. Sanitizing is the electronic equivalent of data shredding; it removes any trace of deleted files. This feature is designed primarily to support the needs of organizations that are required to remove and destroy confidential data if it was accidentally written to an unapproved system, or to delete data that is no longer required.
See the Electronic Data Shredding Technical brief at http://www.datadomain.com/pdf/TechBrief-ElectronicShredding.pdf
With the Data Domain Encryption licensed software option enabled, all incoming data is encrypted inline before it is written to disk. This is also referred to as “encryption at rest.” It improves security by preventing data from being read directly from disk without first being decrypted by the system. Data Domain implements software-based encryption, so no additional hardware is required. Encryption is transparent to the access protocols. Because of this, no configuration changes are needed in the rest of the environment to deploy encryption at rest.
The objective for this lesson is shown here. Please take a moment to review it.
Data Domain systems can replace both the large staging disk and the tape system. Replication across the WAN is built into the Data Domain systems instead of requiring a separately managed function of the primary storage. Configuration of the backup software such as the Oracle Recovery Manager (RMAN) does not need to be changed; simply point the backup application at the Data Domain storage as a replacement for the previous NFS, CIFS, or VTL device. Copies of the data needed for longer term archive or compliance can continue to be written to tape either onsite or at the offsite disaster recovery site.
Backup and recovery for a Microsoft Exchange Server environment is a mission critical function that benefits from all of the advantages of replacing tape based systems with Data Domain appliances. In addition to being storage for the typical Exchange backups, Data Domain systems can also be used as an efficient storage repository for email archiving applications. Instead of email archives being stored on a separate system, the archives can be written to the same Data Domain system that is storing the Exchange database backups. The significant amount of duplicate data found in both the Exchange backups and in email archive files is deduplicated across both data sets, to reduce the storage footprint even more. Without Data Domain, different interface or file protocol support needs of the Exchange backup server and the email archive server may have prevented these from backing up to the same device. Being able to use CIFS, NFS, and VTL simultaneously to access a single Data Domain system opens up many new possibilities for combining data from different sources to take advantage of the savings from deduplication.
VMware sites tend to create more data to manage and protect than their physical counterparts. Making it simpler to multiply servers tends to increase the storage footprints. The operational flexibility offered by being able to have multiple copies and variants of a virtual image with various configurations comes at the expense of needing to buy more storage to back up and protect these images. Since many of the elements are the same between virtual images, they tend to deduplicate very well when stored on Data Domain systems. Deploying a system at the disaster recovery site allows for replication of critical VM images that can be kept up to date and ready to assume operation immediately in a disaster. Data Domain systems are attached to the high capacity backbone network used for storing and moving the VM images. Installation and configuration is similar whether the system is being used with VMware infrastructure, third party enterprise backup software, or specialized VMware backup applications.
In this example, a nearline implementation is used to handle some version control software that is using a Data Domain system as storage. The software tracks changes to documents as they are being updated. Since file differences are usually minor, the opportunity for deduplication is large. Data does not need to be accessed frequently, but needs to be immediately available for the times that it is accessed.
Data Domain is also useful in an archive situation. This example stores mostly static files. Files are not read back frequently but access to files needs to be immediate. This example uses a CIFS share to implement the solution.