It seems that everyone I know has a story to tell about the time they lost work when their computer died. The more dramatic versions of these stories typically involve tight deadlines for projects with big bucks on the line. And the usual epilogue to these stories is how they will never ever let this happen again. Backing up data for these folks is now a fanatical obsession.
Fortunately, we live in an age where storage is relatively cheap and plentiful, and systems are available to automate backups. So it’s easy to set things up so that you never have to worry about losing data.
In business-speak, disaster recovery is defined as the set of policies, procedures, and systems to enable the continuation of critical business functions. Any outage of the production systems may result in disruption of a business and cause financial loss. To mitigate the risk of such outages, companies can design and deploy disaster recovery systems to minimize data loss and downtime.
In general, operational risk is defined as the risk of losing value when bad things happen – earthquakes, hurricanes, attacks by killer bees, etc.
Operational risk is managed by keeping losses within some level of risk tolerance which is determined by balancing the costs of improvement against the expected benefits. Having redundant components or full system replication is a means to mitigate risks of failures.
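To make the cost-versus-benefit balance concrete, here is a minimal sketch of the arithmetic. The function names and all of the figures are hypothetical, chosen only for illustration; a real risk assessment would use your own estimates.

```python
def expected_annual_loss(probability_per_year: float, loss_per_event: float) -> float:
    """Expected yearly loss from one kind of outage."""
    return probability_per_year * loss_per_event


def mitigation_is_worthwhile(
    probability_before: float,
    probability_after: float,
    loss_per_event: float,
    annual_cost: float,
) -> bool:
    """True when the reduction in expected loss exceeds the yearly cost of the mitigation."""
    benefit = expected_annual_loss(probability_before, loss_per_event) - expected_annual_loss(
        probability_after, loss_per_event
    )
    return benefit > annual_cost


# Hypothetical numbers: an outage that costs 500,000 and has a 10% yearly chance,
# reduced to 1% by a replicated system that costs 20,000 per year to run.
print(mitigation_is_worthwhile(0.10, 0.01, 500_000, 20_000))
```

With these made-up numbers the replication pays for itself; raise the annual cost far enough and it no longer does.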
This table shows common operational risks, with examples and possible solutions to mitigate risks.
|Risk||Example||Mitigation|
|Hardware/software failure||Server crash||Redundant system components|
|Localized outage||Loss of power to a server room||Local system replication|
|Site outage||Hurricane||Regional system replication|
A system is always in one of two modes: primary or backup. Only one of the systems can be in the primary mode at any time. Under normal operation, the initial active system is running in primary mode and updates are sent to another system (or systems) operating in backup mode. This allows the backup system to be ready to assume the primary mode in case of an emergency.
When a disaster is declared, a system that was previously operating in backup mode starts operating in primary mode. This event is called failover. When the failed system is restored, it then acts as the backup to the newly designated primary system. This event is called failback. Note that the primary and backup roles are interchangeable between the systems.
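The role swapping described above can be sketched as a tiny state machine. This is only an illustration of the terminology as used in this post; the class and method names are made up and do not correspond to any Avid API.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    PRIMARY = "primary"
    BACKUP = "backup"


@dataclass
class System:
    name: str
    mode: Mode
    healthy: bool = True


class DRPair:
    """Two systems, exactly one of which is in primary mode at any time."""

    def __init__(self, a: System, b: System) -> None:
        if {a.mode, b.mode} != {Mode.PRIMARY, Mode.BACKUP}:
            raise ValueError("exactly one system must be primary")
        self.systems = [a, b]

    @property
    def primary(self) -> System:
        return next(s for s in self.systems if s.mode is Mode.PRIMARY)

    @property
    def backup(self) -> System:
        return next(s for s in self.systems if s.mode is Mode.BACKUP)

    def failover(self) -> System:
        """Disaster declared: the backup system assumes the primary mode."""
        failed, survivor = self.primary, self.backup
        if not survivor.healthy:
            raise RuntimeError("no healthy backup system to fail over to")
        failed.healthy = False
        failed.mode = Mode.BACKUP
        survivor.mode = Mode.PRIMARY
        return survivor

    def failback(self, restored: System) -> None:
        """The restored system rejoins as the backup to the new primary."""
        if restored.mode is not Mode.BACKUP:
            raise ValueError("only the failed system can be restored")
        restored.healthy = True


pair = DRPair(System("A", Mode.PRIMARY), System("A'", Mode.BACKUP))
pair.failover()
print(pair.primary.name)  # the former backup is now primary
```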
DR System Configurations
There are multiple server configurations that can be used in Disaster Recovery systems. The diagrams below show typical configurations. Common to all configurations, system A is continually backed up to system A’. If system A fails, clients can immediately connect to system A’ while system A is restored.
Active / Passive
The Active / Passive configuration is the most vanilla of DR configurations. Server System A is continually backed up to System A’. System A’ is only brought online when System A fails, in which case the clients would connect to System A’.
In this configuration, the A’ system is effectively a “warm standby” system that will only be used in the case of a disaster.
Active / Active
In the Active / Active configuration both Systems A and A’ are used by clients and are continuously synchronized with each other. If either System A or A’ fails, then the clients of the failed system switch over to connect to the working server.
Active / Active configuration has an added advantage in that both systems can be used simultaneously. However, it may have complications such as conflicts if the same data is changed by different people in both the A and A’ systems. Systems that allow for Active / Active configurations typically allow setting policies for conflict resolution.
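As a rough illustration of what a conflict resolution policy does, here is a minimal sketch. The two policies shown ("newest" and "prefer_a") are hypothetical examples; the policies an actual product offers will differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    content_hash: str
    modified: float  # seconds since the epoch


def resolve(
    base: Optional[Version], on_a: Version, on_b: Version, policy: str = "newest"
) -> str:
    """Return "a", "b", or "same" to say which copy should win.

    `base` is the last version both systems agreed on (None if unknown).
    """
    if on_a.content_hash == on_b.content_hash:
        return "same"
    a_changed = base is None or on_a.content_hash != base.content_hash
    b_changed = base is None or on_b.content_hash != base.content_hash
    if a_changed and not b_changed:
        return "a"  # only A was edited, so there is no conflict
    if b_changed and not a_changed:
        return "b"  # only A' was edited, so there is no conflict
    # Both sides changed the same item: a real conflict, so apply the policy.
    if policy == "newest":
        return "a" if on_a.modified >= on_b.modified else "b"
    if policy == "prefer_a":
        return "a"
    raise ValueError(f"unknown policy: {policy}")


base = Version("h0", 100.0)
print(resolve(base, Version("h1", 200.0), Version("h2", 300.0)))
```

Note that "newest wins" silently discards the older edit, which is why some systems prefer to flag conflicts for a person to review instead.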
Shared Active / Active
In the Shared Active / Active configuration two separate systems are over-provisioned to act as a mutual backup. If system A/B’ fails, then the A clients would connect to system A’/B, and vice versa.
The Shared Active / Active configuration has advantages over the two previous configurations in that both systems can be continuously used, without the complications of data conflicts.
You can see more DR configurations in this technical brief.
An Interplay Production system is commonly deployed with Avid shared storage systems, either ISIS or Avid NEXIS. These systems can be configured to have local replication using the configurations listed above. For example, here is an Active / Passive configuration.
A second complete system can be configured at another location within the site. This system is considered the Backup Workgroup. An instance of the Interplay Copy service is configured to continuously back up the data in the Primary Workgroup to the Backup Workgroup. This process copies the clips and sequences (asset metadata) as well as the video and audio files (asset essence) to the Backup Workgroup.
For regional replication, where the connection between the two systems has high latency (for example, over a WAN), the mirroring to the backup system can be performed using the backup and restore functionality in Interplay. A synchronization application can be configured to copy the Interplay backup data to the backup system in an Active / Passive configuration.
The following steps are scheduled to run in order at a given interval (daily, every 8 hours, every hour, etc.):
- A backup of the Primary Workgroup runs
- The synchronization application copies the backup data from the Primary Workgroup to the Backup Workgroup
- The database is restored to the Backup Workgroup
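The order of the three steps above matters: a backup that failed to run or copy should not be restored. Here is a minimal sketch of that sequencing. The step functions are placeholders; this post does not show the actual Interplay backup and restore commands, so none are assumed here.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]  # (name, action returning True on success)


def run_cycle(steps: List[Step]) -> List[str]:
    """Run the steps in order and stop at the first failure.

    Returns the names of the steps that completed.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break
        completed.append(name)
    return completed


# Placeholder actions; replace each lambda with a call to the real tool.
steps: List[Step] = [
    ("backup primary workgroup", lambda: True),
    ("copy backup data to backup workgroup", lambda: True),
    ("restore database on backup workgroup", lambda: True),
]
print(run_cycle(steps))
```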
The media files can be copied using the same synchronization application. A File Gateway system is configured to allow access to the remote backup system using the CIFS client.
Applications for Synchronizing Files
There are several applications available for synchronizing file systems. Mirroring is used for Active / Passive systems, where the primary file system is copied to the backup system. Synchronization is used for Active / Active systems, where changes made on either file system are made on the other system.
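The difference between the two is easiest to see on a pair of file listings. The sketch below only plans the operations and uses made-up operation names. It is also simplified: a real two-way synchronizer keeps state from the previous run so it can tell a deleted file from a new one, which this sketch does not attempt.

```python
from typing import Dict, List, Tuple

Listing = Dict[str, float]  # relative path -> modification time
Action = Tuple[str, str]    # (operation, relative path)


def plan_mirror(primary: Listing, backup: Listing) -> List[Action]:
    """One-way: make the backup look exactly like the primary."""
    actions: List[Action] = []
    for path, mtime in sorted(primary.items()):
        if backup.get(path) != mtime:
            actions.append(("copy_to_backup", path))
    for path in sorted(backup.keys() - primary.keys()):
        actions.append(("delete_on_backup", path))
    return actions


def plan_sync(a: Listing, b: Listing) -> List[Action]:
    """Two-way: the newer copy of each file is propagated to the other side."""
    actions: List[Action] = []
    for path in sorted(a.keys() | b.keys()):
        time_a, time_b = a.get(path), b.get(path)
        if time_a == time_b:
            continue
        if time_b is None or (time_a is not None and time_a > time_b):
            actions.append(("copy_to_backup", path))
        else:
            actions.append(("copy_to_primary", path))
    return actions


system_a = {"x": 1.0, "y": 2.0}
system_a_prime = {"x": 1.0, "z": 5.0}
print(plan_mirror(system_a, system_a_prime))
print(plan_sync(system_a, system_a_prime))
```

In the example, mirroring deletes file "z" from A’ because it does not exist on A, while synchronizing copies it back to A.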
This table shows free utilities that can be used to actively mirror or synchronize file systems:
|Application||Platforms||Mirror||Synchronize|
|Robocopy||Win||Yes||No|
|rsync||Win, Linux, Mac||Yes||No|
|Unison||Win, Linux, Mac||Yes||Yes|
These applications all have the option to use file dates to optimize scanning and file transfers. When files are copied, the new file’s modification date is updated to match the original file. This allows the application to cut down on file “scanning” to determine if a file has been changed, making the process faster.
The applications also have the ability to skip over specified folders. This can be used to prevent the copying of files that are actively being created.
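To show roughly how the date check and the folder exclusion work together, here is a minimal one-way mirror sketch in Python. It is not how any of these tools is implemented internally, the function names are made up, and it does not delete files that have disappeared from the source.

```python
import os
import shutil
from typing import Iterable, List


def needs_copy(src: str, dst: str) -> bool:
    """Cheap change check: compare size and modification time, not file contents."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def mirror_once(src_root: str, dst_root: str, skip_dirs: Iterable[str] = ()) -> List[str]:
    """Copy new or changed files from src_root to dst_root; return what was copied."""
    skip = set(skip_dirs)
    copied: List[str] = []
    for folder, subdirs, files in os.walk(src_root):
        # Pruning the list in place stops os.walk from descending into skipped folders.
        subdirs[:] = [d for d in subdirs if d not in skip]
        rel = os.path.relpath(folder, src_root)
        target = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target, exist_ok=True)
        for name in files:
            src = os.path.join(folder, name)
            dst = os.path.join(target, name)
            if needs_copy(src, dst):
                # copy2 preserves the modification time, so the next scan sees a match.
                shutil.copy2(src, dst)
                copied.append(os.path.normpath(os.path.join(rel, name)))
    return sorted(copied)
```

Calling `mirror_once(source, destination, ["creating"])` a second time with no changes copies nothing, because the sizes and dates already match.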
Using Scripts for Continuous Backups
Note that the sync apps mentioned above each work as a “one shot”. The commands will not run in a continuous mode without some further scripting. However, this can be easily achieved in most scripting languages, using the power of “goto technology”:
:loop
echo mirroring system-a to system-a-prime %date% %time%
robocopy \\system-a\folder \\system-a-prime\folder /e /purge /xd creating
timeout /T 60 > NUL
goto loop
The script above will repeatedly copy new or changed files from System A to A’. It starts in the directory named “folder”. The “/e” flag specifies a recursive copy, which scans all subdirectories (including empty ones) looking for files to copy. The “/purge” flag deletes files and directories on System A’ that no longer exist on System A. The “/xd” flag excludes directories by name, so the application skips over files in any folder named “creating”.
Also note the number 60 in the timeout command. This specifies a one-minute delay between file system scans to reduce the CPU and I/O load. This number can be tuned to balance the I/O load against the frequency of backups to the DR system.
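The same run, wait, repeat pattern can be written in other scripting languages too. Here is a minimal Python sketch. The `max_cycles` and `sleep` parameters are additions of mine so the loop can be bounded and tested; with the defaults it runs until interrupted.

```python
import time
from typing import Callable, Optional


def run_forever(
    task: Callable[[], None],
    interval_seconds: float = 60.0,
    max_cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `task`, wait `interval_seconds`, and repeat.

    Loops indefinitely when `max_cycles` is None; returns the number of cycles run.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        task()
        cycles += 1
        sleep(interval_seconds)
    return cycles


# Two quick cycles with a short delay, just to show the call.
run_forever(lambda: print("mirroring system-a to system-a-prime"), 0.1, max_cycles=2)
```

A shorter interval means the backup lags less behind the primary, at the price of more frequent scans.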
DR Tag Team at the ACA
At the ACA in Vegas last week, I gave a presentation on DR with Dan Keene from World Wrestling Entertainment. WWE is a global sports entertainment company headquartered in Stamford, Connecticut. The company is one of the largest producers of original content distributed to 180 countries across the globe. They produce more than 40 hours of original programming every week. Dan discussed WWE’s plan to build a regional Active / Passive DR system to duplicate their production media on Avid shared storage and Interplay to a virtualized environment at a remote location using the techniques mentioned above.
We got some good questions after the talk. Here are a couple of them with answers:
Q: What software are you using for switching clients from the primary to the backup system?
A: WWE uses a utility called Production Selector from Jelly Bean Media to automate the connection to their systems.
Q: How do you keep the users and workspaces between the primary and backup shared storage systems in sync?
A: Currently it’s a manual process. Changes to users and workspaces on the primary system must also be made on the backup system. Avid is looking into using the Data Migration Utility to help automate making these changes.
By the way, if you happen to be in the Washington DC area on May 25, you can catch my talk on DR at the SMPTE Bits by the Bay Conference.
Utilizing system redundancy, locally and/or regionally, is the best way to ensure that media production systems keep running smoothly. Various DR configurations can be deployed to meet the needs of your business. There are free tools available to automate the backing up and restoring of data to local or remote systems.
Using these techniques will ensure that your big budget project won’t go bust if bad things happen. Knock on wood.