Data Backup
Backing up your data is essential for ensuring that it can be retrieved and reused.
3-2-1 Rule
The 3-2-1 rule states that you should have:
- 3 Copies of your data
- 2 Different mediums (2 different kinds of storage)
- 1 Copy is kept off-site
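As a rough illustration, the three conditions above can be checked in a few lines of Python. This is only a sketch: the `DataCopy` record, its field names, and the example locations are hypothetical and not part of any UU tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """One copy of a dataset (hypothetical record for illustration)."""
    location: str   # free-text label, e.g. "laptop"
    medium: str     # kind of storage, e.g. "ssd", "hdd", "cloud"
    offsite: bool   # True if the copy is stored away from the main site


def check_321(copies):
    """Return a list of unmet 3-2-1 conditions (empty when all are met)."""
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need at least 3")
    media = {c.medium for c in copies}
    if len(media) < 2:
        problems.append(f"only {len(media)} kind(s) of storage, need at least 2")
    if not any(c.offsite for c in copies):
        problems.append("no copy is kept off-site")
    return problems


copies = [
    DataCopy("laptop", "ssd", offsite=False),
    DataCopy("external drive", "hdd", offsite=False),
    DataCopy("cloud storage", "cloud", offsite=True),
]
print(check_321(copies))  # an empty list means the rule is satisfied
```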
Ideally, this should all be automated so that you as a user do not have to manually do anything extra to manage your data. Cloud services such as SURF Drive and Office365 (OneDrive, TeamSite, SharePoint) automatically back up your data regularly across multiple data centers, mitigating any potential data loss.
If this is not automated by your system, or the system only keeps two copies of the data, we suggest regularly making extra copies of your data to another system. If possible, automate this backup as well with computer scheduling tools such as Task Scheduler (Windows) or Crontab (Linux).
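If you need to make such extra copies yourself, a small script can do the copying. The sketch below assumes a plain folder-to-folder copy is sufficient; the function name `backup_folder` and the timestamped folder naming are illustrative choices, not a prescribed method.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_folder(source, backup_root, now=None):
    """Copy `source` into a new timestamped folder under `backup_root`.

    Returns the path of the new copy. Raises an error if a folder with
    the same timestamp already exists, so earlier backups are never
    overwritten.
    """
    source = Path(source)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-{stamp}"
    shutil.copytree(source, target)  # also creates missing parent folders
    return target
```

A script that calls `backup_folder` with your own source and destination paths could then be run on a schedule by Task Scheduler or cron.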
Implement a data backup strategy
Implementing a data backup strategy doesn’t have to be complicated. Here’s a step-by-step guide:
Assessment
Identify data that cannot be easily recreated or recalculated and ensure that it is sufficiently backed up with the 3-2-1 rule.
Determine the frequency of backups based on data changes and importance.
Assess institutional requirements dictating data backup and retention.
Select Backup Locations
Use the suggested data storage platforms provided by UU, mentioned in the matrix on our Suggested Storage Locations page. You may use a mix of the different platforms.
Choose a mix of local and offsite storage options.
Encrypt data to protect against unauthorized data access.
Make sure the backup locations you choose provide version control functionality.
Determine the required storage space for your backups, taking any potential data growth into account.
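A rough estimate of the required space can be worked out as follows. This sketch assumes a constant yearly growth rate and that every copy stores the full dataset; the function name and the example figures are hypothetical.

```python
def required_backup_space_gb(current_gb, annual_growth, years, copies=3):
    """Estimate total backup space in GB.

    current_gb:    size of the data today
    annual_growth: expected yearly growth as a fraction (0.20 = 20%)
    years:         planning horizon in years
    copies:        number of full copies kept (3 under the 3-2-1 rule)
    """
    projected_gb = current_gb * (1 + annual_growth) ** years
    return projected_gb * copies


# Example: 500 GB today, growing 20% per year, planned 3 years ahead.
# 500 * 1.2**3 = 864 GB per copy, so about 2592 GB for three copies.
print(round(required_backup_space_gb(500, 0.20, 3)))
```

Version history kept by a storage platform takes additional space that this estimate does not include.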
Data Classification
Prioritize your data and back up the most critical information more frequently. Tailor your backup strategy based on the importance and sensitivity of the data.
Assess how crucial each dataset is to your research or to meeting your other goals. Intermediate data that is not part of the final results can still be useful and should still be backed up, but backing it up may be a lower priority.
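One way to make such priorities concrete is to attach a backup interval to each class of data. The class names and intervals below are purely hypothetical examples; choose values that fit your own data and institutional requirements.

```python
from datetime import datetime, timedelta

# Hypothetical priority classes and how often each should be backed up.
BACKUP_INTERVALS = {
    "critical": timedelta(days=1),
    "important": timedelta(days=7),
    "intermediate": timedelta(days=30),
}


def backup_due(priority, last_backup, now):
    """Return True if data of this priority is due for a new backup."""
    return now - last_backup >= BACKUP_INTERVALS[priority]


last = datetime(2024, 1, 1)
today = datetime(2024, 1, 3)
print(backup_due("critical", last, today))   # two days exceeds the 1-day interval
print(backup_due("important", last, today))  # two days is within the 7-day interval
```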
Selecting Data to Back Up
During your research you may accumulate a lot of data, some of which will be important to keep and preserve for the long term. However, it is impossible to preserve all data indefinitely, especially in several copies. Backing up multiple copies of all digital data brings high costs, both for the storage itself and for maintaining and managing this ever-growing volume of data and its metadata; it may also lead to a decline in discoverability. For those reasons, it is crucial that you carefully select data for backing up and preservation.
Selecting data means making choices about what to keep for the long term, and what data need to be kept in a secure and directly retrievable way. This means that you have to decide whether your dataset contains data that is considered less important or of temporary value.
Reasons to exclude data from backing up include (but are not limited to):
- Data which is redundant or incomplete
- Data which concerns temporary byproducts that are irrelevant for future use
- Data which is a direct and intact copy of an open dataset which remains directly accessible from an internal or external source
- Data which was purchased by the UU and is not its intellectual property
- Data with a contractual obligation to delete
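Some of these exclusions, such as temporary byproducts, can be applied automatically when files follow a recognizable naming pattern. The sketch below assumes that is the case; the patterns in `EXCLUDE_PATTERNS` are hypothetical examples and should be replaced with ones that match your own project.

```python
from fnmatch import fnmatch
from pathlib import Path

# Hypothetical patterns for files that do not need to be backed up.
# Note that "*" in fnmatch also matches across folder separators.
EXCLUDE_PATTERNS = ("*.tmp", "*.log", "cache/*")


def select_for_backup(root, exclude=EXCLUDE_PATTERNS):
    """Return relative paths of files under `root` that match no exclude pattern."""
    root = Path(root)
    selected = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(root).as_posix()
        if any(fnmatch(relative, pattern) for pattern in exclude):
            continue
        selected.append(relative)
    return selected
```

Exclusions that depend on licences or contracts cannot be detected from file names and still need a manual decision.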
To determine which data in your dataset is worth backing up, please use this checklist:
- Determine which parts of your data are sensitive and would cause problems if you no longer had them in your possession
- Identify the data which is crucial to substantiate the findings of your research, as described in research articles, papers or theses
- Identify the data you need for preparing data publications or for the outreach of your project
- Identify the data which is interesting for re-use by others, within or outside the project
- Where possible, the original (primary/raw) data should be backed up, possibly together with the code, processing scripts, or processing instructions needed to consult the data. Permanent enriched data files derived from the primary data are also important to back up, so that they can be used for analysis as described in the methodology section of the research.
We recommend starting to back up a dataset once it has reached a stage that represents a distinguishable phase, such as initial data collection, analysis or other processing activities. Make sure the backed-up data is renewed regularly.
For maintenance purposes and to ensure long-term accessibility, data files should preferably be backed up in ‘sustainable’ file formats following the FAIR guidelines, where possible. A list of sustainable file formats can be found here, or you can contact your data steward for assistance in finding an open and sustainable file format.
Data Failures
Knowing how storage can fail helps you make better backup and data handling decisions.
- Hard Disk Drives (Spinning drives, internal and external)
- Read/Write head and platter collision – Inside your hard drive is a read/write head that glides over your hard drive platters (the disks that store the data), and if those collide, the platters can be scratched or worse, and the disk will become unusable. Avoid moving hard disks or exposing them to vibrations, especially while they are turned on.
- Magnetic interference – The platters of your hard drive are magnetically sensitive disks, and exposure to a magnetic field can change the data stored on the platter, altering your data in a possibly unrecoverable way. Keep magnets away from hard drives.
- Solid state drives (SSDs)
- Storage Media Write Limits – Inside solid state drives are silicon chips that store data as charge (electrons) in cells. One charge level corresponds to 0 and another to 1, and sometimes a cell can hold more than 2 charge states. Each cell can only be rewritten a limited number of times before it becomes unusable. Check the SMART status of your SSDs from time to time and make sure it reports as “OKAY”, or that the drive reports at least 90% of its cells as usable.
- USB Drives and Memory Cards
- Physical Loss – These devices are usually very small and easily lost. Use these media only when absolutely necessary, keep them in an organizer or in the device that requires them, and back up to a more secure location as soon as possible.
- Low quality storage media – Many of the cheapest USB drives are made with the lowest quality storage media around and wear out quickly from repeated reading and writing. When possible, make sure you have 2 copies of the same data. For example, if your camera has dual card slots, configure it to write every photo file to both SD cards for redundancy of the original data.
- False advertising – Especially cheap or even counterfeit devices in this range are commonly sold at too-good-to-be-true prices, and that’s because they’re lying. The device lies not only in its advertising but also to the computer it plugs into. The most common lie is about storage size: the device advertises more storage than it physically has. It may report as 64GB (for example) but only have actual space for 8GB. It will accept more data, but it will write over existing data, immediately losing your data. Do not buy cheap storage drives.
- Network drives
- Out-of-sync backups – Network drives that automate their backups can be out of sync. The main computer may have received the data, but that data didn’t get a chance to be sent to a backup server before a failure happened.
- Natural disaster – All kinds of natural (and human-made) disasters can happen anywhere, and data that is kept in one geographic place is prone to be affected by the same event. Make sure the data on your network drives is kept in multiple locations.
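Out-of-sync backups, as described for network drives above, can be detected by comparing the source with its backup copy. The sketch below assumes both locations are reachable as ordinary folders and compares SHA-256 checksums; the function names are illustrative.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_out_of_sync(source, backup):
    """Compare every file under `source` with its counterpart under `backup`.

    Returns two lists of relative paths: files missing from the backup,
    and files whose content differs from the backup copy.
    """
    source, backup = Path(source), Path(backup)
    missing, different = [], []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(source)
        copy = backup / relative
        if not copy.is_file():
            missing.append(relative.as_posix())
        elif file_digest(path) != file_digest(copy):
            different.append(relative.as_posix())
    return missing, different
```

If both lists are empty, every file in the source has an identical copy in the backup. Files that exist only in the backup are not reported by this sketch.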