We often talk in general terms about archiving data. But how many of us really know what that term means, or what we should be looking for in an archive system?
To many people, an archive is just a “large, cheap and reliable” storage system. Somewhere to dump your old tax returns, photos or kids’ reports – then inevitably forget about them and rarely ever access them again. But in truth, what is critical to the success of a high-performance compute archive system is that you CAN retrieve your data in the future, and that it’s quick and easy to retrieve in five, ten, even twenty years’ time.
Retrieval doesn’t need to be highly performant, but it should be easy.
This might seem a simple concept, but it raises many issues. For example, you might be able to read data back from the archive system, but how do you know it hasn’t changed? How do you know it is exactly as you wrote it 20 years ago?
Wait, what? Data can change once you write it to a medium?
Sadly, yes. For example, one possible cause is ‘bit-rot’: the random flipping of the 1s and 0s that represent your data on disk, tape or memory.
To at least know whether you have suffered some sort of corruption, you need to ensure your data is “checksummed”. So what is checksumming?
Say your data is 123456.
A very simple checksum would be 1+2+3+4+5+6 = 21. So you store your original data, 123456, AND the checksum, 21. When you read your data back, you check that the checksum is still correct. If it isn’t, then you have a problem – and corrective action needs to be taken. However, you need checksumming everywhere, at every stage: when the data is generated, when it is stored on the local system, when it is transferred over the network, when it lands in archive storage, and when it is read back in. Any component in the system can fail at any time, and needs to be continually monitored.
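To make this concrete, here is a minimal sketch in Python of what write-time checksumming and read-time verification might look like. It uses the standard hashlib library in place of the toy digit-sum above; the names and structure are illustrative only, not a description of any particular archive product.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Compute a digest that is stored alongside the data itself."""
    return hashlib.sha256(data).hexdigest()

# At write time: keep the data AND its checksum.
data = b"123456"
stored_checksum = checksum(data)

# At read time, possibly years later: recompute and compare.
if checksum(data) != stored_checksum:
    raise RuntimeError("Checksum mismatch - the data is not what was written")
```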
Next, you need to scrub your data. Data scrubbing is an error-correction technique that involves regularly reading the data back and comparing it against its checksum. It enables early detection of any change, so corrective action can be taken immediately.
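As a rough illustration, a scrub pass over an archive might look something like the sketch below. The archive_index mapping of file paths to their recorded checksums is a hypothetical stand-in for whatever catalogue your archive system actually keeps.

```python
import hashlib
from pathlib import Path

def scrub(archive_index: dict[str, str]) -> list[str]:
    """Re-read every archived file and flag any whose checksum no longer matches."""
    corrupted = []
    for path, expected in archive_index.items():
        actual = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if actual != expected:
            corrupted.append(path)  # repair these from a redundant copy
    return corrupted
```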
A crucial archiving step is to take redundant copies of your data. Redundancy is the existence of replicated data that allows errors to be detected and corruption to be recovered from. This could be a complete second copy of the data, or something more sophisticated. BUT every component of the data needs to be checksummed independently, so you know which component is correct. Remember, it could be one or both copies of your data 123456, or the checksum 21, that has changed – you need to be able to tell which piece of data has changed. It is also important to keep redundant copies in different physical locations, so that a disaster at one site can be recovered from the other.
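Continuing the sketch above, independently checksummed copies let you work out which replica is still good and use it to rebuild the other. The recover function below illustrates the logic only; it is not a description of how any specific system performs repair.

```python
import hashlib

def intact(data: bytes, expected: str) -> bool:
    """True if this copy still matches the checksum recorded for it."""
    return hashlib.sha256(data).hexdigest() == expected

def recover(copy_a: bytes, sum_a: str, copy_b: bytes, sum_b: str) -> bytes:
    """Return a known-good copy, using whichever replica still verifies."""
    if intact(copy_a, sum_a):
        return copy_a  # use A to rebuild B if B has rotted
    if intact(copy_b, sum_b):
        return copy_b  # use B to rebuild A
    raise RuntimeError("Both replicas failed verification - data is unrecoverable")
```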
Another important issue is to have the archive system built from components that are reliable and fault-tolerant. For example, if a component of the storage medium fails, it needs to be replaced quickly and easily, and any lost data automatically recovered from the redundant copies.
Finally, it is often best to use standardised open formats. This will ensure that, twenty years down the track, you still have software that can read your data.
At DUG, we provide all these features via our online-archive product – built on the back of years of experience handling big data. How does your archive solution measure up?