In which I discuss the various ways of backing up all of your stored data.
Big Data on a Budget
Good evening everybody and welcome to this short chat on backing up big data on a budget.
How many times have you heard it? Back-ups are important: you need to back up your data so that you don't lose it in the event of a catastrophe. Which raises the questions: how many back-ups should you have? Where should they be stored? How much do you want to spend protecting your data?
There are a number of options for backing up your stuff. Some people use external hard drives. Some use CD and DVD media, Blu-ray media, or tapes. Some people even print data out to be scanned back in. But let's consider the average data hoarder. You've managed to collect a couple of terabytes of data over the years because, let's face it, you've been on the internet, you've got a reasonably fast connection, and your hoarding instincts have resulted in you downloading a stack of stuff that you consider highly, highly important.
The first thing you need to decide is how much of your data actually needs to be backed up. Yes, it would be inconvenient if those three seasons of Game of Thrones were deleted off your hard drive, but you could probably torrent them down again. Not so much the recording of your commitment ceremony, or some song you slogged over for four hours with a group of friends who are never, ever going to meet in person again.
So first of all, decide whether you're backing everything up or only the important stuff. If you're backing up everything, then that's pretty straightforward, and we can move on to the people who might be backing up only some of their stuff.
How much do you want to spend on storage, and how reliable do you want that storage to be? Nothing is ever a hundred percent reliable, so you're always working against probabilities, equipment failure and acts of God when deciding how stable and reliable your back-ups are going to be.
If you have under a terabyte of data, then one of the online cloud services such as Google Drive, Dropbox, Box or SpiderOak might be worth looking at. However, it's worth considering: do you trust these services to look after your data? And if you intend to encrypt the data before you back it up, how are you going to do that? With what program and what algorithm, and what are you going to do in the event that you lose a passphrase?
There are various back-up services that claim to be able to back up unlimited amounts of data for, say, $5.00 a month per computer; I'm referring here to services such as CrashPlan and Backblaze. A couple of things to keep in mind with these services. You probably want to use the local encryption settings, so that your data is encrypted before it leaves your computer; that way, even if CrashPlan or Backblaze or a similar service is subpoenaed for your data, nobody will be able to decrypt it. This is probably important if you've nicked a whole pile of pirated stuff or you've got a whole lot of data on your hard drive that you really shouldn't have: child porn, terrorist bomb plans, the list goes on. (I hope nobody listening to these audio boos has those sorts of things on their hard drives.)
Now, CrashPlan claims that it does in fact back up unlimited data, but there are a couple of things to consider. How fast is your internet upstream? If it's fast enough, you might be able to push one to ten gig of data a day, which still means that backing up two terabytes is going to take a significantly long time. You can drop $350.00 on a seed drive that they send to you: you fill it with up to a terabyte of data, ship it back, and they preload that data into your account. Even so, the initial backup will take significant time and significant bandwidth, possibly impacting your day-to-day internet use, so you'll need to decide whether backing up all of that data over the internet is actually worth it.
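To put that in perspective, here's a back-of-the-envelope calculation. The speeds are illustrative assumptions, not measurements of any particular connection:

```python
# Rough estimate of how long a full cloud backup takes at a given
# sustained upstream rate. Real links vary; these figures are examples.

def days_to_upload(total_gb, gb_per_day):
    """Days needed to push total_gb of data at gb_per_day upstream."""
    return total_gb / gb_per_day

# 2 TB at 10 GB/day (a fast home upstream running around the clock):
print(days_to_upload(2000, 10))   # 200 days
# The same 2 TB at 1 GB/day: over five years.
print(days_to_upload(2000, 1))    # 2000 days
```

That's why the seed-drive option exists: shipping a terabyte by courier beats most home upstreams by a wide margin.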
If you do decide to back it up, CrashPlan, Backblaze, whatever, will store all of the stuff that you need to store. Do, however, be aware of the terms and conditions of the plan, and read the fine print to make sure that unlimited truly does mean unlimited.
Another option may be external drives; a lot of people are seeing external hard drives on Amazon for $129.00 that hold 5 terabytes. These are quite handy and can store a lot of data, but a couple of things need to be kept in mind. As I always say, and as many other storage experts say, it's not if a hard drive will fail, it's when. All hard drives fail eventually. Some fail within a day or a week or a month of being owned; some are still ticking away after ten years. The probability of hard drive failure is something you can read research papers on, and there can be endless debates as to whether Western Digital or Seagate or your other favourite drive brand is best. Even that can fluctuate between manufacturing batches, temperature conditions and shock handling.

So if you are going to go out and buy yourself a five terabyte hard drive, it might not be a bad idea to buy two, so that you've got one connected to your computer as hot storage and a second drive that acts as your back-up in case something goes wrong with the hot storage. You'll need a way to keep the two drives synchronised. If you're on a PC platform, I would strongly suggest something like Robocopy. TeraCopy is fine in the GUI, and Microsoft's SyncToy will handle GUI lists of folders that need to be kept in sync, though I'm not sure what SyncToy's limitations are. Robocopy will quite happily copy terabytes from one drive to another and keep the archives in sync. You do need to be careful with Robocopy, however, because you need to specify the /XO switch so that it doesn't copy old information over new information. It's also worth noting that Robocopy is a command line utility, and unless you get hold of a friendly geek to help you with the batch file, you may have trouble automating this.
Also, not all batch files are created equal. I've seen a lot of batch files for Robocopy missing the /XO switch, but a peruse of the Robocopy documentation does in fact point out that /XO is somewhat important. You also need to exclude the files that you don't want to back up, such as BTSync folders, Dropbox control information, etc.
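For those who'd rather see the logic than wrangle batch files, the core of what Robocopy's /XO switch does, mirror a tree but never overwrite a newer file with an older one, can be sketched in a few lines of Python. The excluded folder names here are illustrative assumptions, not a definitive list:

```python
import os
import shutil

# Hypothetical sync/control folders you don't want in the backup.
EXCLUDE = {".sync", ".dropbox.cache"}

def mirror(src, dst):
    """Copy files from src into dst, skipping excluded folders, and
    never overwriting a newer destination file with an older source
    file (the same idea as Robocopy's /XO switch)."""
    for root, dirs, files in os.walk(src):
        # Prune excluded directories so os.walk never descends into them.
        dirs[:] = [d for d in dirs if d not in EXCLUDE]
        target = os.path.join(dst, os.path.relpath(root, src))
        os.makedirs(target, exist_ok=True)
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(target, name)
            # Copy only if the destination is missing or strictly older.
            if not os.path.exists(d) or os.path.getmtime(d) < os.path.getmtime(s):
                shutil.copy2(s, d)
```

This is a one-way mirror only; it doesn't delete files from the destination, and a real backup script would also want logging and error handling.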
The other thing to keep in mind is that if you're on a Mac or a Linux box, you may want to consider rsync for backing stuff up. rsync is handy in that it's quite flexible, can handle thousands and thousands of files, and can be fired off from cron jobs relatively easily.
So you have two hard drives, one primary, one secondary. You went and ponied up and got two five terabyte hard drives. It's up to you whether you rotate them on a weekly or monthly basis so that the spare becomes the regular one and the regular one becomes the spare. But there's another thing to keep in mind: even data sitting quietly on a hard drive can be lost. Drives have an uncorrectable bit error rate, which means that occasionally they won't be able to pull back a sector that was written to them. This doesn't happen very often, but it does happen. So what guarantees that all of the data you have written to your drives is actually uncorrupted?
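To get a feel for how often that bites, here's a rough probability sketch, assuming a drive quoted at one uncorrectable error per 10^14 bits read, a common rating on consumer drive spec sheets:

```python
import math

def p_read_error(bytes_read, uber=1e-14):
    """Probability of hitting at least one unrecoverable bit error
    while reading bytes_read bytes, for a drive rated at `uber`
    errors per bit. Treats bit errors as independent, which is a
    simplification; real failures cluster."""
    bits = bytes_read * 8
    # 1 - (1 - uber)^bits, computed stably for tiny uber.
    return -math.expm1(bits * math.log1p(-uber))

# Reading a full 5 TB drive end to end at the quoted rate:
print(round(p_read_error(5e12), 2))   # 0.33, roughly a one-in-three chance
```

The absolute numbers depend on how honest the quoted rating is, but the shape of the result is why checksums matter once you're into the terabytes.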
I would strongly suggest finding a utility that will generate SHA1 or MD5 sums of trees of files. Make lists of the files on your hard drive with their sums, and scatter the manifest and catalogue across a couple of cloud services. Then, if you need to test a hard drive to see whether it is failing, you can pull back the SHA1 sums and run them against the data to catch any differences. At least that way you will be able to tell which of your two drives is good and which has gone bad.
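If you can't find a ready-made utility you like, a minimal manifest generator is only a few lines of Python; this is a sketch, and a real one would want to stream its output to a file rather than hold everything in memory:

```python
import hashlib
import os

def sha1_of(path, chunk=1 << 20):
    """SHA1 of a file, read in 1 MB chunks so huge files fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def manifest(tree):
    """Walk a directory tree and return {relative path: sha1 hex digest}."""
    result = {}
    for root, _dirs, files in os.walk(tree):
        for name in files:
            full = os.path.join(root, name)
            result[os.path.relpath(full, tree)] = sha1_of(full)
    return result

def compare(old, new):
    """Paths from the old manifest whose checksum changed or vanished."""
    return {p for p in old if new.get(p) != old[p]}
```

Generate the manifest when you write the backup, park copies of it on a couple of cloud services, and later run `compare(saved_manifest, manifest(drive))` to see exactly which files a suspect drive has mangled.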
The other thing to keep in mind is that all of this takes some time and some ingenuity to set up. You'll have to find the right utilities and the right batch files, and you'll have to be disciplined enough to carry out this back-up plan on a regular basis. Things like Carbonite, CrashPlan and Backblaze make it easy, because services run in the background that back the data up to the cloud, your external hard drives or your friends' computers. It's worth mentioning that if you do use CrashPlan to back up data to your friends' computers, you'll be using their disk space and will have to negotiate with them; but keep in mind that the data is encrypted, and your friends don't get access to your CrashPlan data.
If you are going to back up data to locally connected hard drives, you may wish to consider whether the data should be encrypted to protect it from prying eyes. Either way, the SHA1 sums are certainly worth having.
So some of you are going to ask: how do we even know if a hard drive is going to fail? Is there any warning that a drive is on the way out? Well, it turns out that in approximately 70 to 80% of cases there is actually warning that a drive is going to fail. The technology that tells you this is known as SMART: Self-Monitoring, Analysis and Reporting Technology. There is a package for Linux, MacOS and Windows called smartmontools that can run in the background on your system as a service and provide you with information about impending drive failures. This works well for drives that are directly installed in the computer, but some USB-to-SATA bridges don't pass through the SMART inquiry commands in a standard way, so you may have to do some fiddling to get smartmontools to check those drives.
Other external drives, such as the WD series and some of the Seagate drives, do come with software that is meant to monitor the health of the drive and warn you of a potentially impending failure. It's possible that the warning will not come in time, though, and a mechanical fault that stops the drive from powering up or spinning will give you no SMART warning at all, even if you do use this technology. So SMART is one of those things that just makes life a little bit safer and a little bit more informative. It has, however, allowed me to replace failing drives in RAID arrays.
RAID. I suppose I should mention RAID: a Redundant Array of Inexpensive Disks. If you have multiple copies of data, then it's less likely that you're going to lose the data; a pretty simple idea. RAID 1 is an exact mirror: two drives contain exactly the same data. Every write goes to both drives, so write speed is limited to the slower of the two, while reads can be spread across both drives, roughly doubling read throughput. RAID cards with caches and such can be useful here, so the operating system can dump a bit under a gig of data at the controller and the controller can get on with writing it to the array. Be aware, though, that there are pitfalls to RAID.
Most people reckon that building themselves a RAID 5 array in a home NAS is going to be a good way to back up the three or four terabytes of data they've got. A couple of things about RAID 5 arrays that I have learnt from painful experience. When a drive fails in a RAID 5 array, it is imperative that you replace the failed drive as soon as possible: that is, dash down to the computer store the day the drive fails, or keep another drive on hand as a hot spare. There is, however, a problem with RAID 5: once one drive has failed, it is possible, and in fact more probable than you'd think, that while rebuilding the array onto the replacement drive, one of the remaining drives will fail. If two drives fail in a RAID 5 array, you're essentially left doing block-level restores, dragging as much readable data off the array as possible with tools that might break your brain.

RAID 5 also uses a fair amount of CPU. Some of the home NASes, which vary in accessibility, such as the QNAP units, use a 1.2 GHz ARM processor and a cut-down version of Linux with maybe 256 or 512 MB of RAM; they'll read and write to the RAID drives fairly OK but will burn a fair bit of CPU doing it. These home NASes take anywhere between two and four drives, are fairly quiet and low-power, and have iTunes servers and all sorts of other stuff in them. The questions you've got to ask yourself are: how accessible are the web interfaces, are they in fact usable, and are you going to pay for the extended tech support to help you rebuild the array in the event that it fails? Or are you an mdadm ninja who can SSH into the thing, like I usually do, and rebuild the arrays by hand, provided they're willing to be rebuilt? RAID 5 is a nice option because it gives you n minus 1 drives of usable capacity.
So say you have four two terabyte drives, that's eight terabytes of raw storage, and you put them in a RAID 5 array. One drive's worth of space is used for parity redundancy information (in RAID 5 the parity is actually striped across all of the drives), which means that with four two terabyte drives you end up with six terabytes of usable storage, minus a little bit for administrative overhead. RAID could probably have its own discussion; I could do an entire boo on RAID, and that may happen another night.
RAID 6 is a little bit nicer because the array carries dual parity, which means you can handle the loss of two drives. You don't really win much, though, if you've only got four drives: four minus two is two, so with four two terabyte drives you only end up with four terabytes of fairly reliable storage. RAID 6 does start to make sense if you have six or eight drives. With eight drives in RAID 6 you lose two drives for redundancy, so from 16 terabytes of raw storage you end up with 12 terabytes usable, which means the storage ratio gets more efficient the more drives you have. However, don't think you're going to go out and build a RAID 6 array with 27 drives: the more drives you put in an array, the greater the chance that one of them goes down for the count and doesn't come back up. In fact, I could probably do an entire boo on the failures and shortcomings of RAID.
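The capacity arithmetic for both levels boils down to one line (ignoring the small filesystem and metadata overhead mentioned above):

```python
def usable_tb(drives, tb_per_drive, parity):
    """Usable capacity of an array that gives up `parity` drives'
    worth of space: parity=1 for RAID 5, parity=2 for RAID 6."""
    return (drives - parity) * tb_per_drive

print(usable_tb(4, 2, 1))   # RAID 5, four 2 TB drives: 6 TB usable
print(usable_tb(4, 2, 2))   # RAID 6, four 2 TB drives: 4 TB usable
print(usable_tb(8, 2, 2))   # RAID 6, eight 2 TB drives: 12 TB usable
```

You can see why RAID 6 only starts paying for itself at six or more drives: the fixed two-drive parity tax shrinks as a fraction of the array.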
But that will give you some idea as to how safe your data may or may not be. For the technically apt, you could potentially store your data on cloud services such as Google Drive or Amazon S3. You will, however, have to be fairly competent with command line tools and web APIs, unless you're going to use something like Amazon Back-up for S3 or Amazon S3 Explorer, which are two of the almost-accessible apps for Windows. There are of course command line tools for Linux and Mac users, such as s3cmd, that will put and retrieve objects from buckets on Amazon S3, including multipart uploads and so on. Amazon S3 does cost, though, and you'll have to look at the pricing page: essentially three cents a gig a month to store, last time I looked in the US West 1 region, and nine cents per gig to actually retrieve the stuff. You can cut the storage cost dramatically, to around a cent a gig a month, by pushing the data off into Glacier; however, the restore time then jumps to three to five hours per Glacier batch. Also, if you restore more than, I believe, 25% of your data, there are extra restoration fees for Glacier. Somebody's probably running around a data centre somewhere jamming tapes into tape drives; I have no idea whether that's actually true, and if anybody has any information about how Glacier actually works, I'd be happy to hear from you.
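At those quoted prices, the arithmetic for a couple of terabytes looks like this. Re-check the current pricing page before budgeting; these per-gig figures date quickly and vary by region:

```python
# Back-of-the-envelope S3 costing using the rates quoted above
# (3 cents/GB/month to store, 9 cents/GB to retrieve).

def monthly_storage_cost(gb, cents_per_gb=3):
    """Dollars per month to keep gb gigabytes in standard storage."""
    return gb * cents_per_gb / 100

def retrieval_cost(gb, cents_per_gb=9):
    """Dollars to pull gb gigabytes back down."""
    return gb * cents_per_gb / 100

print(monthly_storage_cost(2000))   # 2 TB stored: $60.00 a month
print(retrieval_cost(2000))         # pulling it all back: $180.00
```

Sixty dollars a month for 2 TB is why the "unlimited for $5" consumer services and the cold tiers like Glacier exist; S3 standard storage is priced for data you actually serve, not for a dusty archive.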
Google is playing with a new set of technologies, currently in beta, called Google Nearline Storage: the ability to back up piles and piles of data to Google services with the right web APIs, with a three to five second restore time. I don't know whether that counts as warm storage, but if the technology matures and becomes reliable, it could be quite useful for people running a blindy radio station. Back up the couple of terabytes of music you've got for your radio station to Nearline, and have a jukebox application that pulls back a song in three to five seconds from warm storage while the other song is queued and playing, or while you're banging on about what time it is and how many friends are tuned in.
Look guys, if you have any questions about big data and big data home storage, I'd be happy to hear from you and to answer them. I don't know whether this talk has been useful or instructive to anyone, but if there are things people want me to talk about specifically, I'd be more than happy to put some posts out there about how to handle big data storage and the like. If you've listened this long, thank you very much for listening, and feel free to leave comments on my blog, www.kerryhoath.com. Goodnight everyone.