Luminous Landscape Forum

Raw & Post Processing, Printing => Digital Image Processing => Topic started by: wolfnowl on August 04, 2007, 03:08:22 pm

Title: RAID 5 Crash
Post by: wolfnowl on August 04, 2007, 03:08:22 pm
Hi Folks:

This is lifted from the latest Imaging Resource newsletter (http://www.imaging-resource.com/IRNEWS/)

With the discussion of backup strategies and RAID systems on here, I thought it would be relevant.  The rest of the newsletter is worth reading too...

Mike.


"Publisher's Note: IR Site Returns to Normal

It was a long and hairy battle, but the Imaging Resource servers are now back to normal.

    Apologies for our outage last Friday through Sunday. It was testimony that the best laid plans of mice and men (but particularly the latter) are prone to going astray.

    The short version of the story is that a bad drive appears to have taken down our entire Redundant Array of Independent Disks or RAID. Compounding matters, problems with the DiskSync backup system meant that a restore process that should have taken only a few hours went on for almost 20 hours.

    A word of warning to users of RAID 5 systems: The redundancy built into RAID 5 will tolerate the failure of any one drive in the system, but only in the sense that it can reconstruct missing data. This is now the third time in my career that I've seen a RAID 5 system fail completely. (I ran a small systems integration company in a previous life, else I probably wouldn't have seen as many.)

    In this case, one of the drives in the array failed in a way that caused it to interfere with data flowing over the SCSI bus. Data from all the drives consequently became corrupt, as did any "reconstructed" data that was written back to fix an apparently failing drive. Because the fault affected data from all drives, the system wasn't able to identify which drive was the actual source of the problem. It did indicate a particular drive in the array, but that drive was actually fine; it just happened to be the drive at the particular position along the SCSI bus that was most affected by the bus problem.

    So we had to completely wipe the array (actually moving to a new server chassis and full array of entirely new hard drives), reload the OS and restore from our online backups.

    This was where the second major hassle developed. We use a system marketed by our ISP under the name of DiskSync. It proved horrendously unreliable. It did indeed preserve all our data (it does seem to do a good job of that), but kept hanging whenever it encountered a symbolic link (or alias) in the directory structure. This meant that the process proceeded by fits and starts, needing to be restarted many times.

    It may be we just didn't know critical info about how to use DiskSync, but a utility that purports to be a backup solution for Linux shouldn't hang whenever it encounters a symlink, even in its default configuration.

    Going forward, we're going to configure our servers so one of the secondary boxes will be able to stand in for the main server in a pinch. Performance might be lower on the secondary box and some of the housekeeping and deployment services on the primary box won't be supported, but the site itself would be able to stay up and running.

    Longer term, we plan to install our own hardware here in Atlanta to have hands-on access ourselves when we need it. This solution will also involve a 24x7 "hot spare" synced with the primary server every couple of hours, so we can transfer operations with the flip of a virtual switch.

    On the face of it, this would be a more expensive solution than our current one using DiskSync. But when you consider that the revenue we lost as a result of this outage could have paid for the duplicate hardware in one fell swoop, the economics change."
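
To make the newsletter's point about reconstruction concrete: RAID 5 stripes data across the drives and keeps an XOR parity block for each stripe, so the contents of any one failed drive can be recomputed from the survivors. Below is a minimal sketch of that idea (a hypothetical four-drive stripe with made-up block contents, not any vendor's actual implementation); it also shows why data mangled on the bus, as described above, quietly poisons the rebuild instead of being caught.

# Minimal sketch of RAID 5's XOR parity reconstruction -- a hypothetical
# four-drive array, one stripe of bytes per drive (illustration only).
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe spread across three data drives plus one parity drive.
data = [b"IMG_0001", b"IMG_0002", b"IMG_0003"]
parity = xor_blocks(data)

# Drive 1 fails outright: its block is rebuilt from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]  # clean single-drive failure -> full recovery

# But if a "surviving" block arrives corrupted (e.g. mangled on the SCSI bus),
# the rebuild produces garbage -- parity can't tell which copy was wrong.
corrupted_survivor = b"IMG_XXXX"
bad_rebuild = xor_blocks([data[0], corrupted_survivor, parity])
assert bad_rebuild != data[1]

That is essentially what happened in the incident above: the parity arithmetic was done on blocks that had already been mangled on the SCSI bus, so the "reconstructed" data was corrupt and the wrong drive ended up being blamed.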
Title: RAID 5 Crash
Post by: Roy on August 04, 2007, 04:01:15 pm
Thanks for posting that.

I have joined in on more than one thread about backup, pointing out that a backup strategy needs to cover more than the failure of a hard drive. It should account for risks such as fire, theft, natural disaster, user error, and equipment failure (such as a RAID controller).

But boys love their toys, and the conversation usually devolves back to who provides the best RAID for the least cost.

I suspect that most photographers using RAID systems don't need them and have a false sense of security. Scheduled backups to off-line storage plus regularly refreshed off-site copies aren't sexy and take a bit of work, but they are far more effective than RAID.
Title: RAID 5 Crash
Post by: englishm on August 04, 2007, 04:46:49 pm
What Roy said!

RAID should never, ever be construed as a stand-in for a good backup strategy.  I have seen all flavours of RAID, even RAID-1 (mirror), fail.  I've seen a single drive in a mirrored pair fail in such a way that it took down the whole array, requiring a re-install and restore.  Nothing about hard drives is bulletproof.

At the moment I update two backup copies to external drives using scripted Robocopy routines that wake themselves up at 3am every morning.  When I'm away, one of these goes out to a detached garage.  Next up is a CAT-5/6 line and a NAS box out in the same garage so I have a fresh off-site backup every night.  Am I paranoid?  Yes! But am I paranoid enough?
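
For anyone wanting to copy that approach, here is a rough sketch of what such a scheduled mirror routine might look like. The source path, drive letters and log location are invented for illustration, and this is not englishm's actual script; it assumes a Windows machine with Robocopy and Python available, with the script scheduled for 3am via Task Scheduler. The Robocopy switches shown are the standard mirroring and retry options.

# Rough sketch of a nightly mirror to two external drives (illustrative
# paths only).  Robocopy's /MIR switch mirrors the tree including deletions;
# /R and /W keep a flaky drive from stalling the job all night.
import subprocess
from datetime import date

SOURCE = r"D:\Photos"                  # hypothetical working library
TARGETS = [r"F:\Backup\Photos",        # external drive #1
           r"G:\Backup\Photos"]        # external drive #2 (rotates off-site)
LOG = rf"D:\Logs\backup-{date.today():%Y%m%d}.log"

for target in TARGETS:
    result = subprocess.run(
        ["robocopy", SOURCE, target, "/MIR", "/R:2", "/W:5", "/NP", f"/LOG+:{LOG}"],
        check=False,   # Robocopy exit codes 0-7 all indicate success
    )
    if result.returncode >= 8:
        print(f"Mirror to {target} reported errors; check {LOG}")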