Great read, thanks
Henrik
PS: Have already experienced it on a small scale, and what a great relief it is to find that it's up and running again after a reboot. Obviously that raises the next question: replace, or continue? I have since replaced all the disks with flash drives, and I haven't seen it again. I still have the drives and will probably put them to use somewhere with redundancy :-)
thanks
Agreed on the linked reference article.
I come across anywhere from two to a dozen or so failed drives per year, give or take. Sometimes moving the drive to a test environment where it doesn’t get as hot will appear to solve the issue; sometimes a reboot will, and often neither will.
I have to wonder whether the tests they did in the article addressed relocating the suspect drive to a cooler environment, or to one that didn’t vibrate as much. I also wonder whether they let a drive run for weeks after an unexpected failure. These variables can have a huge impact on reliability studies, and the article didn’t mention them (or if it did, I missed it).
But anywho, I agree that once a drive shows itself as faulty in a production environment, I would refuse to return it to production unless I’d established with certainty that something other than the drive was the culprit. I suppose there is a price point at which it’s worth using a suspected flaky drive, but it would be a small dollar value. Drives don’t cost enough to justify jeopardizing even an hour of time for five or more people who rely on the drive. Doing so where 20 or more people use the drive amounts to stupid management.
Of course, some business IT monkeys don’t even bother to label drives with the date they were placed in service, but I digress.