Topic: False Disk Failures (Read 6440 times)

John.Murray · « **on:** March 22, 2013, 03:05:16 pm »

Fascinating read:
http://blog.lsi.com/what-is-false-disk-failure-and-why-is-it-a-problem/

In a nutshell, of the reported disk failures in a datacenter, nearly 50% turn out to be false failures. The SAS vs. SATA statistics of an actual data center are also very interesting.

francois · « **Reply #1 on:** March 23, 2013, 06:38:31 am »

Quote from: John.Murray on March 22, 2013, 03:05:16 pm

Fascinating read:
http://blog.lsi.com/what-is-false-disk-failure-and-why-is-it-a-problem/

In a nutshell, of the reported disk failures in a datacenter, nearly 50% turn out to be false failures. The SAS vs. SATA statistics of an actual data center are also very interesting.

Thanks for sharing this article. It is very interesting.

tived · « **Reply #2 on:** March 23, 2013, 10:13:36 pm »

Great read, thanks

Henrik

PS: I have already experienced it on a small scale, and what a great relief it is to find that it's up and running again after a reboot. Obviously that raises the next question: to replace or to continue? I have now replaced all the disks with flash drives, and I haven't seen it since. I still have the drives and will probably still put them to use somewhere with redundancy :-)

thanks

John.Murray · « **Reply #3 on:** March 24, 2013, 01:55:00 am »

Henrik:

There are some excellent manufacturer-supplied tools to verify disk health; for my personal use, I have no problem returning a disk to service.
http://www.seagate.com/support/downloads/seatools/
http://support.wdc.com/product/download.asp?groupid=606&sid=3

Professionally, there is really no choice - replace the physical drive, rebuild and move on...

Ellis Vener · « **Reply #4 on:** March 24, 2013, 08:49:06 pm »

Thanks John Murray.

Justan · « **Reply #5 on:** March 25, 2013, 01:28:38 am »

Quote from: tived on March 23, 2013, 10:13:36 pm

Great read, thanks

Henrik

PS: I have already experienced it on a small scale, and what a great relief it is to find that it's up and running again after a reboot. Obviously that raises the next question: to replace or to continue? I have now replaced all the disks with flash drives, and I haven't seen it since. I still have the drives and will probably still put them to use somewhere with redundancy :-)

thanks

Agreed on the linked reference article.

I come across anywhere from 2 to a dozen or so failed drives per year, give or take. Sometimes putting the drive in a test environment where it doesn’t get as hot will appear to solve the issue. Sometimes a reboot will solve the issue, and often neither will.

I have to wonder if the tests they did in the article addressed the issue of relocating the suspect drive to a cooler environment and/or one that did not vibrate as much. I also wonder if they let a drive run for weeks after an unexpected failure? These variables can have a huge impact on reliability studies, and the article didn’t mention it, or if it did, I missed that.

But anywho, I agree that once a drive shows itself as faulty in a production environment, I would refuse to return it to the production environment unless I’ve found to a certainty that something other than the drive was the culprit. I suppose there is a cost point where it is worth it to use a suspected flaky drive, but it would be a small dollar value. Drives don’t cost enough to justify jeopardizing even an hour of time for 5 or more people who rely on the drive. Where 20 or more people use the drive, doing so amounts to stupid management.
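To put rough numbers on that trade-off, here is a minimal sketch. The function names, the drive price, and the hourly rate are illustrative assumptions, not figures from this thread or the article; plug in your own.

```python
def downtime_cost(people: int, hours: float, hourly_rate: float) -> float:
    """Cost of lost time if the suspect drive takes its users offline."""
    return people * hours * hourly_rate


def worth_reusing(drive_price: float, people: int, hours: float,
                  hourly_rate: float) -> bool:
    """True only if the money saved by reusing the drive exceeds the
    cost of a single outage of the given length (a deliberately crude
    comparison that ignores outage probability and data-loss risk)."""
    return drive_price > downtime_cost(people, hours, hourly_rate)


# Hypothetical inputs: a $150 drive, 5 users, 1 hour lost, $40/hour each.
lost = downtime_cost(people=5, hours=1.0, hourly_rate=40.0)
reuse = worth_reusing(drive_price=150.0, people=5, hours=1.0, hourly_rate=40.0)
print(f"Lost time: ${lost:.2f}; reuse the drive? {reuse}")
```

With those assumed inputs a single one-hour outage already costs more than the replacement drive, and the gap only widens as the number of affected people grows.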

Of course some business IT monkeys don’t even bother to label drives with the date they were placed in service, but I digress.

John.Murray · « **Reply #6 on:** March 28, 2013, 01:10:05 am »

Quote from: Justan on March 25, 2013, 01:28:38 am

I come across anywhere from 2 to a dozen or so failed drives per year, give or take. Sometimes putting the drive in a test environment where it doesn’t get as hot will appear to solve the issue. Sometimes a reboot will solve the issue, and often neither will.

I have to wonder if the tests they did in the article addressed the issue of relocating the suspect drive to a cooler environment and/or one that did not vibrate as much. I also wonder if they let a drive run for weeks after an unexpected failure? These variables can have a huge impact on reliability studies, and the article didn’t mention it, or if it did, I missed that.

The article simply says what it says. What I find fascinating is a data center agreeing to share specific statistics with the author and allowing them to be published. Remember, the data center referred to has ~2 million spindles.....

What bothers me the most is the idea of a $10-15 controller having a "glitch" that ends up failing the array it's a member of. As the author states, this is completely unacceptable in other controllers such as automotive & industrial EMC's. Even given that, his speculation regarding accelerator pedal malfunctions in a handful of Toyotas is probably spot-on....

As far as testing, you'll usually see Kesender test units, or a custom equivalent (we have the one linked). Again, we *never* return failed units into production, and I personally don't know anyone that does. Besides testing, we use our rig to securely erase failed / retired drives......

robo60 · « **Reply #7 on:** October 31, 2013, 06:23:01 pm »

Quote from: John.Murray on March 28, 2013, 01:10:05 am

The article simply says what it says. What I find fascinating is a data center agreeing to share specific statistics with the author and allowing them to be published. Remember, the data center referred to has ~2 million spindles.....

What bothers me the most is the idea of a $10-15 controller having a "glitch" that ends up failing the array it's a member of. As the author states, this is completely unacceptable in other controllers such as automotive & industrial EMC's. Even given that, his speculation regarding accelerator pedal malfunctions in a handful of Toyotas is probably spot-on....

As far as testing, you'll usually see Kesender test units, or a custom equivalent (we have the one linked). Again, we *never* return failed units into production, and I personally don't know anyone that does. Besides testing, we use our rig to securely erase failed / retired drives......

Hi everyone. I am an occasional reader of this forum and web site. I wrote the blog post referenced, and it's a pleasure to see you guys discussing it.

A few things as follow-up. 1) It turns out I was pretty much on the money with Toyota's acceleration problems. The transcripts from the lawsuit released over the last week pretty much confirm it. 2) You're absolutely right - typically no enterprise will return drives to production. There are some insidious aspects to that. A drive controller chip costs ~$4-$7. A disk costs an OEM ~$40 to $200 depending on "quality", performance, and volume. But they'll often charge you $400-$600 for the drive. What's more, they charge you a big annual service fee which includes the drive replacements. They make a lot of money on that service, so they like you returning drives. It's a way to keep a revenue stream.

In contrast, it turns out array vendors like EMC and NetApp have been managing this stuff behind the scenes for years. They do resets. Rumor has it they even do a full reformat - essentially remanufacturing the drive in place. And of course the guys who really have financial incentives to get it right - many with millions of drives, and Google with probably 10 million - keep the drives in service.

On heat and vibration - this is usually not a factor if servers are designed properly, and in the datacenters where this information was gathered they are very well designed. So those should not be a factor at all.

Rob
