E. Pinheiro, W.-D. Weber, and L. A. Barroso. Failure Trends in a Large Disk Drive Population. Proceedings of the 5th USENIX Conference on File and Storage Technologies, page 2. Berkeley, CA, USA, USENIX Association, (2007)
Abstract
It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or from relatively modest sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis. We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity. Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
%0 Conference Paper
%1 Pinheiro:2007:FTL:1267903.1267905
%A Pinheiro, Eduardo
%A Weber, Wolf-Dietrich
%A Barroso, Luiz André
%B Proceedings of the 5th USENIX Conference on File and Storage Technologies
%C Berkeley, CA, USA
%D 2007
%I USENIX Association
%K disk drive failure google hard hdd paper storage test
%P 2--2
%T Failure Trends in a Large Disk Drive Population
%U http://dl.acm.org/citation.cfm?id=1267903.1267905
%X It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or from relatively modest sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis. We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity. Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
@inproceedings{Pinheiro:2007:FTL:1267903.1267905,
abstract = {It is estimated that over 90\% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or from relatively modest sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis. We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity. Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.},
acmid = {1267905},
added-at = {2016-03-24T13:52:17.000+0100},
address = {Berkeley, CA, USA},
author = {Pinheiro, Eduardo and Weber, Wolf-Dietrich and Barroso, Luiz Andr{\'e}},
biburl = {https://www.bibsonomy.org/bibtex/2e11b761a3c78ab8242efff0fb0aa3182/jil},
booktitle = {Proceedings of the 5th USENIX Conference on File and Storage Technologies},
interhash = {633f3bf21cb0bff12068126751821907},
intrahash = {e11b761a3c78ab8242efff0fb0aa3182},
keywords = {disk drive failure google hard hdd paper storage test},
location = {San Jose, CA},
numpages = {1},
pages = {2--2},
publisher = {USENIX Association},
series = {FAST '07},
timestamp = {2016-03-24T13:57:27.000+0100},
title = {Failure Trends in a Large Disk Drive Population},
url = {http://dl.acm.org/citation.cfm?id=1267903.1267905},
year = 2007
}