Missing Data Patterns: From Theory to an Application in the Steel Industry

Proceedings of the 33rd International Conference on Scientific and Statistical Database Management, pages 214–219. New York, NY, USA, Association for Computing Machinery, (Aug 11, 2021)
DOI: 10.1145/3468791.3468841

Abstract

Missing data (MD) is a prevalent problem and can negatively affect the trustworthiness of data analysis. In industrial use cases, faulty sensors or errors during data integration are common causes of systematically missing values. The majority of MD research deals with imputation, i.e., the replacement of missing values with “best guesses”. Most imputation methods require missing values to occur independently, which is rarely the case in industry. Thus, it is necessary to identify missing data patterns (i.e., systematically missing values) prior to imputation (1) to understand the cause of the missingness, (2) to gain deeper insight into the data, and (3) to choose the proper imputation technique. However, the literature describes a wide variety of MD patterns without a common formalization. In this paper, we introduce the first formal definition of MD patterns. Building on this theory, we developed a systematic approach to automatically detecting MD patterns in industrial data. The approach has been developed in cooperation with voestalpine Stahl GmbH, where we applied it to real-world data from the steel industry and demonstrated its efficacy with a simulation study.
