Missing Data Imputation for Galaxy Redshift Estimation

Abstract

Astronomical data is full of holes. While there are many reasons for this missing data, some of it is missing at random, caused by things like data corruption or unfavourable observing conditions. We test some simple data imputation methods (Mean, Median, Minimum, Maximum and k-Nearest Neighbours (kNN)), as well as two more complex methods (Multivariate Imputation by Chained Equations (MICE) and Generative Adversarial Imputation Network (GAIN)), against data in which increasing fractions of values are randomly set to missing. We then use the imputed datasets to estimate the redshift of the galaxies, using the kNN and Random Forest machine learning (ML) techniques. We find that the MICE algorithm provides the lowest Root Mean Square Error and consequently the lowest prediction error, with the GAIN algorithm the next best.
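The masking-and-imputation experiment described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the synthetic Gaussian table stands in for a real catalogue, the helper names (`mask_at_random`, `impute_columns`, `rmse_on_missing`) are hypothetical, and scoring the RMSE only on the entries that were masked is an assumption about the evaluation. Only the simple statistic-based methods are covered; kNN, MICE and GAIN are omitted.

```python
import math
import random
import statistics


def mask_at_random(rows, fraction, rng):
    """Return a copy of rows with roughly `fraction` of entries set to None."""
    return [[None if rng.random() < fraction else v for v in row] for row in rows]


def impute_columns(rows, strategy):
    """Fill None entries column by column with a simple statistic.

    Assumes every column keeps at least one observed value.
    """
    stat = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "min": min,
        "max": max,
    }[strategy]
    fills = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        fills.append(stat(observed))
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def rmse_on_missing(truth, masked, imputed):
    """Root mean square error over the entries that were masked out."""
    errors = [
        (t - i) ** 2
        for t_row, m_row, i_row in zip(truth, masked, imputed)
        for t, m, i in zip(t_row, m_row, i_row)
        if m is None
    ]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0


rng = random.Random(0)
# Synthetic stand-in for a table of galaxy features (illustration only).
truth = [[rng.gauss(20.0, 1.0) for _ in range(4)] for _ in range(200)]

fractions = (0.05, 0.1, 0.2, 0.3)
strategies = ("mean", "median", "min", "max")
results = {}
for fraction in fractions:
    masked = mask_at_random(truth, fraction, rng)
    for strategy in strategies:
        imputed = impute_columns(masked, strategy)
        results[(fraction, strategy)] = rmse_on_missing(truth, masked, imputed)

for (fraction, strategy), rmse in sorted(results.items()):
    print(f"missing={fraction:.2f} {strategy:>6}: RMSE={rmse:.3f}")
```

In a fuller pipeline, each imputed table would then be passed to a redshift regressor (kNN or Random Forest in the abstract) so that imputation error and prediction error can be compared.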

BibTeX key: luken2021missing
entry type: misc
year: 2021
url: http://arxiv.org/abs/2111.13806
note: cite arxiv:2111.13806. Comment: 9 pages, accepted at the Machine Learning for Physical Sciences workshop at NeurIPS 2021
