Abstract
Reproducibility is a fundamental part of scientific progress. Compared to other scientific fields, the computational sciences are privileged: experimental setups can be preserved with ease, and regression experiments allow computational results to be validated by bitwise similarity. When information access systems are evaluated, the system users are often part of the experiments, be it explicitly in user studies or implicitly through evaluation measures. System-oriented Information Retrieval (IR) experiments are usually evaluated with effectiveness measures over batches of multiple queries. Whether an IR system has been successfully reproduced is then often determined by how well it approximates the averaged effectiveness of the original system that is being reproduced. Earlier work suggests that this naïve comparison of average effectiveness hides differences that exist between the original and reproduced systems. Most importantly, such differences can affect the recipients of the retrieval results, i.e., the system users. To this end, this work sheds light on the implications for users that may be neglected when a system-oriented IR experiment is prematurely considered reproduced. Based on simulated reimplementations with effectiveness comparable to the reference system, we show which differences are hidden behind averaged effectiveness scores. We discuss possible future directions and consider how these implications could be addressed with user simulations.
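To make the central claim concrete, the following is a minimal illustrative sketch (not the paper's actual experimental setup, and the per-topic scores are invented): two systems can have identical averaged effectiveness while differing substantially on individual topics, which is precisely the difference an individual user would experience.

```python
# Illustrative sketch (hypothetical data): two systems with identical mean
# effectiveness but different per-topic behaviour.
import statistics

# Hypothetical per-topic nDCG scores for an "original" and a "reimplemented" system.
original      = [0.80, 0.60, 0.40, 0.20, 0.50]
reimplemented = [0.50, 0.50, 0.50, 0.50, 0.50]

mean_orig = statistics.mean(original)       # 0.50
mean_repro = statistics.mean(reimplemented)  # 0.50
print(f"mean nDCG original:      {mean_orig:.2f}")
print(f"mean nDCG reimplemented: {mean_repro:.2f}")

# The averages match, yet the per-topic differences are substantial.
per_topic_delta = [abs(o - r) for o, r in zip(original, reimplemented)]
rmse = (sum(d * d for d in per_topic_delta) / len(per_topic_delta)) ** 0.5
print(f"max per-topic |delta|:   {max(per_topic_delta):.2f}")
print(f"per-topic RMSE:          {rmse:.2f}")
```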