Inproceedings

Toward Evaluating the Reproducibility of Information Retrieval Systems with Simulated Users

Proceedings of the 2nd ACM Conference on Reproducibility and Replicability, pages 25–29. New York, NY, USA, Association for Computing Machinery, (2024)
DOI: 10.1145/3641525.3663619

Abstract

Reproducibility is a fundamental part of scientific progress. Compared to other scientific fields, the computational sciences are privileged, as experimental setups can be preserved with ease and regression experiments allow computational results to be validated by bitwise similarity. When evaluating information access systems, the system users are often considered in the experiments, be it explicitly as part of user studies or implicitly as part of evaluation measures. Usually, system-oriented Information Retrieval (IR) experiments are evaluated with effectiveness measures over batches of multiple queries. Successful reproduction of an IR system is then often determined by how well it approximates the averaged effectiveness of the original system, i.e., the system being reproduced. Earlier work suggests that this naïve comparison of average effectiveness hides differences that exist between the original and reproduced systems. Most importantly, such differences can affect the recipients of the retrieval results, i.e., the system users. To this end, this work sheds light on which implications for users may be neglected when a system-oriented IR experiment is prematurely considered reproduced. Based on simulated reimplementations with effectiveness comparable to that of the reference system, we show which differences are hidden behind averaged effectiveness scores. We discuss possible future directions and consider how these implications could be addressed with user simulations.
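
To illustrate the point about averaged effectiveness, the following minimal Python sketch (not taken from the paper; the per-topic nDCG values are invented for illustration) shows two runs with identical mean effectiveness that nevertheless differ on individual topics. A per-topic comparison, here the root mean square error between the two score vectors, exposes the difference that the averages hide.

    from statistics import mean
    from math import sqrt

    # hypothetical per-topic nDCG scores for the original and the reproduced run
    original   = [0.42, 0.65, 0.31, 0.78, 0.54]
    reproduced = [0.52, 0.55, 0.41, 0.68, 0.54]

    # the averages are identical, so a naive comparison would call this a
    # successful reproduction
    assert abs(mean(original) - mean(reproduced)) < 1e-9

    # a per-topic view reveals that the systems behave differently on
    # individual topics despite the equal means
    rmse = sqrt(mean((o - r) ** 2 for o, r in zip(original, reproduced)))
    print(f"mean original   = {mean(original):.3f}")
    print(f"mean reproduced = {mean(reproduced):.3f}")
    print(f"per-topic RMSE  = {rmse:.3f}")  # > 0 despite equal means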

Tags

Users

  • @irgroup_thkoeln
  • @dblp
