
Searching for the Ground Truth: Assessing the Similarity of Benchmarking Runs

Companion of the 2023 ACM/SPEC International Conference on Performance Engineering, pages 95–99. Association for Computing Machinery, New York, NY, USA, 2023.

Abstract

Stable and repeatable measurements are essential for comparing the performance of different systems or applications, and benchmarks are used to ensure accuracy and replication. However, if the corresponding measurements are not stable and repeatable, wrong conclusions can be drawn. To facilitate the task of determining whether measurements are similar, we use a data set of 586 micro-benchmarks to (i) analyze the data set itself, (ii) examine our previous approach, and (iii) propose and evaluate a heuristic. To evaluate the different approaches, we perform a peer review to assess the dissimilarity of the benchmark runs. Our results show that this task is challenging even for humans and that our heuristic exhibits a sensitivity of 92%.
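
The abstract leaves the heuristic itself to the paper; as a purely illustrative sketch of how the (dis)similarity of two benchmarking runs might be checked, the snippet below applies a standard two-sample Kolmogorov–Smirnov test to two sets of measurements. The function name, the significance threshold, and the synthetic latency data are assumptions for illustration only, not the approach evaluated in the paper.

    # Illustrative only: a generic two-sample check, not the paper's heuristic.
    import numpy as np
    from scipy.stats import ks_2samp

    def runs_look_similar(run_a, run_b, alpha=0.01):
        """Treat two runs as similar when a KS test does not reject equal distributions."""
        _statistic, p_value = ks_2samp(run_a, run_b)
        return p_value >= alpha

    rng = np.random.default_rng(seed=42)
    baseline = rng.normal(loc=100.0, scale=2.0, size=1000)   # e.g., latencies in ms
    repeat   = rng.normal(loc=100.0, scale=2.0, size=1000)   # repeated run, same setup
    drifted  = rng.normal(loc=110.0, scale=2.0, size=1000)   # run with a shifted mean

    print(runs_look_similar(baseline, repeat))   # expected: True
    print(runs_look_similar(baseline, drifted))  # expected: False

In this framing, the sensitivity reported in the abstract is the fraction of truly dissimilar run pairs that a detection approach correctly flags, i.e. true positives / (true positives + false negatives).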


Tags

community