Dagstuhl Seminar 24211
Evaluation Perspectives of Recommender Systems: Driving Research and Education
(May 20 – May 24, 2024)
- Christine Bauer (Paris Lodron Universität Salzburg, AT)
- Alan Said (University of Gothenburg, SE)
- Eva Zangerle (Universität Innsbruck, AT)
Recommender systems (RS) have become essential tools in everyday life, efficiently helping users discover relevant, useful, and interesting items such as music tracks, movies, or social matches. RS identify the interests and preferences of individual users through explicit input or implicit information inferred from their interactions with the systems and tailor content and recommendations accordingly [13, 16].
Evaluation of RS requires attention at every phase of the system life cycle, including design, development, and continuous improvement during operation. High-quality evaluation is crucial for a system’s success in practice. This evaluation can focus on the core performance of the system or encompass the entire context in which it is used [3, 7, 8, 10]. Research typically differentiates between system-centric and user-centric evaluation. System-centric evaluation examines algorithmic aspects, such as the predictive accuracy of recommender algorithms. In contrast, user-centric evaluation assesses the user’s perspective, including perceived quality and user experience. Comprehensive evaluation must address both aspects since high predictive accuracy does not necessarily meet user expectations [12].
Evaluation, with all its challenges, is currently a highly active topic. The PERSPECTIVES workshops (organized at ACM RecSys 2021-2023 [14, 15, 11] and co-organized by this seminar’s organizers) were highly popular and attracted many participants. This interest is further evidenced by the special issue on evaluation in ACM Transactions on Recommender Systems [1]. Recent calls for more impactful RS research [5, 6, 12, 9] highlight that current evaluation practices are too narrow and may not be practically relevant. The authors of [4] advocate for more nuanced evaluation methods that meet industry demands. The authors of [9] argue that current practices are insufficient as they often overlook side effects or longitudinal impacts. A recent systematic literature study further reveals that current evaluation methods are limited in experiment design, dataset choice, and evaluation metrics [2].
This seminar on evaluation perspectives of RS brought together researchers and practitioners from diverse backgrounds. It aimed to discuss current challenges and advance the ongoing discussion on RS evaluation. The seminar began with eight presentations addressing current challenges in evaluation. These talks initiated the general discussion and helped form groups around these topics. As a result, five working groups were established, each focusing on one of the following areas:
Working Group 1: Theory of Evaluation
This group focused on the theoretical foundations of RS evaluation. They began by identifying the shortcomings of current evaluation practices and linking these issues to underlying theoretical principles. Key challenges discussed included the selection and configuration of evaluation metrics and the reporting of evaluation results. Section 4.1 of the full report outlines the challenges and theoretical perspectives identified in this group.
Working Group 2: Fairness Evaluation
This group focused on exploring paradigms and practices for evaluating the fairness of RS. Given the specific nature of fairness metrics and evaluation requirements for different applications, fairness problems, and goals, the group proposed “best meta-practices”, a set of approaches to planning, executing, and communicating rigorous fairness evaluation scenarios. The group’s outcome is documented in Section 4.2 of the full report.
Working Group 3: Best-Practices for Offline Evaluations of Recommender Systems
This working group addressed the topic of offline evaluation, with a specific focus on identifying problems and best practices for this evaluation method. They concentrated on pinpointing the primary challenges related to reproducibility and methodology. Subsequently, they provided guidelines to address these challenges from various perspectives, including those of paper authors, reviewers, editors, and program chairs, as summarized in Section 4.3 of the full report.
Working Group 4: Multistakeholder and Multimethod Evaluation
This group examined the challenges and complexities in evaluating multistakeholder scenarios, discussing the key aspects that must be considered in such a nuanced environment. Additionally, they explored the transition from theoretical evaluation frameworks to practical implementation. Section 4.4 of the full report outlines this work.
Working Group 5: Evaluating the Long-Term Impact of Recommender Systems
This working group concentrated on the long-term perspective and impact of RS and their evaluation. This included developing suitable long-term measures and conducting social and behavioral research to understand and facilitate aspects such as human behavior, long-term stakeholder goals, and corresponding metrics. Additionally, the group examined practical challenges in evaluating the long-term aspects and impact of RS. This work is presented in Section 4.5 of the full report.
Evaluation is an important cornerstone in the process of researching, developing, and deploying recommender systems. This Dagstuhl Seminar aims to shed light on the different and potentially diverging or contradictory perspectives on the evaluation of recommender systems. Building on the discussions and outcomes of the PERSPECTIVES workshop series held at ACM RecSys 2021-2023, the seminar will bring together academia and industry to critically reflect on the state of the evaluation of recommender systems and create a setting for development and growth.
While the field of recommender systems is largely applied, the evaluation of these systems builds on and intersects with theories from information retrieval, machine learning, and human-computer interaction. Historically, the theories and evaluation approaches in these fields have been very different. Thoroughly evaluating recommender systems requires integrating all of these perspectives. Hence, this seminar will bring together experts from these fields and serve as a vehicle for discussing and developing the state of the art and practice of evaluating recommender systems. The seminar will lay the groundwork for developing recommender systems evaluation metrics, methods, and practices through collaborations and discussions between participants from diverse backgrounds, e.g., academic and industry researchers and industry practitioners, both senior and junior. We emphasize the importance of getting and keeping the big picture of a recommender system’s performance in its context of use, for which it is essential to incorporate both the technical and the human element.
We will set the basis for the next generation of researchers, equipped to evaluate and advance recommender systems thoroughly.

- Gediminas Adomavicius (University of Minnesota - Minneapolis, US)
- Vito Walter Anelli (Politecnico di Bari, IT)
- Andrea Barraza-Urbina (Grubhub - New York, US)
- Christine Bauer (Paris Lodron Universität Salzburg, AT)
- Joeran Beel (Universität Siegen, DE)
- Alejandro Bellogín (Autonomous University of Madrid, ES)
- Toine Bogers (IT University of Copenhagen, DK)
- Peter Brusilovsky (University of Pittsburgh, US)
- Robin Burke (University of Colorado - Boulder, US)
- Wanling Cai (Trinity College - Dublin, IE & Lero, the Science Foundation Ireland - Limerick, IE)
- Tommaso Di Noia (Politecnico di Bari, IT)
- Michael D. Ekstrand (Drexel University - Philadelphia, US)
- Kim Falk (Copenhagen, DK)
- Andres Ferraro (Pandora, US)
- Bart Goethals (University of Antwerp, BE)
- Neil Hurley (University College Dublin, IE)
- Dietmar Jannach (Alpen-Adria-Universität Klagenfurt, AT)
- Olivier Jeunen (ShareChat - London, GB)
- Joseph Konstan (University of Minnesota - Minneapolis, US)
- Dominik Kowald (Know Center - Graz, AT & TU Graz, AT)
- Maria Maistro (University of Copenhagen, DK)
- Lien Michiels (University of Antwerp, BE)
- Julia Neidhardt (TU Wien, AT)
- Özlem Özgöbek (NTNU - Trondheim, NO)
- Denis Parra (PUC - Santiago de Chile, CL)
- Sole Pera (TU Delft, NL)
- Lorenzo Porcaro (EC Joint Research Centre - Ispra, IT)
- Alan Said (University of Gothenburg, SE)
- Rodrygo Santos (Federal University of Minas Gerais-Belo Horizonte, BR)
- Guy Shani (Ben Gurion University - Beer Sheva, IL)
- Manel Slokom (TU Delft, NL)
- Annelien Smets (Vrije Universiteit Brussel, BE)
- Barry Smyth (University College Dublin, IE)
- Marko Tkalcic (University of Primorska, SI)
- Helma Torkamaan (TU Delft, NL)
- Alexander Tuzhilin (New York University, US)
- Tobias Vente (Universität Siegen, DE)
- Robin Verachtert (DPG Media - Antwerpen, BE)
- Lukas Wegmeth (Universität Siegen, DE)
- Martijn Willemsen (TU Eindhoven, NL)
- Jürgen Ziegler (Universität Duisburg-Essen, DE)
