Dagstuhl Perspectives Workshop 24352

Conversational Agents: A Framework for Evaluation (CAFE)

(Aug 25 – Aug 30, 2024)


Permalink
Please use the following short URL to reference this page: https://www.dagstuhl.de/24352

Organizers

  • Christine Bauer (Paris Lodron Universität Salzburg, AT)
  • Li Chen (Hong Kong Baptist University, HK)
  • Nicola Ferro (University of Padova, IT)
  • Norbert Fuhr (Universität Duisburg-Essen, DE)

Summary

In this Dagstuhl Perspectives Workshop, a general model for the evaluation of CONversational Information ACcess (CONIAC) systems was developed: the Conversational Agents Framework for Evaluation (CAFE).

The framework starts from the assumption that a CONIAC system will be able to (i) interact with users more naturally and seamlessly, (ii) guide a user through the process of refining and clarifying their needs, (iii) aid decision-making by providing personalized recommendations and information while being able to explain them, and (iv) generate, retrieve, and summarize relevant information.

CAFE distinguishes six major elements of an evaluation design:

  • Stakeholder goals. Stakeholders of a CONIAC system may have diverse goals that might or might not be directly accessible to system designers or evaluators and thus often have to be inferred implicitly during evaluation. A CONIAC system might also serve multiple goals at once, ranging from end users with (in-)direct information needs to platforms deploying CONIAC systems that are interested in content usage, user engagement, impression generation, and user retention, to name a few.
  • Tasks. CONIAC involves tasks characterized by an information need (which may be specific or rather vague), human involvement, goal orientation, and mixed initiative between the user and the system. While some tasks and information needs may benefit from introducing a conversationally competent system, others may not, depending on the complexity of the task or need.
  • User aspects. When developing an evaluation framework for CONIAC systems, it is crucial to consider user-specific aspects, such as preferences, specialized needs, expertise types, and background characteristics, which may make conversational systems more beneficial than non-conversational alternatives.
  • Criteria. The scope of evaluation can range from single-turn interactions to entire conversations and long-term system usage, each requiring different criteria for assessment. Additionally, the temporal dimension, which examines how the system's performance changes over time, is a critical factor that can intersect with both stationary and dynamic properties. Criteria may be system-centric, user-centric, or both. The former concern hardware and software aspects such as efficiency, accuracy, comprehensiveness, and verifiability. For the latter, we can distinguish between conversation-oriented (e.g., adaptability, coherence, fluency), content-oriented (e.g., continuance, controllability, perceived accuracy, understandability), and consequences-oriented measures (e.g., addiction, benevolence, decision quality, confidence, trust).
  • Methodology. In addition to the standard distinction between user-focused and system-focused methodologies, our evaluation framework also categorizes evaluation methodologies according to the employed time model, a dimension especially relevant for CONIAC. This dimension ranges from stationary methodologies, such as single-interaction experiments, to methodologies such as controlled lab studies that allow for continuous measurements, e.g., physiological ones.
  • Measures. Finally, we allow for measures that typically focus on the system's ability to provide accurate, relevant, and timely information during interactions. Measures include objective measures of effectiveness and subjective notions such as perceived effectiveness or user satisfaction (e.g., self-reported satisfaction). By incorporating both objective and subjective (self-reported) measures, evaluators can better understand the system's strengths and areas for improvement; the short sketch after this list illustrates the distinction.
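
To make the objective/subjective distinction concrete, here is a minimal Python sketch. It is not part of the workshop report: the function names, the sample data, and the 1-5 rating scale are purely illustrative assumptions. It contrasts an objective per-turn success rate with a subjective score aggregated from self-reported satisfaction ratings.

def success_rate(turn_judgments: list[bool]) -> float:
    """Objective measure: fraction of system turns judged relevant or correct
    by an assessor (or an automatic judge)."""
    return sum(turn_judgments) / len(turn_judgments) if turn_judgments else 0.0

def mean_satisfaction(ratings: list[int], scale_max: int = 5) -> float:
    """Subjective measure: mean of self-reported satisfaction ratings,
    normalized to [0, 1] so it can be read next to objective scores."""
    return sum(ratings) / (len(ratings) * scale_max) if ratings else 0.0

# One hypothetical conversation: per-turn relevance judgments plus a
# post-session questionnaire on a 1-5 Likert scale.
judgments = [True, True, False, True]
questionnaire = [4, 5, 3]

print(f"objective success rate:  {success_rate(judgments):.2f}")
print(f"subjective satisfaction: {mean_satisfaction(questionnaire):.2f}")

Neither number replaces the other: the first reflects what the system actually delivered, the second how users experienced it, and CAFE treats both as legitimate measures depending on the chosen criteria.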

When designing an evaluation, the first step is to identify the stakeholders and the goals that need to be addressed. Based on these goals, the user tasks to be studied in the evaluation have to be defined, as well as the user aspects to be considered. The central element of an evaluation is the set of criteria to be focused on, which can be determined from the stakeholder goals. The chosen criteria restrict the range of possible evaluation methods (e.g., any user-centric criterion requires the involvement of actual users in the evaluation procedure). Finally, an appropriate measure has to be defined for any quantitative criterion.
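
As a rough illustration of how such a design might be written down, the following sketch records the six CAFE elements as a simple Python data structure in the order described above. It is a hypothetical example, not an artifact of the workshop; all class, field, and value names are invented.

from dataclasses import dataclass

# Hypothetical CAFE-style evaluation design; the field names mirror the six
# elements of the framework, and the values below are invented examples.
@dataclass
class EvaluationDesign:
    stakeholder_goals: dict[str, list[str]]  # stakeholder -> goals to address
    tasks: list[str]                         # user tasks studied in the evaluation
    user_aspects: list[str]                  # user characteristics to consider
    criteria: list[str]                      # system- and/or user-centric criteria
    methodology: str                         # including the employed time model
    measures: dict[str, str]                 # one measure per quantitative criterion

design = EvaluationDesign(
    stakeholder_goals={
        "end user": ["satisfy a vague information need"],
        "platform": ["user retention"],
    },
    tasks=["plan a multi-day trip with a conversational assistant"],
    user_aspects=["domain expertise", "prior experience with assistants"],
    criteria=["fluency", "decision quality", "user satisfaction"],
    methodology="controlled lab study (continuous measurements over a session)",
    measures={
        "fluency": "assessor rating per system turn",
        "decision quality": "task outcome score",
        "user satisfaction": "post-session questionnaire",
    },
)

# User-centric criteria such as "user satisfaction" imply that actual users
# must take part in the study, which already narrows the methodology choice.
print(design.criteria)

Writing the design down in this order makes the dependencies explicit: the goals drive the tasks and user aspects, the goals and tasks determine the criteria, the criteria constrain the methodology, and each quantitative criterion is paired with a measure.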

Copyright Christine Bauer, Li Chen, Nicola Ferro, and Norbert Fuhr

Motivation

Conversational Agents (CA) as frontends to Information Retrieval (IR) and Recommender Systems (RS) are becoming more popular in everyday life, with a wider range of users and usages. The latest developments in Large Language Models (LLMs) will have tremendous consequences, especially for the workplace and education. In this Dagstuhl Perspectives Workshop, we want to focus on the evaluation of these conversational systems, as appropriate methods are still missing. The quality of these systems is limited in terms of personalization, veracity and correctness, bias, transparency, trustworthiness, and understandability. Thus, evaluation methods must address these shortcomings. Furthermore, user- and usage-oriented aspects should become a more prominent and integral component of evaluations, as both the user population and the tasks these systems are used for become more heterogeneous. For this reason, the topic-centric view of relevance has to be extended to a broad range of facets that are important for the different usage scenarios. Therefore, suitable evaluation criteria have to be specified, which form the basis for defining appropriate measures. Most importantly, the range of evaluation methods must be revisited and extended, as popular methods like the Cranfield approach or crowdsourcing must be complemented by new evaluation methods and strategies specifically tailored to this new type of system.

In more detail, we will focus our discussion on several key open issues, among which are the following:

  • how to cross the borders of different areas, mainly Information Retrieval and Recommender Systems in our case, but also Natural Language Processing;
  • how to create experimental collections and evaluate Large Language Models in terms of their bias, explainability, veracity, correctness, and hallucination in the CA context;
  • how to incorporate user- and usage-oriented facets in order to understand how users' perceived conversation qualities (e.g., attentiveness, adaptability, understanding, and response quality) and perceived recommendation qualities (e.g., accuracy, novelty, interaction adequacy, and explanation) might interact with each other in a CA to affect user beliefs (e.g., perceived usefulness, perceived ease of use, transparency, user control, rapport, humanness), user attitudes (e.g., user satisfaction, trust), and behavioral intentions (e.g., intention to use);
  • how to measure information leakage and privacy, and how to ensure that a CA does not propagate sensitive information;
  • how to devise proper simulation approaches to support both the development and the evaluation of a CA, avoiding circularity (the techniques used for simulation are similar to those used for developing systems), ensuring reliability, and reducing the gap between offline measurements and online user evaluations;
  • how to evaluate to what extent answers/recommendations produced by a CA are appropriate, tailored to, and understandable for a specific audience, e.g., school kids, the general public, professionals, and people with (cognitive) disabilities.

Overall, the above questions call, as one possible output of the workshop, for envisioning a reference architecture for CA systems geared towards evaluation, which would allow the different areas to cooperate on common ground and to share a common roadmap for improving our understanding of CA systems and making them more effective.

Copyright Christine Bauer, Li Chen, Nicola Ferro, and Norbert Fuhr

Participants

  • Avishek Anand (TU Delft, NL) [dblp]
  • Christine Bauer (Paris Lodron Universität Salzburg, AT) [dblp]
  • Timo Breuer (TH Köln, DE) [dblp]
  • Li Chen (Hong Kong Baptist University, HK) [dblp]
  • Guglielmo Faggioli (University of Padova, IT) [dblp]
  • Nicola Ferro (University of Padova, IT) [dblp]
  • Ophir Frieder (Georgetown University - Washington, DC, US) [dblp]
  • Norbert Fuhr (Universität Duisburg-Essen, DE) [dblp]
  • Hideo Joho (University of Tsukuba - Ibaraki, JP) [dblp]
  • Jussi Karlgren (Silo Ai - Helsinki, FI) [dblp]
  • Johannes Kiesel (Bauhaus-Universität Weimar, DE) [dblp]
  • Bart Knijnenburg (Clemson University, US) [dblp]
  • Aldo Lipani (University College London, GB) [dblp]
  • Lien Michiels (University of Antwerp, BE) [dblp]
  • Andrea Papenmeier (University of Twente, NL) [dblp]
  • Sole Pera (TU Delft, NL) [dblp]
  • Mark Sanderson (RMIT University - Melbourne, AU) [dblp]
  • Scott Sanner (University of Toronto, CA) [dblp]
  • Benno Stein (Bauhaus-Universität Weimar, DE) [dblp]
  • Johanne Trippas (RMIT University - Melbourne, AU) [dblp]
  • Karin Verspoor (RMIT University - Melbourne, AU) [dblp]
  • Martijn Willemsen (Eindhoven University of Technology, NL & JADS, NL) [dblp]

Classification
  • Artificial Intelligence
  • Human-Computer Interaction
  • Information Retrieval

Keywords
  • Conversational Agents
  • Information Retrieval
  • Recommender Systems
  • Evaluation
  • User Interaction