Dagstuhl Seminar 24032
Representation, Provenance, and Explanations in Database Theory and Logic
( Jan 14 – Jan 19, 2024 )
Permalink
Organizers
- Pablo Barcelo (PUC - Santiago de Chile, CL)
- Pierre Bourhis (CNRS - CRIStAL, Lille, FR)
- Stefan Mengel (CNRS, CRIL - Lens, FR)
- Sudeepa Roy (Duke University - Durham, US)
Contact
- Andreas Dolzmann (for scientific matters)
- Jutka Gasiorowski (for administrative matters)
Shared Documents
- Dagstuhl Materials Page (Use personal credentials as created in DOOR to log in)
Schedule
Background and Research area
The Dagstuhl Seminar “Representation, Provenance, and Explanations in Database Theory and Logic” (24032) was broadly in Database Theory, where the goal is to formalize the theoretical underpinnings of databases and then analyze them with mathematical tools. One of the most fundamental problems in both database theory and systems is efficient query evaluation: given a database and a query, compute the answer to the query on the database. This question has a tight connection to logic, since it has been known for a long time that different fragments of first- or second-order logic can be seen as the core of practical query languages like SQL or Datalog. This seminar focused on three key aspects of query evaluations: representation, provenance, and explanations.
Representation. For large datasets, query results can be very large when they are materialized explicitly in the standard form. For efficient query processing and subsequent applications, it is important to represent the query answers in a compact fashion. One important form of representations in query evaluation is by circuits, which have a long history in complexity theory and AI and can be seen as part of the larger framework of knowledge compilation (Darwiche, Marquis, J. Artif. Intell. Res. 2002). Circuits were heavily discussed in several presentations in the seminar. The other aspect of representation that the seminar focused on was the field of enumeration algorithms and direct access. It first computes a data structure representing the query answers, and then gives an algorithm to extract one answer at a time from the data structure. In this problem, the complexity of the two parts is measured separately: the computation time of the data structure is called the preprocessing time and the time of the extraction of each answer is called the delay. Typically, the goal of such algorithms is to have a preprocessing time much smaller than the cost of the classical evaluation of the query and very small (ideally constant) delay.
Provenance. Data provenance in general refers to how the outputs of a query are generated from the inputs, with a broad goal to enable interpretability, trust, and reproducibility of the queries. A mathematical form of provenance that propagates annotations of inputs to the outputs, called provenance semirings, was proposed in a seminal work by Green et al. (PODS 2007). The most specialized case of Boolean semirings captures how an output tuple has been obtained from the inputs with joint usage (joins – translate to conjunctions ∨), and alternative usages (projections or unions – translate to disjunctions ∨). Such semirings can be used to understand compactly how outputs are generated from inputs, and have applications in query evaluation in probabilistic databases when realization of inputs tuples is uncertain (Dalvi-Suciu, JACM 2012), and in deletion propagation or view update, to understand how the outputs change if one or more inputs are deleted, without re-computing the query. There are more advanced semirings like tropical semirings that can capture shortest paths in graphs. Compact and efficient knowledge compilations of provenance circuits into ordered and free binary decision diagrams (OBDDs, FBDDs), and more generally as decomposable deterministic negation normal forms (d-DNNF) are also important questions in database theory with applications in probabilistic databases (Jha-Suciu, ICDT 2011; Beame et al., ACM Trans. Database Syst. 2017; Monet, PODS 2020).
Explanations. While provenance provides one approach to explaining query answers capturing how the query answers are generated, in many applications, other forms of insights as explanations are desired for understanding contributions of inputs, trends and anomalies in the outputs, and deciding next course of actions or recourse. Recently, explanations based on the widely known Shapley values from co-operative game theory have been used in database theory to measure the relevance of a certain database fact to a query answer (Deutch et al, SIGMOD 2022; Livshits et al., ICDT 2021), and to measure the relevance of inputs to the outcome of an ML classifier (Arenas et al., AAAI 2021). Complexity, applications, and algorithms for explanations by Shapley values were heavily discussed in the seminar. Since the naive computation of Shapley values is intractable as it includes a summation over exponentially many subsets, one of the main themes behind this investigation has been the identification of practically relevant classes of database queries for which such explanations can be computed in polynomial time, possibly using knowledge compilation forms. Apart from Shapley values, other forms of explanations, including that of aggregated database query answers (e.g., Roy-Suciu, SIGMOD’14) and connections of explanations with data privacy, fairness, and causal inference were discussed in the seminar. This way the seminar connected the field of database theory to the field of responsible data science that is of paramount importance in real world.
Acknowledgements
We are grateful to the Scientific Directorate and to the staff of the Schloss Dagstuhl – Leibniz Center for Informatics for their support of this seminar.
The goal of database theory is to formalize the theoretical underpinnings of databases and then analyze them with mathematical tools. One central question in the area is query evaluation, that is the computational problem of computing the answer to a query in a database. Of particular importance in this area is understanding the complexity of query evaluation, i.e., which resources, in particular, how much time and space, are necessary and sufficient to compute the answers to a given query. This Dagstuhl Seminar will focus on three key aspects of query evaluation: representation, provenance, and explanations. One common theme of these three directions is the use of circuits, which have a long history in complexity theory. In recent years, the theory of query processing has witnessed the use of different types of circuits, sometimes over Boolean inputs and operations, but also over more general structures, which often can be formalized as different semirings. We hope to see use of circuits in different representations of inputs for expressing and computing provenance, as well as for computing explanations using scores such as Shapley values. We hope to strengthen our understandings of these topics within database theory and logic. The topics to be discussed in the seminar are the following:
(1)Representation: One of the most important basics of algorithms is the use of good data-structures. There is a general tradeoff between expressivity, compactness, and efficient usability. Finding the right compromise in this tradeoff is often crucial for having efficient algorithms. These general considerations also apply to query evaluation: since in certain situations query results can be very large when materialized extensionally, and therefore, for many efficient query answering algorithms it is important to represent them in a compact fashion. For example, this is often the key underlying elements of enumeration algorithms, sampling, access to query results under access patterns, or algorithms providing so-called direct access to query results.
(2) Provenance: It is known that the query semantics for First Order Logic admits an interpretation in an arbitrary semiring. The most specialized case of Boolean semirings denote how an output tuple has been obtained from the inputs with joint usage (joins – translate to conjunctions "∧"), and alternative usages (projections or unions – translate to disjunctions "∨"). Such semirings can be used to understand compactly how outputs are generated from inputs, and have applications in query evaluation in uncertain data and in deletion propagation – to understand how the outputs change if one or more inputs are deleted, without re-computing the query. There are more advanced semirings like tropical semirings that can capture minimum cost in network flows. Compact representations of provenance semirings for recursive queries using circuits have also been studied, although the complexity of recursive queries under semiring semantics is open (despite its major importance in modern systems for machine learning (ML) where tensor algebra is extended with recursion). Compact and efficient knowledge compilation of provenance circuits into ordered and free binary decision diagrams, and more generally as decomposable deterministic negation normal forms (OBDD, FBDD, DNNF), are also important questions in database theory with applications in probabilistic databases.
(3) Explanations: As of late, explanations based on the widely known Shapley values from co-operative game theory have been used in database theory to measure the relevance of a certain database fact to a query answer, and to measure the relevance of inputs to the outcome of an ML classifier. Since the naive computation of Shapley values is intractable, as it includes a summation over exponentially many subsets, one of the main themes behind this investigation has been the identification of practically relevant classes of database queries for which such explanations can be computed in polynomial time. As it has been recently noted, all such positive results can be seen as particular instances of the fact that for a certain well-behaved class of circuits, namely deterministic and decomposable circuits, Shapley-values can be computed efficiently. The study of the complexity of computing Shapley values on different classes of circuits is an active area of research in AI, and we strongly believe that establishing bridges between this work and the one carried out in database theory can be strongly beneficial for both communities. There are other interesting notions of explanations for databases, ML, and aspects of responsible data science like fairness, which we will also explore in this Dagstuhl Seminar.
- Antoine Amarilli (Telecom Paris, FR) [dblp]
- Albert Atserias (UPC Barcelona Tech, ES) [dblp]
- Pablo Barcelo (PUC - Santiago de Chile, CL) [dblp]
- Christoph Berkholz (TU Ilmenau, DE) [dblp]
- Leopoldo Bertossi (SKEMA Business School - Montréal, CA) [dblp]
- Pierre Bourhis (CNRS - CRIStAL, Lille, FR) [dblp]
- Florent Capelli (University of Artois/CNRS - Lens, FR) [dblp]
- Wolfgang Gatterbauer (Northeastern University - Boston, US)
- Floris Geerts (University of Antwerp, BE) [dblp]
- Amir Gilad (The Hebrew University of Jerusalem, IL)
- Boris Glavic (University of Illinois - Chicago, US)
- Benny Kimelfeld (Technion - Haifa, IL) [dblp]
- Ester Livshits (University of Edinburgh, GB) [dblp]
- Bertram Ludäscher (University of Illinois at Urbana-Champaign, US) [dblp]
- Stefan Mengel (CNRS, CRIL - Lens, FR) [dblp]
- Mikaël Monet (INRIA Lille, FR)
- Liat Peterfreund (The Hebrew University of Jerusalem, IL) [dblp]
- Reinhard Pichler (TU Wien, AT) [dblp]
- Cristian Riveros (PUC - Santiago de Chile, CL) [dblp]
- Sudeepa Roy (Duke University - Durham, US) [dblp]
- Babak Salimi (University of California, San Diego - La Jolla, US)
- Pierre Senellart (ENS, PSL University - Paris, FR) [dblp]
- Christoph Standke (RWTH Aachen, DE)
- Dan Suciu (University of Washington - Seattle, US) [dblp]
- Nikolaos Tziavelis (Northeastern University - Boston, US)
- Harry Vinall-Smeeth (TU Ilmenau, DE)
Classification
- Data Structures and Algorithms
- Databases
- Logic in Computer Science
Keywords
- Provenance
- Factorized Databases
- Shapley values
- Database Theory
- Computational Complexity