Dagstuhl Seminar 24102
Shapes in Graph Data: Theory and Implementation
(Mar 03 – Mar 08, 2024)
Organizers
- Shqiponja Ahmetaj (TU Wien, AT)
- Slawomir Staworko (relationalAI - Berkeley, US)
- Jan Van den Bussche (Hasselt University, BE)
Contact
- Marsha Kleinbauer (for scientific matters)
- Susanne Bach-Bernhard (for administrative matters)
Research Area and Goals of the Seminar
One of the main reasons for the success of graph databases is that they do not require an elaborate database schema, with accompanying integrity constraints, to be set up in advance. In classical enterprise applications, constraints and schemas are mainly descriptive: their purpose is to support the mental map from the real world to the data managed in the database. The emergence of graph databases, however, is accompanied by a paradigm shift towards new applications where schemas and constraints serve a prescriptive purpose. Here, the goal is to establish a contract between the database and its users that provides guarantees on the structure and form of the data provided.
This shift has led to the development of a new class of formalisms based on the notion of shapes. Shapes are constraints on nodes in the graph that impose or forbid structural patterns (involving paths, edges, labels, and constant values). Naturally, then, a novel, prescriptive notion of schema emerges, consisting of a set of shapes together with a targeting mechanism that specifies which nodes should satisfy which shapes. In the world of RDF graphs, two main shape-based formalisms have been proposed: SHACL (Shapes Constraint Language), standardized by the W3C, and ShEx (Shape Expression schemas). In the world of property graphs (PGs), different systems have their own data definition languages, such as Cypher or GSQL. Moreover, there are recent formal approaches to defining schemas for property graphs, such as PG-Schema and PG-Keys.
The main aim of the Dagstuhl Seminar was to bring together active researchers from both academia and industry to report on the most recent results, to discuss the many open problems and research directions that arise from shapes, constraints, and schemas for graph databases, and to initiate new research.
Organization and Outcomes
The organisers built the schedule from the entries in a shared Google document, set up before the seminar, in which participants were invited to propose talks, demos, and research topics. The seminar began with a round of introductions, during which participants also raised questions they wanted answered during the seminar. The final schedule included 18 contributed talks and 6 short presentations on potential research and discussion topics.
As a major result of the seminar, four working groups were formed on the following topics:
- What is used in practice for graph data abstractions? What is needed in practice for graph data abstractions? The group was formed in response to related questions raised by many participants during the introductory round on the first day of the seminar. Several research challenges were discussed; addressing them will call for new human-centered research lines in the data management community and beyond.
- Repairs and explanations in knowledge graph data management systems in the presence of shape constraints. The group discussed the problem of assessing and managing data quality in knowledge graphs (KGs), a long-standing issue that attracts significant attention in both industry and academia. The new proposals for schema and shape languages for KGs have introduced new challenges, which call for new methods to verify validity, to deal with inconsistency, and to repair inconsistent data.
- Relating 6NF (Sixth Normal Form) and PG-Schema. In this working group, two main questions were discussed: (1) Can we show in a systematic manner how schemas for property graphs, as expressed in the proposals of PG-Schema and PG-Keys, can be represented relationally, obtaining highly decomposed (6NF) schemas with key constraints and inclusion constraints such as foreign keys? (2) Can the intent of a graph database application be formalized in a suitable variant of EER (extended Entity-Relationship) diagrams?
- Convergence of graph data models and schemas. The goal of the group was to understand the commonalities and differences between RDF and LPG (labelled property graphs) and between their corresponding schema languages: ShEx and SHACL for RDF, and PG-Schema for LPG. The aim is to identify a common core (a small but useful common sublanguage, easily expressible in all three formalisms) and a common superlanguage (a language that captures all three formalisms, yet remains manageable).
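As a rough illustration of the first question raised in the 6NF working group, the sketch below decomposes a toy property graph into relations that each pair a key with a single other attribute, and then checks a foreign-key style inclusion constraint on edge endpoints. This is an assumption-laden toy, not the working group's construction: the example graph, the relation names, and the helpers `decompose` and `inclusion_violations` are all hypothetical.

```python
# Illustrative sketch only; all data and names are hypothetical.

# A toy property graph: nodes and edges carry a label and properties.
NODES = {
    "n1": {"label": "Person", "props": {"name": "Alice", "age": 42}},
    "n2": {"label": "Company", "props": {"name": "ACME"}},
}
EDGES = {
    "e1": {"label": "worksFor", "src": "n1", "tgt": "n2",
           "props": {"since": 2020}},
}


def decompose(nodes, edges):
    """Split the graph into binary relations: identifier -> one attribute.

    Each relation is a dict keyed by the node or edge identifier, so the
    key constraint holds by construction.
    """
    rels = {"node_label": {}, "edge_label": {}, "edge_src": {}, "edge_tgt": {}}
    for nid, node in nodes.items():
        rels["node_label"][nid] = node["label"]
        for prop, value in node["props"].items():
            rels.setdefault(f"node_prop_{prop}", {})[nid] = value
    for eid, edge in edges.items():
        rels["edge_label"][eid] = edge["label"]
        rels["edge_src"][eid] = edge["src"]
        rels["edge_tgt"][eid] = edge["tgt"]
        for prop, value in edge["props"].items():
            rels.setdefault(f"edge_prop_{prop}", {})[eid] = value
    return rels


def inclusion_violations(rels):
    """Edge endpoints must reference existing nodes (a foreign key)."""
    known = set(rels["node_label"])
    return sorted(
        (name, eid)
        for name in ("edge_src", "edge_tgt")
        for eid, nid in rels[name].items()
        if nid not in known
    )


if __name__ == "__main__":
    relations = decompose(NODES, EDGES)
    for name in sorted(relations):
        print(name, relations[name])
    print(inclusion_violations(relations))
```

The source and target of an edge are stored in separate relations here because a single relation holding both could be decomposed further. Whether such a translation can be made systematic for full PG-Schema and PG-Keys is exactly the open question the group discussed.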
The organisers regard the seminar as a very successful scientific event. Members of each working group expressed a clear commitment to staying connected in order to investigate these topics further. The first two groups named a vision paper as the concrete goal of their future efforts, while the last two aim to produce research papers.
The organisers are grateful to the Scientific Directorate and to the staff for their support in making this seminar possible.
Motivation
Over the past decade, graph databases based on RDF and property graphs have gained significant traction. The time is ripe to bring people together around shapes and, more generally, flexible and expressive schema and constraint languages for graph databases. We explain our motivation in what follows.
One of the main reasons for the success of graph databases is that they do not require an elaborate database schema, with accompanying integrity constraints, to be set up in advance. Of course, the principles of conceptual and logical database design remain as valuable as ever for critical enterprise applications. In these classical applications, constraints and schemas are mainly descriptive: their purpose is to support the mental map from the real world to the data managed in the database. In contrast, the emergence of graph databases is accompanied by a paradigm shift towards new applications where schemas and constraints serve a prescriptive purpose. Here, the goal is to establish a contract between the database and its users that provides guarantees on the structure and form of the data provided and imposes the restrictions required for data governance. There is no single schema; instead, schemas are developed "as you go" and are adapted depending on how the data is used.
The need for flexible languages for writing prescriptive schemas was felt rather quickly. Such languages need a sound formal underpinning if they are to be used for improving and ensuring data quality, and for static analysis and verification of graph database transformations. Soon, a new class of formalisms based on the notion of shape emerged. Shapes are constraints on nodes in the data graph that impose or forbid structural patterns (involving paths, edges, labels, and constant values). Importantly, shapes can refer to other shapes. Naturally, then, a novel, prescriptive notion of schema emerges, consisting of a set of shapes together with an assignment or targeting mechanism that specifies which nodes should satisfy which shapes. In the world of RDF graphs, two main shape-based formalisms have been proposed: SHACL (Shapes Constraint Language), standardized by the W3C, and ShEx (Shape Expression schemas). In the world of property graphs, different systems have their own data definition languages, such as Cypher or GSQL. An ISO/IEC working group is currently engaged in standardizing property graphs within the SQL/PGQ and GQL projects. Together with the Linked Data Benchmark Council, it is developing property graph schema formalisms that feature a shape-based type system.
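To make the notion of a shape-based schema concrete, here is a minimal Python sketch. It uses neither SHACL nor ShEx syntax and is not the API of any system: the toy graph, the shapes `person_shape` and `company_shape`, and the `TARGETS` table are hypothetical. They are chosen only to show shapes that constrain edges and constant values, one shape referring to another, and a targeting mechanism that assigns shapes to nodes.

```python
# Illustrative sketch only: a toy data graph and two hand-written shapes.
# All names here are hypothetical; this is not SHACL or ShEx.

# The data graph as a set of (subject, predicate, object) triples.
GRAPH = {
    ("alice", "type", "Person"),
    ("alice", "name", "Alice"),
    ("alice", "worksFor", "acme"),
    ("bob", "type", "Person"),
    ("bob", "worksFor", "alice"),  # bob has no name; his employer is no company
    ("acme", "type", "Company"),
    ("acme", "name", "ACME"),
}


def objects(node, predicate):
    """All values reachable from `node` over one `predicate` edge."""
    return {o for (s, p, o) in GRAPH if s == node and p == predicate}


def company_shape(node):
    """A company is typed as Company and has exactly one name."""
    return "Company" in objects(node, "type") and len(objects(node, "name")) == 1


def person_shape(node):
    """A person has at least one name, and every employer must itself
    satisfy the company shape (a shape referring to another shape)."""
    has_name = len(objects(node, "name")) >= 1
    employers_ok = all(company_shape(e) for e in objects(node, "worksFor"))
    return has_name and employers_ok


# Targeting mechanism: every node of a given type must satisfy the shape.
TARGETS = {"Person": person_shape, "Company": company_shape}


def validate():
    """Return the sorted list of violated (node, shape name) pairs."""
    violations = []
    for (node, predicate, cls) in GRAPH:
        if predicate == "type" and cls in TARGETS:
            shape = TARGETS[cls]
            if not shape(node):
                violations.append((node, shape.__name__))
    return sorted(violations)


if __name__ == "__main__":
    print(validate())
```

In this toy schema the only targeted node that fails is `bob`, who lacks a name and whose employer does not satisfy the company shape. Real shape languages add much more, such as path expressions, cardinalities, and recursion, which is where the questions of expressive power and complexity arise.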
Our aim is to bring together the leading researchers on shapes, schemas, and constraints for graph data, from both academia and industry, to discuss the many open problems. The purpose of this Dagstuhl Seminar is to inform each other about how we perceive the research area; to report on brand-new results; to discuss open problems and future directions; and to initiate new research.
Focus topics span the research axes that arise from shapes, constraints, and schemas for graph databases, in particular:
- Expressive power and complexity
- Implementation and processing strategies
- Automated reasoning about shapes
- Explaining and handling violations
Participants are encouraged to provide details regarding their research interests and preferred discussion topics before the seminar starts. Furthermore, it is suggested that they explore the information shared by the other participants beforehand.
Participants
- Shqiponja Ahmetaj (TU Wien, AT)
- Iovka Boneva (Université de Lille I, FR)
- Angela Bonifati (Université Claude Bernard - Lyon, FR & IUF - Paris, FR)
- Anastasia Dimou (KU Leuven, BE)
- Stefania Dumbrava (ENSIIE - Paris, FR)
- Nicolas Ferranti (Wirtschaftsuniversität Wien, AT)
- George Fletcher (TU Eindhoven, NL)
- Benoit Groz (University Paris-Saclay - Orsay, FR)
- Jan Hidders (Birkbeck, University of London, GB)
- Katja Hose (TU Wien, AT)
- Maxime Jakubowski (Hasselt University, BE)
- George Konstantinidis (University of Southampton, GB)
- José Emilio Labra Gayo (University of Oviedo, ES)
- Aurélien Lemay (INRIA Lille, FR)
- Leonid Libkin (University of Edinburgh, GB)
- Wim Martens (Universität Bayreuth, DE)
- Fabio Mogavero (University of Naples, IT)
- Filip Murlak (University of Warsaw, PL)
- Cem Okulmus (University of Umeå, SE)
- Nina Pardal (University of Sheffield, GB)
- Liat Peterfreund (The Hebrew University of Jerusalem, IL)
- Axel Polleres (Wirtschaftsuniversität Wien, AT)
- Ognjen Savkovic (Freie Universität Bozen, IT)
- Mantas Simkus (TU Wien, AT)
- Slawomir Staworko (relationalAI - Berkeley, US)
- Katherine Thornton (Yale University Library - New Haven, US)
- Jan Van den Bussche (Hasselt University, BE)
- Maria-Esther Vidal (TIB - Hannover, DE)
- Hannes Voigt (Neo4j - Leipzig, DE)
- Piotr Wieczorek (University of Wroclaw, PL)
Classification
- Databases
- Logic in Computer Science
Keywords
- data for the Semantic Web
- schema languages
- constraint languages
- graph data