Dagstuhl Seminar 25182
Challenges and Opportunities of Table Representation Learning
(Apr 27 – May 02, 2025)
Organizers
- Carsten Binnig (TU Darmstadt, DE)
- Julian Martin Eisenschlos (Google Research - Zürich, CH)
- Madelon Hulsebos (CWI - Amsterdam, NL)
- Frank Hutter (Universität Freiburg, DE)
Contact
- Michael Gerke (for scientific matters)
- Susanne Bach-Bernhard (for administrative matters)
The increasing amount of data being collected, stored, and analyzed creates a need for efficient, scalable, and robust methods to handle this data. Representation learning, i.e., the practice of leveraging neural networks to obtain generic representations of data objects, has proven effective for various applications over data modalities such as images and text. More recently, representation learning has shown impressive initial capabilities on structured data (e.g., relational tables in databases) for a limited set of tasks in data management and analysis, such as data cleaning, insight retrieval, and data analytics. Most of these applications traditionally relied on heuristics and statistics, which are limited in robustness, scale, and accuracy. The ability to learn abstract representations across tables has unlocked new opportunities, such as pretrained models for data augmentation and machine learning, that address these limitations. This emerging research area, which we refer to as Table Representation Learning (TRL), is receiving increasing interest from industry as well as academia, in particular in the communities of data management, machine learning, and natural language processing.
This growing interest is a result of the high potential impact of TRL in industry, given the abundance of tables in the organizational data landscape, the large range of high-value applications relying on tables, and the early state of TRL research. Recently, specialized TRL models for embedding (relational) tables, as well as prompting methods for LLMs over structured data residing in databases, have been developed and shown to be effective for various tasks, e.g., data preparation, machine learning, and question answering. However, studies have revealed shortcomings of existing models: they struggle to capture the structure of tables, the relationships among tables, and the heterogeneity (e.g., numbers, dates, text), biases, and semantics of the data contents; they show limited generalization to new domains; and they leave privacy constraints unaddressed. These challenges are merely the first limitations surfaced so far, and we expect to identify more limitations of existing approaches through discussions, talks, and hands-on sessions at the TRL Dagstuhl Seminar.
As we stand at the starting point of developing and adopting high-capacity neural models (e.g., through representation or generative learning) for structured data, there is a wide range of applications that have not yet been addressed. For example, pretrained models for tabular machine learning have been explored only to a limited extent, whereas “upstream” data management applications, such as automated data validation and query and schema optimization, remain unexplored. Therefore, another objective of this Dagstuhl Seminar is to identify novel application areas, build first prototypes to assess their potential, and develop research agendas for further exploration of these applications. Moreover, beyond these unexplored applications, we aim to develop a manifesto that brings forward a common long-term vision for TRL, with moonshot ideas and the road to get there, which requires perspectives from experts in academia and industry.
Classification
- Artificial Intelligence
- Computation and Language
- Databases
Keywords
- Representation and Generative Learning for Data Management and Analysis
- Applications of Table Representation Learning
- Benchmarks and Datasets for Table Representation Learning
- Pre-trained (Language) Models for Tables and Databases