Dagstuhl Seminar 26161
Managing Vector Data for Retrieval Augmented Generation: Systems and Algorithms
(Apr 12 – Apr 17, 2026)
Organizers
- Sihem Amer-Yahia (University of Grenoble, FR)
- Arijit Khan (Aalborg University, DK)
- Wolfgang Lehner (TU Dresden, DE)
- Sharad Mehrotra (University of California - Irvine, US)
- Wenjie Zhang (UNSW - Sydney, AU)
Contact
- Marsha Kleinbauer (for scientific matters)
- Simone Schilke (for administrative matters)
Deep learning models now generate dense, high-dimensional vector data at billion scale. These models embed complex, multi-modal data, including text, multimedia, graphs, and tables, into vector representations that aim to preserve semantically meaningful information for downstream tasks such as question answering, recommendation, video search, drug design, and other data science applications. Vector databases (VectorDBs) are optimized specifically for the storage and management of high-dimensional vectors. Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model (LLM) so that it references an authoritative knowledge base outside of its training data before generating a response. RAG and VectorDBs are two important concepts in natural language processing (NLP) and multi-modal data management that are pushing the boundaries of what AI systems can achieve. A critical component behind the capabilities of RAG models is the vector database, which stores the embeddings for fast semantic search during the initial retrieval stage. For RAG models to scale to a huge corpus containing billions of text passages and multi-modal knowledge graphs, effective and efficient model fine-tuning, indexing, and querying of vector representations are crucial. This is where highly optimized vector databases and similarity search libraries, e.g., Weaviate, Chroma, FAISS, Vespa, or Pinecone, come into play. They allow storing billions of entity or document vectors for low-latency similarity search.
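As a rough illustration of the retrieval stage described above, the following minimal Python sketch ranks a few passages by cosine similarity to a query vector and assembles the top hits into a prompt. Everything in it is an assumption made for illustration: the function names, the passage texts, and the hand-written three-dimensional vectors, which stand in for embeddings that a real model would produce. It also scans the corpus exhaustively, whereas systems operating at billion scale typically rely on approximate nearest-neighbor indexes.

```python
import heapq
import math

# Hypothetical toy corpus: passage id -> text.
PASSAGES = {
    "p1": "Vector databases store embeddings for similarity search.",
    "p2": "Dagstuhl seminars take place at Schloss Dagstuhl.",
    "p3": "Approximate indexes trade accuracy for query latency.",
}

# Hand-written 3-d vectors standing in for model-generated embeddings.
EMBEDDINGS = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.0, 1.0, 0.0],
    "p3": [0.8, 0.2, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, embeddings, k=2):
    """Exhaustive scan: score every vector and keep the k most similar."""
    scored = ((cosine_similarity(query_vec, vec), pid)
              for pid, vec in embeddings.items())
    return [(pid, score) for score, pid in heapq.nlargest(k, scored)]


def build_prompt(question, passage_ids):
    """Prepend the retrieved passages to the question as context for an LLM."""
    context = "\n".join(f"- {PASSAGES[pid]}" for pid in passage_ids)
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedded question
    hits = retrieve_top_k(query_vec, EMBEDDINGS, k=2)
    print(build_prompt("What do vector databases store?",
                       [pid for pid, _ in hits]))
```

The resulting prompt would then be passed to an LLM for generation; that step is omitted here because it depends on the model and serving stack in use.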
More generally, the use of VectorDBs to power RAG raises emerging critical problems: how to generate vector data that effectively fuses multi-modal information; when the geometry of the data preserves semantic information; how to update vector data dynamically; how to support efficient storage, indexing, visualization, scalable querying, and explanation; and how to preserve privacy and fairness. This Dagstuhl Seminar aims to bring together researchers from the emerging areas of RAG, VectorDBs, systems, and applications, providing opportunities for interdisciplinary progress. We plan a traditional mix of invited (and thus well-prepared) presentations from both academia and industry, breakout sessions, a panel, a demo and/or poster session, and a gong show session of 5-minute talks in which participants can showcase relevant ongoing research, visionary ideas, etc., thereby initiating more discussion and cross-disciplinary collaboration.
Topics (non-exhaustive):
- Vector data generation, geometry, dynamic update
- Retrieval-augmented generation
- Storing and indexing vector data for RAG
- Vector databases for knowledge modeling and cross-modal data retrieval
- Query optimization in vector databases
- Software-hardware collaborative approaches and cloud data management for vector data
- Access control, privacy, fairness, data regulations, adding human-in-the-loop, and explainability in vector data management
- Applications of vector data, RAG, and LLMs
Classification
- Databases
Keywords
- Vector databases
- Vector data management
- Data management and Machine Learning
- Retrieval Augmented Generation
- Similarity search