Dagstuhl-Seminar 24391
Statistical and Probabilistic Methods in Algorithmic Data Analysis
(Sep 22 – Sep 27, 2024)
Organizers
- Aristides Gionis (KTH Royal Institute of Technology - Stockholm, SE)
- Matteo Riondato (Amherst College, US)
- Eli Upfal (Brown University - Providence, US)
Contact
- Andreas Dolzmann (for scientific matters)
- Jutka Gasiorowski (for administrative matters)
Dagstuhl Reports
As part of the mandatory documentation, participants are asked to submit their talk abstracts, working group results, etc. for publication in our series Dagstuhl Reports via the Dagstuhl Reports Submission System.
- Upload (Use personal credentials as created in DOOR to log in)
Dagstuhl Seminar Wiki
- Dagstuhl Seminar Wiki (Use personal credentials as created in DOOR to log in)
Shared Documents
- Dagstuhl Materials Page (Use personal credentials as created in DOOR to log in)
Modern algorithms for data analysis require the use of advanced probabilistic methods to achieve desirable scalability and accuracy guarantees. At the same time, modern data-analysis tasks require the use of advanced statistics to handle challenges such as testing multiple hypotheses or identifying dependencies among data points, such as in time series or graphs. Probabilistic methods are also at the core of theoretical computer-science areas, such as sublinear algorithms and average-case analysis. To yield efficient data-analysis algorithms, probabilistic methods require careful balancing of theoretical and practical aspects. This Dagstuhl Seminar brings together researchers interested in statistical and probabilistic methods to design and analyze scalable algorithms for discovering knowledge in large and rich datasets. We plan to cover the following topics, among others.
Approximation Algorithms for Data Analysis: Exploring data quickly is important to guide further analysis. Very large datasets will not fit in the main memory, or even on the disk, of a single machine, making exploratory analysis excessively slow and cumbersome. Approximation algorithms trade off the accuracy of the answer for the speed with which the approximate answer is obtained. Such solutions are acceptable when they offer stringent theoretical guarantees on their quality.
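As a minimal illustration of such a guarantee, the following sketch estimates the mean of a dataset with values in [0, 1] from a small uniform sample; Hoeffding's inequality gives the stringent theoretical bound mentioned above. The function name and parameters are illustrative, not from the seminar materials.

```python
import math
import random

def sample_mean_estimate(data, epsilon=0.05, delta=0.05, rng=random):
    """Estimate the mean of values in [0, 1] with additive error at most
    epsilon, with probability at least 1 - delta.

    By Hoeffding's inequality, n >= ln(2/delta) / (2 * epsilon^2)
    i.i.d. samples suffice, regardless of the size of the dataset.
    """
    n = math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))
    sample = rng.choices(data, k=n)  # with replacement -> i.i.d. samples
    return sum(sample) / n
```

Note that the sample size depends only on epsilon and delta, not on len(data): this is what makes the approach attractive for exploratory analysis of very large datasets.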
Data Streams and Sketches: In many applications, new data arrive at all times: e.g., in online social networks, new edges representing interactions between users arrive continuously. Data deletions can also be considered. Such dynamic settings are modeled nicely by data streams. The computational challenge is to use limited memory to keep different statistics (e.g., subgraph counts, patterns) up to date as the stream evolves with insertions and deletions. The best solutions for these problems use sketches, such as samples and/or coresets, i.e., succinct representations of the data seen so far (or of the relevant data, if the computation is to be performed over a sliding window), which are used to compute a good approximation of the quantities of interest. Their development usually requires analyzing statistical properties by studying the probability distribution of the content of the sketch.
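A classic example of a sample-based sketch for an insertion-only stream is reservoir sampling (Vitter's Algorithm R), sketched below; it maintains a uniform sample of fixed size k in O(k) memory, no matter how long the stream grows.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Maintain a uniform random sample of size k over an insertion-only
    stream using O(k) memory (Algorithm R).

    Invariant: after t items have arrived, each of them is in the
    reservoir with probability exactly min(1, k / t).
    """
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randrange(t)    # uniform in [0, t)
            if j < k:               # happens with probability k / t
                reservoir[j] = item
    return reservoir
```

Handling deletions, as the paragraph above notes, requires different sketches (e.g., linear sketches), since a deleted item may have evicted others from the sample.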
Mining Temporal Data and Time Series: There is a growing interest in mining temporal data, i.e., data generated by a process that varies in time. This type of data is not limited to “traditional” time series, but also includes more complex types of data, such as temporal networks. In contrast to the extensive work on model-based time-series analysis, in data mining the focus is on non-parametric, model-agnostic analysis. For instance, to estimate the current demand for products based on their past demand, it is often unrealistic to assume that the underlying process is time invariant. Instead, we search for patterns and trends in the data with minimal prior assumptions on the stochastic data-generation process, while taking into account the possibility of unknown distribution drift. The challenge, in this case, is to bound the error in the computation of patterns or trends in terms of minimal conditions on the drift. In temporal graphs, the focus is on computing graph characteristics and/or vertex or edge scores (such as centralities) defined over multiple snapshots of a graph.
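A minimal sketch of the drift trade-off described above: estimating the current mean of a drifting stream from only the last w observations. The class name and the bounded-drift assumption (per-step drift at most Delta) are illustrative, not from the seminar materials; under that assumption, a larger window lowers the statistical error (roughly 1/sqrt(w)) but raises the bias due to drift (roughly Delta * w).

```python
from collections import deque

class SlidingWindowMean:
    """Estimate the current mean of a drifting stream from the last w
    observations.

    Under bounded drift (|mu_t - mu_{t-1}| <= Delta), the window size w
    trades statistical error O(1/sqrt(w)) against drift bias O(Delta*w);
    bounding the combined error in terms of minimal conditions on the
    drift is exactly the challenge described above.
    """

    def __init__(self, w):
        self.window = deque(maxlen=w)
        self.total = 0.0

    def update(self, x):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # about to be evicted by append
        self.window.append(x)
        self.total += x

    def mean(self):
        return self.total / len(self.window)
```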
Statistically-Sound Knowledge Discovery from Data: In statistically-sound Knowledge Discovery from Data (KDD), results obtained from the data are treated as hypotheses that must undergo statistical testing before being deemed significant, i.e., informative about the random, partially unknown process that generated the data. The challenges in developing methods for statistically-sound KDD include (1) understanding how to subject the hypotheses to severe testing, to make it hard for them to be deemed significant; (2) considering the simultaneous testing of multiple hypotheses as the default setting, not as an afterthought; (3) offering flexible statistical guarantees at different stages of the discovery process; and (4) achieving scalability along multiple axes, from the dataset size to the number and complexity of hypotheses.
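As a concrete instance of treating multiple testing as the default setting, the sketch below applies the Benjamini-Hochberg procedure, a standard correction that controls the false discovery rate when m hypotheses are tested simultaneously; it is shown here as one illustrative option among the guarantees mentioned above.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg procedure: controls the false discovery rate
    at level alpha over m simultaneous hypothesis tests.

    Returns the (0-based) indices of the rejected hypotheses, i.e.,
    those deemed significant after correction.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest k such that p_(k) <= (k / m) * alpha ...
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k = rank
    # ... and reject the hypotheses with the k smallest p-values.
    return sorted(order[:k])
```

For example, with p-values [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205] and alpha = 0.05, only the first two hypotheses survive the correction, even though five of the raw p-values fall below 0.05.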
Participants
- Florian Adriaens (University of Helsinki, FI) [dblp]
- Aris Anagnostopoulos (Sapienza University of Rome, IT) [dblp]
- Luca Becchetti (Sapienza University of Rome, IT) [dblp]
- Fabian Brandt-Tumescheit (HU Berlin, DE) [dblp]
- Gianmarco De Francisci Morales (CENTAI - Torino, IT) [dblp]
- Alessandro Epasto (Google - New York, US) [dblp]
- Esther Galbrun (University of Kuopio, FI) [dblp]
- Aristides Gionis (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
- Shahrzad Haddadan (Rutgers Business School - Piscataway, US) [dblp]
- Ravi Kumar (Google - Mountain View, US) [dblp]
- Silvio Lattanzi (Google - Barcelona, ES) [dblp]
- Stefano Leonardi (Sapienza University of Rome, IT) [dblp]
- Sebastian Lüderssen (TU Wien, AT)
- Alessio Mazzetto (Brown University - Providence, US) [dblp]
- Pauli Miettinen (University of Kuopio, FI) [dblp]
- Stefan Neumann (Technische Universität Wien, AT) [dblp]
- Lutz Oettershagen (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
- Rasmus Pagh (University of Copenhagen, DK) [dblp]
- Andrea Pietracaprina (University of Padova, IT) [dblp]
- Giulia Preti (CENTAI - Torino, IT) [dblp]
- Geppino Pucci (University of Padova, IT) [dblp]
- Matteo Riondato (Amherst College, US) [dblp]
- Will Rosenbaum (University of Liverpool, GB) [dblp]
- Larry Rudolph (Two Sigma Investments LP - New York, US) [dblp]
- Ilie Sarpe (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
- Mauro Sozio (Telecom Paris, FR) [dblp]
- Mahito Sugiyama (National Institute of Informatics - Tokyo, JP) [dblp]
- Nikolaj Tatti (University of Helsinki, FI) [dblp]
- Eli Upfal (Brown University - Providence, US) [dblp]
- Fabio Vandin (University of Padova, IT) [dblp]
- Sergei Vassilvitskii (Google - New York, US) [dblp]
- Yllka Velaj (Universität Wien, AT) [dblp]
- Geoffrey Webb (Monash University - Clayton, AU) [dblp]
- Anthony Wirth (The University of Sydney, AU) [dblp]
Classification
- Data Structures and Algorithms
- Databases
- Social and Information Networks
Keywords
- Data Mining
- Knowledge Discovery from Data
- Probabilistic Analysis
- Randomized Algorithms
- Statistical Hypothesis Testing