Dagstuhl Seminar 24391

Statistical and Probabilistic Methods in Algorithmic Data Analysis

( Sep 22 – Sep 27, 2024 )


Permalink
Please use the following short url to reference this page: https://www.dagstuhl.de/24391

Organizers

  • Aristides Gionis (KTH Royal Institute of Technology - Stockholm, SE)
  • Matteo Riondato (Amherst College, US)
  • Eli Upfal (Brown University - Providence, US)

Motivation

Modern algorithms for data analysis require advanced probabilistic methods to achieve desirable scalability and accuracy guarantees. At the same time, modern data-analysis tasks require advanced statistics to handle challenges such as testing multiple hypotheses or identifying dependencies among data points, as in time series or graphs. Probabilistic methods are also at the core of theoretical computer-science areas such as sublinear algorithms and average-case analysis. To obtain efficient data-analysis algorithms, probabilistic methods require careful balancing of theoretical and practical aspects. This Dagstuhl Seminar brings together researchers interested in statistical and probabilistic methods for designing and analyzing scalable algorithms that discover knowledge in large and rich datasets. We plan to cover the following topics, among others.

Approximation Algorithms for Data Analysis: Exploring data quickly is important to guide further analysis. Very large datasets do not fit in the main memory, or even on the disk, of a single machine, making exploratory analysis excessively slow and cumbersome. Approximation algorithms trade off the accuracy of the answer for the speed with which the approximate answer is obtained. Such solutions are acceptable when they offer stringent theoretical guarantees on their quality.
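A minimal sketch of this trade-off, assuming a [0, 1]-valued dataset: estimate the mean from a uniform sample whose size is chosen via Hoeffding's inequality, so that the estimate is within eps of the true mean with probability at least 1 - delta. The function name and parameters are illustrative, not from the seminar material.

```python
import math
import random

def approx_mean(data, eps, delta, seed=0):
    """Estimate the mean of a [0, 1]-valued dataset by uniform sampling.

    By Hoeffding's inequality, n >= ln(2/delta) / (2 * eps^2) samples
    (with replacement) suffice so that |estimate - true mean| <= eps
    with probability at least 1 - delta, regardless of the data size.
    """
    n = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in range(n)]
    return sum(sample) / n

# For eps = 0.05 and delta = 0.01, n is about 1060 samples -- the same
# budget whether the dataset has 10^5 or 10^9 points.
estimate = approx_mean([i % 2 for i in range(100_000)], eps=0.05, delta=0.01)
```

The sample size depends only on eps and delta, not on the dataset size, which is what makes such estimators attractive for exploratory analysis.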

Data Streams and Sketches: In many applications new data arrive at all times: e.g., in online social networks new edges, representing interactions between users, arrive continuously; data deletions may also occur. Such dynamic settings are modeled nicely by data streams. The computational challenge is to use limited memory to keep different statistics (e.g., subgraph counts, patterns) up to date as the stream evolves with insertions and deletions. The best solutions for these problems use sketches, such as samples and/or coresets, i.e., succinct representations of the data seen so far (or of the relevant data, if the computation is performed over a sliding window), from which a good approximation of the quantities of interest can be computed. Developing sketches usually requires analyzing their statistical properties by studying the probability distribution of the sketch contents.
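One classic sample-based sketch is reservoir sampling (Vitter's Algorithm R), which maintains a uniform random sample of size k over an insertion-only stream using O(k) memory; a minimal sketch:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Maintain a uniform random sample of size k over an insertion-only
    stream, using O(k) memory (Vitter's Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir.
            reservoir.append(item)
        else:
            # Item i is kept with probability k / (i + 1); if kept, it
            # evicts a uniformly chosen current resident. This invariant
            # keeps every item seen so far equally likely to be present.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# A sample of 10 items from a stream of 1000, seen one item at a time.
sample = reservoir_sample(range(1000), k=10)
```

Handling deletions, or restricting attention to a sliding window, requires more elaborate sketches whose analysis is exactly the kind of probabilistic work described above.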

Mining Temporal Data and Time Series: There is a growing interest in mining temporal data, i.e., data generated by a process that varies in time. This type of data is not limited to “traditional” time series, but also includes more complex types of data, such as temporal networks. In contrast to the extensive work on model-based time-series analysis, in data mining the focus is on non-parametric, model-agnostic analysis. For instance, to estimate the current demand for products based on their past demand, it is often unrealistic to assume that the underlying process is time-invariant. Instead, we search for patterns and trends in the data with minimal prior assumptions on the stochastic data-generation process, while taking into account the possibility of unknown distribution drift. The challenge, in this case, is to bound the error in the computation of patterns or trends in terms of minimal conditions on the drift. In temporal graphs, the focus is on computing graph characteristics and/or vertex or edge scores (such as centralities) defined over multiple snapshots of a graph.
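A minimal, model-agnostic sketch of drift-aware estimation: a sliding-window mean, which discards old observations so the estimate tracks a slowly drifting process without assuming a parametric model. The window length w is an assumed tuning parameter; choosing it to balance variance against drift is precisely the kind of question raised above.

```python
from collections import deque

def sliding_window_means(stream, w):
    """Return, for each observation, the mean of the last w observations:
    a non-parametric estimate that adapts to slow distribution drift."""
    window = deque(maxlen=w)  # old observations fall out automatically
    means = []
    for x in stream:
        window.append(x)
        means.append(sum(window) / len(window))
    return means

# A process whose mean jumps from 1 to 5: the window-3 estimate
# forgets the old regime after three new observations.
estimates = sliding_window_means([1, 1, 1, 5, 5, 5], w=3)
```

A shorter window reacts faster to drift but has higher variance; error bounds under minimal drift conditions make this trade-off precise.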

Statistically-Sound Knowledge Discovery from Data: In statistically-sound Knowledge Discovery from Data (KDD), results obtained from the data are considered as hypotheses, and they must undergo statistical testing before being deemed significant, i.e., informative about the random, partially unknown process that generated the data. The challenges in developing methods for statistically-sound KDD include (1) understanding how to subject the hypotheses to severe testing, to make it hard for them to be deemed significant; (2) considering the simultaneous testing of multiple hypotheses as the default setting, not as an afterthought; (3) offering flexible statistical guarantees at different stages of the discovery process; and (4) achieving scalability along multiple axes, from the dataset size to the number and complexity of hypotheses.
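For challenge (2), a standard multiple-testing procedure is the Benjamini-Hochberg step-up correction, which controls the false discovery rate (FDR) when many hypotheses are tested at once; a minimal sketch:

```python
def benjamini_hochberg(pvalues, alpha):
    """Benjamini-Hochberg step-up procedure: return the indices of the
    hypotheses rejected while controlling the FDR at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value clears its BH threshold
    for rank, i in enumerate(order, start=1):
        # The rank-th smallest p-value is compared against rank * alpha / m.
        if pvalues[i] <= rank * alpha / m:
            k = rank
    # Reject all hypotheses up to and including rank k.
    return sorted(order[:k])

# Three small p-values survive the correction; the fourth does not.
rejected = benjamini_hochberg([0.01, 0.02, 0.03, 0.5], alpha=0.05)
```

FDR control is less stringent than controlling the family-wise error rate (as Bonferroni does), which is one reason KDD methods often prefer it when the number of candidate hypotheses is large.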

Copyright Aristides Gionis, Matteo Riondato, and Eli Upfal

Participants


  • Florian Adriaens (University of Helsinki, FI) [dblp]
  • Aris Anagnostopoulos (Sapienza University of Rome, IT) [dblp]
  • Luca Becchetti (Sapienza University of Rome, IT) [dblp]
  • Fabian Brandt-Tumescheit (HU Berlin, DE) [dblp]
  • Gianmarco De Francisci Morales (CENTAI - Torino, IT) [dblp]
  • Alessandro Epasto (Google - New York, US) [dblp]
  • Esther Galbrun (University of Kuopio, FI) [dblp]
  • Aristides Gionis (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
  • Shahrzad Haddadan (Rutgers Business School - Piscataway, US) [dblp]
  • Ravi Kumar (Google - Mountain View, US) [dblp]
  • Silvio Lattanzi (Google - Barcelona, ES) [dblp]
  • Stefano Leonardi (Sapienza University of Rome, IT) [dblp]
  • Sebastian Lüderssen (TU Wien, AT)
  • Alessio Mazzetto (Brown University - Providence, US) [dblp]
  • Pauli Miettinen (University of Kuopio, FI) [dblp]
  • Stefan Neumann (Technische Universität Wien, AT) [dblp]
  • Lutz Oettershagen (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
  • Rasmus Pagh (University of Copenhagen, DK) [dblp]
  • Andrea Pietracaprina (University of Padova, IT) [dblp]
  • Giulia Preti (CENTAI - Torino, IT) [dblp]
  • Geppino Pucci (University of Padova, IT) [dblp]
  • Matteo Riondato (Amherst College, US) [dblp]
  • Will Rosenbaum (University of Liverpool, GB) [dblp]
  • Larry Rudolph (Two Sigma Investments LP - New York, US) [dblp]
  • Ilie Sarpe (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
  • Mauro Sozio (Telecom Paris, FR) [dblp]
  • Mahito Sugiyama (National Institute of Informatics - Tokyo, JP) [dblp]
  • Nikolaj Tatti (University of Helsinki, FI) [dblp]
  • Eli Upfal (Brown University - Providence, US) [dblp]
  • Fabio Vandin (University of Padova, IT) [dblp]
  • Sergei Vassilvitskii (Google - New York, US) [dblp]
  • Yllka Velaj (Universität Wien, AT) [dblp]
  • Geoffrey Webb (Monash University - Clayton, AU) [dblp]
  • Anthony Wirth (The University of Sydney, AU) [dblp]

Classification
  • Data Structures and Algorithms
  • Databases
  • Social and Information Networks

Keywords
  • Data Mining
  • Knowledge Discovery from Data
  • Probabilistic Analysis
  • Randomized Algorithms
  • Statistical Hypothesis Testing