Dagstuhl-Seminar 24391
Statistical and Probabilistic Methods in Algorithmic Data Analysis
(Sep 22 – Sep 27, 2024)
Organizers
- Aristides Gionis (KTH Royal Institute of Technology - Stockholm, SE)
- Matteo Riondato (Amherst College, US)
- Eli Upfal (Brown University - Providence, US)
Contact
- Andreas Dolzmann (for scientific matters)
- Jutka Gasiorowski (for administrative matters)
Dagstuhl Reports
As part of the mandatory documentation, participants are asked to submit their talk abstracts, working group results, etc. for publication in our series Dagstuhl Reports via the Dagstuhl Reports Submission System.
- Upload (Use personal credentials as created in DOOR to log in)
Dagstuhl Seminar Wiki
- Dagstuhl Seminar Wiki (Use personal credentials as created in DOOR to log in)
Shared Documents
- Dagstuhl Materials Page (Use personal credentials as created in DOOR to log in)
Modern algorithms for data analysis require the use of advanced probabilistic methods to achieve desirable scalability and accuracy guarantees. At the same time, modern data-analysis tasks require the use of advanced statistics to handle challenges such as testing multiple hypotheses or identifying dependencies among data points, such as in time series or graphs. Probabilistic methods are also at the core of theoretical computer-science areas, such as sublinear algorithms and average-case analysis. To yield efficient data-analysis algorithms, probabilistic methods require careful balancing of theoretical and practical aspects. This Dagstuhl Seminar brings together researchers interested in statistical and probabilistic methods to design and analyze scalable algorithms for discovering knowledge in large and rich datasets. We plan to cover the following topics, among others.
Approximation Algorithms for Data Analysis: Exploring data quickly is important to guide further analysis. Very large datasets will not fit in the main memory, or even on the disk, of a single machine, making exploratory analysis excessively slow and cumbersome. Approximation algorithms trade off the accuracy of the answer for the speed with which the approximate answer is obtained. Such solutions are acceptable when they offer stringent theoretical guarantees on their quality.
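As a minimal illustration of such a guarantee, the following sketch estimates the mean of a dataset with values in [0, 1] from a small uniform sample; Hoeffding's inequality gives the stringent theoretical bound mentioned above. The function name and parameters are illustrative, not from the seminar materials.

```python
import math
import random

def sample_mean_estimate(data, epsilon=0.05, delta=0.05, rng=random):
    """Estimate the mean of values in [0, 1] with additive error at most
    epsilon, with probability at least 1 - delta.

    By Hoeffding's inequality, n >= ln(2/delta) / (2 * epsilon^2)
    i.i.d. samples suffice, regardless of the size of the dataset.
    """
    n = math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))
    sample = rng.choices(data, k=n)  # with replacement -> i.i.d. samples
    return sum(sample) / n
```

Note that the sample size depends only on epsilon and delta, not on len(data): this is what makes the approach attractive for exploratory analysis of very large datasets.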
Data Streams and Sketches: In many applications, new data arrive at all times: e.g., in online social networks, new edges representing interactions between users arrive continuously. Data deletions can also be considered. Such dynamic settings are modeled nicely by data streams. The computational challenge is to use limited memory to keep different statistics (e.g., subgraph counts, patterns) up to date as the stream evolves with insertions and deletions. The best solutions for these problems use sketches, such as samples and/or coresets, i.e., succinct representations of the data seen so far (or of the relevant data, if the computation is to be performed over a sliding window), which are used to compute a good approximation of the quantities of interest. Their development usually requires analyzing statistical properties by studying the probability distribution of the content of the sketch.
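A classic example of a sample-based sketch for an insertion-only stream is reservoir sampling (Vitter's Algorithm R), sketched below; it maintains a uniform sample of fixed size k in O(k) memory, no matter how long the stream grows.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Maintain a uniform random sample of size k over an insertion-only
    stream using O(k) memory (Algorithm R).

    Invariant: after t items have arrived, each of them is in the
    reservoir with probability exactly min(1, k / t).
    """
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randrange(t)    # uniform in [0, t)
            if j < k:               # happens with probability k / t
                reservoir[j] = item
    return reservoir
```

Handling deletions, as the paragraph above notes, requires different sketches (e.g., linear sketches), since a deleted item may have evicted others from the sample.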
Mining Temporal Data and Time Series: There is a growing interest in mining temporal data, i.e., data generated by a process that varies in time. This type of data is not limited to “traditional” time series, but also includes more complex types of data, such as temporal networks. In contrast to the extensive work on model-based time-series analysis, in data mining the focus is on non-parametric, model-agnostic analysis. For instance, to estimate the current demand for products based on their past demand, it is often unrealistic to assume that the underlying process is time invariant. Instead, we search for patterns and trends in the data with minimal prior assumptions on the stochastic data-generation process, while taking into account the possibility of unknown distribution drift. The challenge, in this case, is to bound the error in the computation of patterns or trends in terms of minimal conditions on the drift. In temporal graphs, the focus is on computing graph characteristics and/or vertex or edge scores (such as centralities) defined over multiple snapshots of a graph.
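A minimal sketch of the drift trade-off described above: estimating the current mean of a drifting stream from only the last w observations. The class name and the bounded-drift assumption (per-step drift at most Delta) are illustrative, not from the seminar materials; under that assumption, a larger window lowers the statistical error (roughly 1/sqrt(w)) but raises the bias due to drift (roughly Delta * w).

```python
from collections import deque

class SlidingWindowMean:
    """Estimate the current mean of a drifting stream from the last w
    observations.

    Under bounded drift (|mu_t - mu_{t-1}| <= Delta), the window size w
    trades statistical error O(1/sqrt(w)) against drift bias O(Delta*w);
    bounding the combined error in terms of minimal conditions on the
    drift is exactly the challenge described above.
    """

    def __init__(self, w):
        self.window = deque(maxlen=w)
        self.total = 0.0

    def update(self, x):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # about to be evicted by append
        self.window.append(x)
        self.total += x

    def mean(self):
        return self.total / len(self.window)
```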
Statistically-Sound Knowledge Discovery from Data: In statistically-sound Knowledge Discovery from Data (KDD), results obtained from the data are treated as hypotheses that must undergo statistical testing before being deemed significant, i.e., informative about the random, partially unknown process that generated the data. The challenges in developing methods for statistically-sound KDD include (1) understanding how to subject the hypotheses to severe testing, to make it hard for them to be deemed significant; (2) considering the simultaneous testing of multiple hypotheses as the default setting, not as an afterthought; (3) offering flexible statistical guarantees at different stages of the discovery process; and (4) achieving scalability along multiple axes, from the dataset size to the number and complexity of hypotheses.
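As a concrete instance of treating multiple testing as the default setting, the sketch below applies the Benjamini-Hochberg procedure, a standard correction that controls the false discovery rate when m hypotheses are tested simultaneously; it is shown here as one illustrative option among the guarantees mentioned above.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg procedure: controls the false discovery rate
    at level alpha over m simultaneous hypothesis tests.

    Returns the (0-based) indices of the rejected hypotheses, i.e.,
    those deemed significant after correction.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest k such that p_(k) <= (k / m) * alpha ...
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k = rank
    # ... and reject the hypotheses with the k smallest p-values.
    return sorted(order[:k])
```

For example, with p-values [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205] and alpha = 0.05, only the first two hypotheses survive the correction, even though five of the raw p-values fall below 0.05.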
Participants
- Florian Adriaens (University of Helsinki, FI) [dblp]
- Aris Anagnostopoulos (Sapienza University of Rome, IT) [dblp]
- Luca Becchetti (Sapienza University of Rome, IT) [dblp]
- Fabian Brandt-Tumescheit (HU Berlin, DE) [dblp]
- Gianmarco De Francisci Morales (CENTAI - Torino, IT) [dblp]
- Alessandro Epasto (Google - New York, US) [dblp]
- Esther Galbrun (University of Kuopio, FI) [dblp]
- Aristides Gionis (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
- Shahrzad Haddadan (Rutgers Business School - Piscataway, US) [dblp]
- Ravi Kumar (Google - Mountain View, US) [dblp]
- Silvio Lattanzi (Google - Barcelona, ES) [dblp]
- Stefano Leonardi (Sapienza University of Rome, IT) [dblp]
- Sebastian Lüderssen (TU Wien, AT)
- Alessio Mazzetto (Brown University - Providence, US) [dblp]
- Pauli Miettinen (University of Kuopio, FI) [dblp]
- Stefan Neumann (Technische Universität Wien, AT) [dblp]
- Lutz Oettershagen (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
- Rasmus Pagh (University of Copenhagen, DK) [dblp]
- Andrea Pietracaprina (University of Padova, IT) [dblp]
- Giulia Preti (CENTAI - Torino, IT) [dblp]
- Geppino Pucci (University of Padova, IT) [dblp]
- Matteo Riondato (Amherst College, US) [dblp]
- Will Rosenbaum (University of Liverpool, GB) [dblp]
- Larry Rudolph (Two Sigma Investments LP - New York, US) [dblp]
- Ilie Sarpe (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
- Mauro Sozio (Telecom Paris, FR) [dblp]
- Mahito Sugiyama (National Institute of Informatics - Tokyo, JP) [dblp]
- Nikolaj Tatti (University of Helsinki, FI) [dblp]
- Eli Upfal (Brown University - Providence, US) [dblp]
- Fabio Vandin (University of Padova, IT) [dblp]
- Sergei Vassilvitskii (Google - New York, US) [dblp]
- Yllka Velaj (Universität Wien, AT) [dblp]
- Geoffrey Webb (Monash University - Clayton, AU) [dblp]
- Anthony Wirth (The University of Sydney, AU) [dblp]
Classification
- Data Structures and Algorithms
- Databases
- Social and Information Networks
Keywords
- Data Mining
- Knowledge Discovery from Data
- Probabilistic Analysis
- Randomized Algorithms
- Statistical Hypothesis Testing