Dagstuhl Seminar 13251: Parallel Data Analysis

( Jun 16 – Jun 21, 2013 )

Permalink

Please use the following short url to reference this page: https://www.dagstuhl.de/13251

Organizers

Artur Andrzejak (Universität Heidelberg, DE)
Joachim Giesen (Universität Jena, DE)
Raghu Ramakrishnan (Microsoft Corporation - Redmond, US)
Ion Stoica (University of California - Berkeley, US)

Contact

Susanne Bach-Bernhard (for administrative matters)

Publications

Parallel Data Analysis (Dagstuhl Seminar 13251). Artur Andrzejak, Joachim Giesen, Raghu Ramakrishnan, and Ion Stoica. In Dagstuhl Reports, Volume 3, Issue 6, pp. 67-82, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2013)

Motivation

Parallel data analysis accelerates the investigation of data sets of all sizes, and is indispensable when processing huge volumes of data. The current ubiquity of parallel hardware such as multi-core processors, modern GPUs, and computing clusters has created an excellent environment for this approach. However, exploiting these computing resources effectively requires significant effort due to the lack of mature frameworks, software, and even algorithms designed for data analysis in such computing environments.

As a result, parallel data analysis is often used only as a last resort, i.e., when the data size becomes too big for sequential data analysis, and it is hardly ever used for analyzing small and medium-sized data sets. The barrier to adoption is even higher for specialists from other areas such as the sciences, business, and commerce. These users often have to make do with slower, yet much easier to use, sequential programming environments and tools, regardless of the data size.

The seminar will try to address these challenges by focusing on three major goals:

Designing efficient and scalable parallel algorithms for machine learning and statistical analysis.
Providing user-friendly parallel programming paradigms and cross-platform frameworks or libraries for easy implementation and experimentation.
Developing benchmarks, standardized data sets, and public platforms for evaluating (parallel) data analysis algorithms and environments.

To achieve this, the seminar will bring together academic researchers and industry practitioners to foster cross-disciplinary interactions on parallel analysis of scientific and business data. In particular, it will target the communities in the areas of machine learning and data mining, parallel and distributed systems, database systems, and languages and tools for data analysis.

The seminar program will include individual presentations on new research results, tools and usage scenarios, plenary sessions, as well as work in focus groups. The primary role of the focus groups will be to foster the collaboration of the participants on new project proposals, research papers, and the creation of benchmarks for parallel data analysis algorithms and tools.

Summary

Motivation and goals

Parallel data analysis accelerates the investigation of data sets of all sizes, and is indispensable when processing huge volumes of data. The current ubiquity of parallel hardware such as multi-core processors, modern GPUs, and computing clusters has created an excellent environment for this approach. However, exploiting these computing resources effectively requires significant effort due to the lack of mature frameworks, software, and even algorithms designed for data analysis in such computing environments.

As a result, parallel data analysis is often used only as a last resort, i.e., when the data size becomes too big for sequential data analysis, and it is hardly ever used for analyzing small and medium-sized data sets, although it could be beneficial there as well, e.g., by cutting compute time down from hours to minutes or even making the data analysis process interactive. The barrier to adoption is even higher for specialists from other areas such as the sciences, business, and commerce. These users often have to make do with slower, yet much easier to use, sequential programming environments and tools, regardless of the data size.

The seminar participants have tried to address these challenges by focusing on the following goals:

Providing user-friendly parallel programming paradigms and cross-platform frameworks or libraries for easy implementation and experimentation.
Designing efficient and scalable parallel algorithms for machine learning and statistical analysis in connection with an analysis of use cases.

The program

The seminar program consisted of individual presentations on new results and ongoing work, a plenary session, and work in two working groups. The primary role of the working groups was to foster collaboration among the participants, allowing cross-disciplinary knowledge sharing and insights. Work in one group is still ongoing and is expected to result in a magazine publication.

The topics of the plenary session and the working groups were as follows:

Panel "From Big Data to Big Money"
Working group "A": Algorithms and applications
Working group "P": Programming paradigms, frameworks and software

Creative Commons BY 3.0 Unported license

Artur Andrzejak, Joachim Giesen, Raghu Ramakrishnan, and Ion Stoica

Participants

Artur Andrzejak (Universität Heidelberg, DE) [dblp]
Ron Bekkerman (Carmel Ventures - Herzeliya, IL) [dblp]
Joos-Hendrik Böse (SAP SE - Berlin, DE) [dblp]
Sebastian Breß (Universität Magdeburg, DE) [dblp]
Patrick Briest (McKinsey&Company - Düsseldorf, DE) [dblp]
Jürgen Broß (FU Berlin, DE) [dblp]
Lutz Büch (Universität Heidelberg, DE) [dblp]
Michael J. Cafarella (University of Michigan - Ann Arbor, US) [dblp]
Surajit Chaudhuri (Microsoft Corporation - Redmond, US) [dblp]
Tyson Condie (Yahoo! Inc. - Burbank, US) [dblp]
Giuseppe Di Fatta (University of Reading, GB) [dblp]
Rodrigo Fonseca (Brown University - Providence, US) [dblp]
Johannes Fürnkranz (TU Darmstadt, DE) [dblp]
Joao Gama (University of Porto, PT) [dblp]
Joachim Giesen (Universität Jena, DE) [dblp]
Philipp Große (SAP SE - Walldorf, DE) [dblp]
Max Heimel (TU Berlin, DE) [dblp]
Yves J. Hilpisch (Visixion GmbH, DE)
Anthony D. Joseph (University of California - Berkeley, US) [dblp]
George Karypis (University of Minnesota - Minneapolis, US) [dblp]
Shonali Krishnaswamy (Infocomm Research - Singapore, SG) [dblp]
Soeren Laue (Universität Jena, DE) [dblp]
Frank McSherry (Microsoft Corp. - Mountain View, US) [dblp]
Klaus Mueller (Stony Brook University, US) [dblp]
Jens K. Müller (Universität Jena, DE) [dblp]
Srinivasan Parthasarathy (Ohio State University - Columbus, US) [dblp]
Tom Peterka (Argonne National Laboratory, US) [dblp]
Raghu Ramakrishnan (Microsoft Corporation - Redmond, US) [dblp]
Ion Stoica (University of California - Berkeley, US) [dblp]
Domenico Talia (University of Calabria, IT) [dblp]
Alexandre Termier (University of Grenoble, FR) [dblp]
Markus Weimer (Microsoft Corporation - Redmond, US) [dblp]
Hans-Martin Will (SpaceCurve - Seattle, US) [dblp]
Matei Zaharia (University of California - Berkeley, US) [dblp]
Osmar Zaiane (University of Alberta - Edmonton, CA) [dblp]

Classification

artificial intelligence / robotics
data bases / information retrieval
data structures / algorithms / complexity

Keywords

Parallel machine learning
parallel data processing
data mining
software frameworks
storage and database systems
