Dagstuhl Seminar 13251
Parallel Data Analysis
(Jun 16 – Jun 21, 2013)
Organizers
- Artur Andrzejak (Universität Heidelberg, DE)
- Joachim Giesen (Universität Jena, DE)
- Raghu Ramakrishnan (Microsoft Corporation - Redmond, US)
- Ion Stoica (University of California - Berkeley, US)
Contact
- Susanne Bach-Bernhard (for administrative matters)
Parallel data analysis accelerates the investigation of data sets of all sizes, and is indispensable when processing huge volumes of data. The current ubiquity of parallel hardware such as multi-core processors, modern GPUs, and computing clusters has created an excellent environment for this approach. However, exploiting these computing resources effectively requires significant effort due to the lack of mature frameworks, software, and even algorithms designed for data analysis in such computing environments.
As a result, parallel data analysis is often used only as a last resort, i.e., when the data size becomes too big for sequential data analysis, and it is hardly ever used for analyzing small and medium-sized data sets. The barrier to adoption is even higher for specialists from other areas such as the sciences, business, and commerce. These users often have to make do with slower, yet much easier to use, sequential programming environments and tools, regardless of the data size.
The seminar will try to address these challenges by focusing on three major goals:
- Designing efficient and scalable parallel algorithms for machine learning and statistical analysis.
- Providing user-friendly parallel programming paradigms and cross-platform frameworks or libraries for easy implementation and experimentation.
- Developing benchmarks, standardized data sets, and public platforms for evaluating (parallel) data analysis algorithms and environments.
To achieve this, the seminar will bring together academic researchers and industry practitioners to foster cross-disciplinary interactions on parallel analysis of scientific and business data. In particular, it will target the communities in the areas of machine learning and data mining, parallel and distributed systems, database systems, and languages and tools for data analysis.
The seminar program will include individual presentations on new research results, tools and usage scenarios, plenary sessions, as well as work in focus groups. The primary role of the focus groups will be to foster the collaboration of the participants on new project proposals, research papers, and the creation of benchmarks for parallel data analysis algorithms and tools.
Motivation and goals
Parallel data analysis accelerates the investigation of data sets of all sizes, and is indispensable when processing huge volumes of data. The current ubiquity of parallel hardware such as multi-core processors, modern GPUs, and computing clusters has created an excellent environment for this approach. However, exploiting these computing resources effectively requires significant effort due to the lack of mature frameworks, software, and even algorithms designed for data analysis in such computing environments.
As a result, parallel data analysis is often used only as a last resort, i.e., when the data size becomes too big for sequential data analysis. It is hardly ever used for analyzing small and medium-sized data sets, although it could also be beneficial there, e.g., by cutting compute time down from hours to minutes or even making the data analysis process interactive. The barrier to adoption is even higher for specialists from other areas such as the sciences, business, and commerce. These users often have to make do with slower, yet much easier to use, sequential programming environments and tools, regardless of the data size.
The seminar participants have tried to address these challenges by focusing on the following goals:
- Providing user-friendly parallel programming paradigms and cross-platform frameworks or libraries for easy implementation and experimentation.
- Designing efficient and scalable parallel algorithms for machine learning and statistical analysis in connection with an analysis of use cases.
The program
The seminar program consisted of individual presentations on new results and ongoing work, a plenary session, and work in two working groups. The primary role of the working groups was to foster collaboration among the participants, allowing cross-disciplinary knowledge sharing and insights. Work in one group is still ongoing and aims at a publication in a magazine.
The topics of the plenary session and the working groups were as follows:
- Panel "From Big Data to Big Money"
- Working group "A": Algorithms and applications
- Working group "P": Programming paradigms, frameworks and software
Participants
- Artur Andrzejak (Universität Heidelberg, DE)
- Ron Bekkerman (Carmel Ventures - Herzeliya, IL)
- Joos-Hendrik Böse (SAP SE - Berlin, DE)
- Sebastian Breß (Universität Magdeburg, DE)
- Patrick Briest (McKinsey&Company - Düsseldorf, DE)
- Jürgen Broß (FU Berlin, DE)
- Lutz Büch (Universität Heidelberg, DE)
- Michael J. Cafarella (University of Michigan - Ann Arbor, US)
- Surajit Chaudhuri (Microsoft Corporation - Redmond, US)
- Tyson Condie (Yahoo! Inc. - Burbank, US)
- Giuseppe Di Fatta (University of Reading, GB)
- Rodrigo Fonseca (Brown University - Providence, US)
- Johannes Fürnkranz (TU Darmstadt, DE)
- Joao Gama (University of Porto, PT)
- Joachim Giesen (Universität Jena, DE)
- Philipp Große (SAP SE - Walldorf, DE)
- Max Heimel (TU Berlin, DE)
- Yves J. Hilpisch (Visixion GmbH, DE)
- Anthony D. Joseph (University of California - Berkeley, US)
- George Karypis (University of Minnesota - Minneapolis, US)
- Shonali Krishnaswamy (Infocomm Research - Singapore, SG)
- Soeren Laue (Universität Jena, DE)
- Frank McSherry (Microsoft Corp. - Mountain View, US)
- Klaus Mueller (Stony Brook University, US)
- Jens K. Müller (Universität Jena, DE)
- Srinivasan Parthasarathy (Ohio State University - Columbus, US)
- Tom Peterka (Argonne National Laboratory, US)
- Raghu Ramakrishnan (Microsoft Corporation - Redmond, US)
- Ion Stoica (University of California - Berkeley, US)
- Domenico Talia (University of Calabria, IT)
- Alexandre Termier (University of Grenoble, FR)
- Markus Weimer (Microsoft Corporation - Redmond, US)
- Hans-Martin Will (SpaceCurve - Seattle, US)
- Matei Zaharia (University of California - Berkeley, US)
- Osmar Zaiane (University of Alberta - Edmonton, CA)
Classification
- artificial intelligence / robotics
- data bases / information retrieval
- data structures / algorithms / complexity
Keywords
- Parallel machine learning
- parallel data processing
- data mining
- software frameworks
- storage and database systems