Dagstuhl Seminar 14261
Software Development Analytics
(June 22 – June 27, 2014)
Organizers
- Harald Gall (Universität Zürich, CH)
- Tim Menzies (West Virginia University - Morgantown, US)
- Laurie Williams (North Carolina State University - Raleigh, US)
- Thomas Zimmermann (Microsoft Corporation - Redmond, US)
Software and its development generate an enormous amount of data. For example, check-ins, work items, bug reports, and test executions are recorded in version control systems and issue trackers such as CVS, Subversion, Git, and Bugzilla. Telemetry data, run-time traces, and log files reflect how customers experience software, including application and feature usage, and expose performance and reliability. The sheer amount is impressive:
- As of July 2013, Mozilla Firefox had 900,000 bug reports, and platforms such as SourceForge.net and GitHub hosted millions of projects with millions of users.
- Industrial projects have many sources of data at similar scale.
But how can this data be used to improve software? Software analytics takes this data and turns it into actionable insight to inform better decisions related to software. Analytics is commonly used in many businesses, notably in marketing, to better reach and understand customers. The application of analytics to software data is becoming more popular.
To a large extent, software analytics is about what we can learn and share about software. The data include not only our own projects but also software projects by others. Looking back at decades of research in empirical software engineering and mining software repositories, software analytics lets us share all of the following:
- Sharing insights. Specific lessons learned or empirical findings. An example is the finding by Christian Bird and colleagues that in Windows Vista it was possible to build high-quality software using distributed teams, provided that management was structured around code functionality.
- Sharing models. One of the earliest models, proposed by Fumio Akiyama, says that we should expect over a dozen bugs per 1,000 lines of code. In addition to defect models, many other models (for example, for effort estimation, retention, and engagement) can be built for software; a minimal sketch of a size-based defect model follows this list.
- Sharing methods. Empirical findings such as insights and models are often context-specific; for example, they may depend on the project that was studied. However, the method (the "recipe") used to create findings can often be applied across projects. By "methods" we mean the techniques by which we can transform data into insights and models.
- Sharing data. By sharing data, we can use and evolve methods to create better insight and models.
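To make model sharing concrete, here is a minimal sketch in Python of what a shared defect model can look like. It assumes the size-based form commonly attributed to Akiyama's 1971 study (roughly D = 4.86 + 0.018 * L for a module of L lines of code); the module names and sizes are hypothetical, and in practice the coefficients would be refit to a project's own defect data.

```python
# A minimal sketch of a shared defect model, using the size-based
# regression commonly attributed to Akiyama (1971):
#   D = 4.86 + 0.018 * L, where L is module size in lines of code.
# The module names and sizes below are invented for illustration.

def akiyama_defects(loc: int) -> float:
    """Estimated defect count for a module of `loc` lines of code."""
    return 4.86 + 0.018 * loc

modules = {"parser.c": 1200, "ui.c": 450, "net.c": 3100}

for name, loc in modules.items():
    estimate = akiyama_defects(loc)
    density = estimate / (loc / 1000)  # defects per KLOC
    print(f"{name}: ~{estimate:.1f} defects (~{density:.1f} per KLOC)")
```

The value of sharing such a model lies less in its specific coefficients, which are context-specific, than in its reusable form: a simple relationship between size and defects that other projects can recalibrate on their own data.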
The goal of this seminar was to build a roadmap for future work in this area. Despite many achievements, there are several challenges ahead for software analytics:
- How can we make data useful to a wide audience, not just to developers but to anyone involved in software?
- What can we learn from the vast amount of unexplored data?
- How can we learn from incomplete or biased data?
- How can we better tie usage analytics to development analytics?
- When and what lessons can we take from one project and apply to another?
- How can we establish smart data science as a discipline in software engineering practice and research as well as education?
Seminar Format
In this seminar, we brought together researchers and practitioners from academia and industry who are interested in empirical software engineering and mining software repositories to share their insights, models, methods, and/or data. Before the seminar, we gathered input from the participants through an online survey on relevant themes and papers. Most themes from the survey fell into the categories of methods (e.g., measurement, visualization, combining qualitative with quantitative methods), data (e.g., usage/telemetry, security, code, people), and best practices and fallacies (e.g., how to choose techniques, how to deal with noise and missing data, correlation vs. causation). Another theme that emerged in the pre-Dagstuhl survey was analytics for the purpose of theory formation, i.e., "data analysis to support software engineering theory formation (or, data analytics in support of software science, as opposed to software engineering)".
At the seminar, we required that attendees
- discuss the next generation of software analytics;
- contribute to a Software Analytics Manifesto that describes the extent to which software data can be exploited to support decisions related to development and usage of software.
Attendees were also asked to outline a set of challenges for analytics on software data, to help focus the research effort in this field. The seminar provided ample opportunity for discussion and a platform for collaboration between attendees, since our time was divided equally between:
- Plenary sessions where everyone gave short (10-minute) presentations on their work.
- Breakout sessions where focus groups worked on shared tasks.
Our schedule was very dynamic. Each day ended with a "think-pair-share" session where some focus for the next day was debated first in pairs, then shared with the whole group. Each night, the seminar organizers would take away the cards generated in the "think-pair-share" sessions and use that feedback to reflect on how to adjust the next day's effort.
Participants
- Bram Adams (Polytechnique Montreal, CA) [dblp]
- Alberto Bacchelli (TU Delft, NL) [dblp]
- Ayse Basar Bener (Ryerson University - Toronto, CA) [dblp]
- Trevor Carnahan (Microsoft Corporation - Redmond, US) [dblp]
- Serge Demeyer (University of Antwerp, BE) [dblp]
- Premkumar T. Devanbu (University of California - Davis, US) [dblp]
- Stephan Diehl (Universität Trier, DE) [dblp]
- Michael W. Godfrey (University of Waterloo, CA) [dblp]
- Alessandra Gorla (Universität des Saarlandes, DE) [dblp]
- Georgios Gousios (TU Delft, NL) [dblp]
- Mark Grechanik (University of Illinois - Chicago, US) [dblp]
- Michaela Greiler (Microsoft Corporation - Redmond, US) [dblp]
- Abram Hindle (University of Alberta - Edmonton, CA) [dblp]
- Reid Holmes (University of Waterloo, CA) [dblp]
- Miryung Kim (University of Texas - Austin, US) [dblp]
- A. J. Ko (University of Washington - Seattle, US) [dblp]
- Lucas M. Layman (Fraunhofer USA - College Park, US) [dblp]
- Andrian Marcus (Wayne State University, US) [dblp]
- Nenad Medvidovic (University of Southern California - Los Angeles, US) [dblp]
- Tim Menzies (West Virginia University - Morgantown, US) [dblp]
- Leandro L. Minku (University of Birmingham, GB) [dblp]
- Audris Mockus (Avaya - Basking Ridge, US) [dblp]
- Brendan Murphy (Microsoft Research UK - Cambridge, GB) [dblp]
- Meiyappan Nagappan (Rochester Institute of Technology, US) [dblp]
- Alessandro Orso (Georgia Institute of Technology - Atlanta, US) [dblp]
- Martin Pinzger (Alpen-Adria-Universität Klagenfurt, AT) [dblp]
- Denys Poshyvanyk (College of William and Mary - Williamsburg, US) [dblp]
- Venkatesh-Prasad Ranganath (Kansas State University, US) [dblp]
- Romain Robbes (University of Chile - Santiago de Chile, CL) [dblp]
- Martin Robillard (McGill University - Montreal, CA) [dblp]
- Guenther Ruhe (University of Calgary, CA) [dblp]
- Per Runeson (Lund University, SE) [dblp]
- Anita Sarma (University of Nebraska - Lincoln, US) [dblp]
- Emad Shihab (Concordia University - Montreal, CA) [dblp]
- Diomidis Spinellis (Athens University of Economics and Business, GR) [dblp]
- Margaret-Anne Storey (University of Victoria, CA) [dblp]
- Burak Turhan (University of Oulu, FI) [dblp]
- Stefan Wagner (Universität Stuttgart, DE) [dblp]
- Patrick Wagstrom (IBM TJ Watson Research Center - Yorktown Heights, US) [dblp]
- Jim Whitehead (University of California - Santa Cruz, US) [dblp]
- Laurie Williams (North Carolina State University - Raleigh, US) [dblp]
- Dongmei Zhang (Microsoft Research - Beijing, CN) [dblp]
- Thomas Zimmermann (Microsoft Corporation - Redmond, US) [dblp]
Classification
- software engineering
Keywords
- Software development
- Data-driven decision making
- Analytics
- Empirical software engineering
- Mining software repositories
- Business intelligence
- Predictive analytics