Dagstuhl Seminar 15222
Human-Centric Development of Software Tools
(May 25 – May 28, 2015)
Organizers
- A. J. Ko (University of Washington - Seattle, US)
- Shriram Krishnamurthi (Brown University - Providence, US)
- Gail C. Murphy (University of British Columbia - Vancouver, CA)
- Janet Siegmund (Universität Passau, DE)
Contact
- Annette Beyer (for administrative matters)
Software development has always been a human activity. From the 1940s, when a select few programmed a small number of mainframes, to today, when tens of millions of people worldwide create software for a wide range of purposes, the human aspects of software development have only become more significant. Unusable software-development tools add cost, error, and complexity to the already expensive and failure-prone task of software development. Unusable environments for learning software development discourage badly needed talent from entering the field. Progress on the usefulness and usability of languages and tools for software development promises improvements in software quality, reliability, and developer productivity.
One might reasonably expect these challenges to make human factors in software development a core research topic. Unfortunately, this has not been the case. For example, we found that while 82 of 87 papers at ICSE 2012 (the major software-engineering conference) included some form of empirical evaluation, at most 29% of those evaluations involved human participants. Moreover, most evaluations assess only the feasibility of new tools, rarely considering their utility in practice.
In our experience, this is because of two critical obstacles: First, few computer scientists have training or experience conducting empirical studies with human participants. Second, research conferences on software tools and engineering lack reviewers with sufficient human-factors expertise. Reviewers often criticize human-factors results based on "common sense", which can undersell the value of seemingly obvious results. Worse, reviewers often underestimate the soundness of an empirical design and reject methodologically sound studies with comments like "the results are too obvious" or "the results have no practical relevance, because students were used as participants". Preparing more computer-science researchers who develop software or design programming languages to conduct human-factors studies is a viable way to address both problems.
The overarching goal of this seminar was to raise the level of engagement and discourse about human factors in software engineering and programming languages. We set out to identify significant challenges and best practices in the human-centered design and evaluation of programming languages and tools, opportunities to improve the quality of scientific discourse and progress on human aspects of software development, and opportunities to improve how we educate researchers to conduct sound human-centered evaluations in the context of software engineering.
Across our sessions, we discussed central issues in research on the design of human-centric developer tools. In this summary, we present the key insights from each of these areas and actionable next steps for maturing the field of human-centered developer tools.
Key Insights
Theories
Theories are a hugely important but underused aspect of our research. They help us start with an explanation, they help us explain and interpret the data we collect, they help us relate our findings to others' findings, and they give us vocabulary and concepts for organizing our thinking about a phenomenon.
There are many relevant theories that we should be using:
- Attention investment is helpful in explaining why people choose to engage in programming.
- Information foraging theory helps explain where people choose to look for relevant information in code.
- Community of practice theory helps us explain how people choose to develop skills over time.
There are useful methods for generating theories, including grounded theory and participatory design; both can result in explanations of phenomena. That said, relevant theories often already exist, and we do not always need to create our own.
While theories are the pinnacle of knowledge, there's plenty of room for "useful knowledge" that helps us ultimately create and refine better theories. Much of the research we do now generates this useful knowledge and will eventually lead to more useful theories.
Study Recruitment
Whether developers agree to participate in a study depends on several factors:
- One factor is how much value developers perceive in participating. Value might be tangible (a gift card, a bottle of champagne), or personal (learning something from participation, or getting to share their opinion about something they are passionate about).
- Another factor is whether the requestor is part of the developers' in-group (e.g., being part of their organization, having a representative from their community conduct the research or recruit on the researcher's behalf, or becoming part of their community before asking for their effort).
- The cost of participating has to be low, or at least low enough relative to the perceived benefit.
With these factors in mind, there is a wide range of clever and effective ways to recruit participants:
- Monitor bug databases for changes and gather data at the moment an event occurs. This makes the request timely and minimizes the cost of recall (a minimal sketch of this approach appears after this list).
- Find naturalistic captures of people doing software engineering work (such as tutorials, walkthroughs, and other recorded content that developers create). This costs developers nothing.
- Perform self-ethnographies or diary studies. This has some validity issues, but provides a rich source of data.
- Tag your own development work through commits to gather interesting episodes.
- Find where developers are and interview them there (e.g., the Microsoft bus stop, developer conferences), and generate low-cost, high-value ways of getting their attention (and data).
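To make the first recruitment strategy concrete, here is a minimal sketch of event-triggered recruitment, assuming GitHub's public issues API as the bug database; the repository name, survey link, and polling interval are hypothetical placeholders, and any real deployment would of course need consent and ethics approval.

    # Sketch: poll a project's issue tracker and invite the reporter to a short
    # survey the moment a new bug report appears, so the request is timely and
    # the cost of recall is low. Repository, survey URL, and interval are
    # illustrative placeholders.
    import time
    from datetime import datetime, timezone

    import requests

    REPO = "someorg/someproject"               # hypothetical repository
    SURVEY_URL = "https://example.org/survey"  # hypothetical survey link
    POLL_SECONDS = 300

    def issues_updated_since(timestamp):
        """Return issues updated since `timestamp` (ISO 8601 string)."""
        resp = requests.get(
            f"https://api.github.com/repos/{REPO}/issues",
            params={"since": timestamp, "state": "all"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    def main():
        last_check = datetime.now(timezone.utc).isoformat()
        while True:
            for issue in issues_updated_since(last_check):
                reporter = issue["user"]["login"]
                # In a real study this would be a consented contact, e.g. a
                # comment or email inviting the reporter to a 2-minute survey.
                print(f"Invite {reporter} about issue #{issue['number']}: {SURVEY_URL}")
            last_check = datetime.now(timezone.utc).isoformat()
            time.sleep(POLL_SECONDS)

    if __name__ == "__main__":
        main()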
Research Questions
There was much discussion at the seminar about research questions and what makes a good one. There was broad agreement that our questions should be more grounded in theories, so that we can better build upon each other's work.
Many researchers also find that human-centered empirical studies produce results that are not directly meaningful or actionable to others. There are many possible reasons for this:
- We often don't choose research questions with more than one plausible outcome.
- We often don't report our results in a way that creates conflict and suspense. We need to show readers that there are many possible outcomes.
- We often ask "whether" questions, rather than "why" or "when" questions about tools, leading to limited, binary results rather than richer, more subtle contributions.
Some of our research questions have validity issues that make them problematic:
- Research questions often reflect a poor understanding of the populations they ask about.
- Research questions often concern designing tools for people who are already designing tools for themselves. Instead, researchers should build tools that have never existed, not better versions of tools that already exist.
One opportunity for working with researchers who are less human-centered is to collaborate on formative research that shapes research directions and discovers new opportunities for the field. This may create more positive perceptions of our skills, impact, and relevance to the broader fields of PL and SE.
Human-Centeredness
Historically, HCI has focused on end-user experiences rather than developer experiences, although HCI researchers have increasingly turned to developers and developer tools. However, HCI often does not consider the culture and context of software engineering, nor the longitudinal, long-term factors in education and skill acquisition, so HCI may not be a sufficient lens through which to understand software engineering.
There is also a need to address low-end developers, not just "experts". Future research topics include understanding the learnability of APIs, understanding the experiences of engineers from a sociological perspective (as in studies such as Bucciarelli's), and thinking about tools from a knowledge-prerequisite perspective.
Developer Knowledge Modeling
Much of what makes a developer effective is the knowledge in their mind, but we know little about what this knowledge is, how developers acquire it, how to measure and model it, and how to use these models to improve tools or enable new categories of tools. There are many open opportunities in this space that could lead to powerful new understandings about software engineering expertise and powerful new tools to support software engineering. Much of this new work can leverage research in education and learning sciences to get measures of knowledge.
Leveraging Software Development Analytics
We identified different types of data that might be collected on programming processes and products, including editing activities, compilation attempts and errors, execution attempts and errors, and check-ins. We considered ways in which these data could be enlisted to help improve teaching and learning, as well as the software development process (one such intervention is sketched after the list below):
- Automating interventions to improve programming processes.
- Presenting data visually to aid decision making.
- Generating notifications that could inform learners, teachers, and software developers of key events.
- Generating social recommendations.
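As one concrete illustration of the notification idea, here is a minimal sketch of an automated intervention over compilation-event data; the event schema, threshold, and time window are assumptions made for the example, not a design agreed on at the seminar.

    # Sketch: if a learner triggers the same compiler error repeatedly within a
    # short window, emit a notification suggesting help resources.
    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class CompileEvent:
        student: str
        error_code: str   # e.g. "cannot find symbol"
        timestamp: float  # seconds since some epoch

    def repeated_error_alerts(events, threshold=3, window=600):
        """Yield (student, error_code) once the pair occurs `threshold` times within `window` seconds."""
        recent = defaultdict(list)
        for e in sorted(events, key=lambda ev: ev.timestamp):
            key = (e.student, e.error_code)
            times = recent[key] + [e.timestamp]
            # keep only timestamps inside the sliding window
            recent[key] = [t for t in times if e.timestamp - t <= window]
            if len(recent[key]) == threshold:
                yield key

    events = [
        CompileEvent("alice", "cannot find symbol", 0),
        CompileEvent("alice", "cannot find symbol", 120),
        CompileEvent("alice", "cannot find symbol", 300),
    ]
    for student, error in repeated_error_alerts(events):
        print(f"Notify: {student} keeps hitting '{error}'; suggest help resources.")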
These opportunities raise several questions:
- How do we leverage data to intervene in educational and collaborative software development settings?
- How do we design visual analytics environments to aid in decision making?
- Should interventions be automated, semi-automated, or manual? What are the trade-offs?
Error Messages
We identified five broad classes of errors: (1) syntactic (conformance to a grammar), (2) type, (3) run-time (safety checks in a run-time system, such as array bounds, division by zero, etc.), (4) semantic (logical errors that aren't run-time errors), and (5) stylistic. We distinguished between errors and more general forms of feedback, acknowledging that both need support; in particular, each of these could leverage some common presentation guidelines.
We discussed why research has tended to focus more on errors for beginners than on feedback for developers. Issues raised included the different scales of problems to diagnose across the two cases and differences in social norms around asking other people for help (developers might be less likely to ask others for help in order to protect their professional reputations). We discussed whether tools should report all errors or just some of them, and whether tools should try to prioritize among errors when presenting them; these questions had different nuances for students and for practicing developers. We discussed the example of the Coverity tool presenting only a subset of errors, since presenting all of them might lead developers to reject the tool for finding too much fault in their code.
We discussed and articulated several principles for presenting errors: (1) use different visual patterns to distinguish different kinds of errors; (2) don't mislead users by giving incorrect advice on how to fix an error; (3) use multi-dimensional or multi-modal techniques to reveal error details incrementally; (4) when possible, allow programs to fail gracefully in the face of an error (for example, soft typing moved type errors into run-time errors that only tripped when a concrete input triggered the error -- this gives the programmer some control over when to engage with the error after it arises); (5) consider ways to allow the user to query the system to narrow down the cause of the error (rather than require them to debug the entire program).
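To illustrate principles (1) and (3), here is a minimal sketch of a diagnostic object that tags its error class and reveals details only on request; the class names, markers, and rendering are hypothetical, not drawn from any tool discussed at the seminar.

    # Sketch: tag each diagnostic with its error class (principle 1) and show a
    # one-line summary first, expanding to full details on request (principle 3).
    from dataclasses import dataclass
    from enum import Enum, auto

    class ErrorClass(Enum):
        SYNTACTIC = auto()
        TYPE = auto()
        RUNTIME = auto()
        SEMANTIC = auto()
        STYLISTIC = auto()

    MARKERS = {
        ErrorClass.SYNTACTIC: "[syntax]",
        ErrorClass.TYPE: "[type]",
        ErrorClass.RUNTIME: "[runtime]",
        ErrorClass.SEMANTIC: "[logic]",
        ErrorClass.STYLISTIC: "[style]",
    }

    @dataclass
    class Diagnostic:
        error_class: ErrorClass
        location: str   # e.g. "stats.py:17"
        summary: str    # one line, always shown
        details: str    # longer explanation, shown only on request

        def render(self, expanded=False):
            first_line = f"{MARKERS[self.error_class]} {self.location}: {self.summary}"
            if not expanded:
                return first_line + "  (expand for details)"
            return first_line + "\n    " + self.details

    d = Diagnostic(
        ErrorClass.RUNTIME,
        "stats.py:17",
        "division by zero",
        "The divisor 'count' was 0 when this line ran; check whether the input list can be empty.",
    )
    print(d.render())
    print(d.render(expanded=True))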
There are several open research questions:
- Should error and feedback systems become interactive, asking the user questions to help diagnose a more concrete error (rather than report a more abstract one, as often happens with compiler syntax errors)?
- Can grammars be tailored to domain-specific knowledge to yield more descriptive error messages?
- Can patterns of variable names be used to enforce conventions and reduce the rates of some kinds of errors?
- At what point should error systems expect the user to consult another human, rather than rely only on the computer?
- When is it more helpful to show all errors (assuming we can even compute that) versus a selection of errors? How much detail should be presented about an error at first? Does presenting all information discourage users from reading error messages?
Reviewing
Researchers working on human aspects of software engineering perceive a strong sense of hostility towards human-centered research in peer review, despite some recent successes in some software engineering venues. Reasons for this hostility include:
- Many human-centered researchers evaluate and critique tools without offering constructive directions forward. This creates a perception that human-centered researchers dislike or hate the research that others are doing.
- Many human-centered researchers are focused on producing understanding, whereas other researchers are focused on producing better tools. This goal mismatch causes reviewers to apply inappropriate criteria to the importance and value of research contributions.
- Many research communities in programming languages and software engineering still lack sufficient methodological expertise to properly evaluate human-centered empirical work.
- It is not explicit in reviewing whether a reviewer's methodological expertise is a good match for a paper; reviewers may be expert in a topic but not in a method, which leads to topic-expertise matches without methodological-expertise matches.
- Many challenges in reviewing come from the difference between judging a paper's validity and judging how interesting it is. Non-human-centered researchers often do not find our questions interesting.
We are often our own worst enemies in reviews. We often reject each other's work because we are too rigid about methods (e.g., rejecting papers because of missing inter-rater reliability). On the other hand, we have to maintain standards. There is a lot of room for creativity in establishing rigor in ways that satisfy reviewers, and we should allow for these creative ways of validating and verifying our interpretations.
Methods Training
Empirical methods are not popular to learn. When our students and colleagues do decide to learn them, however, there are many papers, textbooks, classes, and workshops that teach the basic concepts of human-subjects research in software engineering.
There are many strategies we might employ to broadly increase methodological expertise in our research communities:
- We should spend more time in workshops and conferences teaching each other how to do methods well.
- Software engineers need to learn empirical methods too, and teaching these methods to undergraduates will lead to increased literacy among graduate students.
- There is much we can do to consolidate and share teaching resources that would make this instruction much more efficient.
- HCI research methods are broadly applicable and there are many more places to learn them.
There are not yet good methods for researching learning issues. Moreover, most empirical methods cannot be learned quickly. We must devise ways of teaching these methods to students and researchers over long periods of time.
Participants
- Andrew Begel (Microsoft Research - Redmond, US) [dblp]
- Alan Blackwell (University of Cambridge, GB) [dblp]
- Margaret M. Burnett (Oregon State University, US) [dblp]
- Rob DeLine (Microsoft Corporation - Redmond, US) [dblp]
- Yvonne Dittrich (IT University of Copenhagen, DK) [dblp]
- Kathi Fisler (Worcester Polytechnic Institute, US) [dblp]
- Thomas Fritz (Universität Zürich, CH) [dblp]
- Mark J. Guzdial (Georgia Institute of Technology - Atlanta, US) [dblp]
- Stefan Hanenberg (Universität Duisburg-Essen, DE) [dblp]
- James D. Herbsleb (Carnegie Mellon University, US) [dblp]
- Johannes C. Hofmeister (Heidelberg, DE) [dblp]
- Reid Holmes (University of Waterloo, CA) [dblp]
- Christopher D. Hundhausen (Washington State University - Pullman, US) [dblp]
- Antti-Juhani Kaijanaho (University of Jyväskylä, FI) [dblp]
- A. J. Ko (University of Washington - Seattle, US) [dblp]
- Rainer Koschke (Universität Bremen, DE) [dblp]
- Shriram Krishnamurthi (Brown University - Providence, US) [dblp]
- Gail C. Murphy (University of British Columbia - Vancouver, CA) [dblp]
- Emerson Murphy-Hill (North Carolina State University - Raleigh, US) [dblp]
- Brad Myers (Carnegie Mellon University, US) [dblp]
- Barbara Paech (Universität Heidelberg, DE) [dblp]
- Christopher J. Parnin (North Carolina State University - Raleigh, US) [dblp]
- Lutz Prechelt (FU Berlin, DE) [dblp]
- Peter C. Rigby (Concordia University - Montreal, CA) [dblp]
- Martin Robillard (McGill University - Montreal, CA) [dblp]
- Tobias Röhm (TU München, DE) [dblp]
- Dag Sjøberg (University of Oslo, NO) [dblp]
- Andreas Stefik (Univ. of Nevada - Las Vegas, US) [dblp]
- Harald Störrle (Technical University of Denmark - Lyngby, DK) [dblp]
- Walter F. Tichy (KIT - Karlsruher Institut für Technologie, DE) [dblp]
- Claes Wohlin (Blekinge Institute of Technology - Karlskrona, SE) [dblp]
- Thomas Zimmermann (Microsoft Corporation - Redmond, US) [dblp]
Classification
- programming languages / compiler
- society / human-computer interaction
- software engineering
Keywords
- Human Factors in Software Tools
- Empirical Evaluation of Software Tools
- Programming-Language and -Tool Design