Dagstuhl Seminar 14171
Evaluating Software Verification Systems: Benchmarks and Competitions
(Apr 21 – Apr 25, 2014)
Organizers
- Dirk Beyer (Universität Passau, DE)
- Marieke Huisman (University of Twente, NL)
- Vladimir Klebanov (KIT - Karlsruher Institut für Technologie, DE)
- Rosemary Monahan (NUI Maynooth, IE)
Contact
- Annette Beyer (for administrative matters)
In the last decade, technologies for the formal verification of software - such as abstract interpretation, deductive verification, and model checking - have been rapidly maturing and can be expected to complement and partly replace traditional software engineering methods. It has also become accepted wisdom that regular comparative evaluation of systems helps focus research, identify relevant problems, bolster development, and advance the field in general. Benchmark libraries and competitions are two popular evaluation approaches.
The desire to evaluate software verification systems has raised many still unanswered questions. What are the proper empirical approaches and criteria for effective comparative evaluation? What are the appropriate hardware and software environments? How to assess usability of verification systems? How to design, acquire, structure, publish, and use benchmarks and problem collections? For verification systems that require user interaction, there is an additional dimension of questions involving the human factor.
The seminar aims to advance comparative empirical evaluation of software verification systems by bringing together current and future competition organizers and participants, benchmark maintainers, as well as practitioners and researchers interested in the topic. In particular, it pursues the following goals:
- To advance the technical state of comparative empirical evaluation of software verification tools.
- To achieve cross-fertilization between verification communities on common/related issues such as selection of relevant problems, notions of correctness, questions of programming language semantics, etc.
- To explore common evaluation of different kinds of verification tools, its appropriate scope and techniques.
- To raise mutual awareness between verification communities concerning terminology, techniques and trends.
- To promote comparative empirical evaluation in the larger formal methods community.
The seminar program will include:
- Several half-day hands-on tool evaluation sessions in the style of competitions. We intend to experiment with different formats and setups.
- Discussions on open issues outlined above, further fueled by the results of evaluation sessions.
- Presentations on latest developments in the field.
- Several longer, tutorial-style presentations to give participants a better understanding of a particular tool and its capabilities, or to provide an overview of a particular class of tools.
Summary
The seminar aimed to advance comparative empirical evaluation of software verification systems by bringing together current and future competition organizers and participants, benchmark maintainers, as well as practitioners and researchers interested in the topic.
The objectives of the seminar were to (1) advance the technical state of comparative empirical evaluation of verification tools, (2) achieve cross-fertilization between verification communities on common/related issues such as selection of relevant problems, notions of correctness, questions of programming language semantics, etc., (3) explore common evaluation of different kinds of verification tools, its appropriate scope and techniques, (4) raise mutual awareness between verification communities concerning terminology, techniques and trends, and (5) promote comparative empirical evaluation in the larger formal methods community.
Altogether, 43 researchers and practitioners attended the seminar. The vast majority of attendees (almost 90%) had participated either in the SV-COMP or in the VerifyThis (and related, e.g., VSComp/VSTTE) verification competitions. For lack of better terms, we tend to refer to these communities as the "automatic verification" and the "deductive verification" community, respectively, though, as Section 1 discusses in more detail, these labels are based on pragmatics and history rather than on technical aspects.
The presentations, hands-on sessions, and discussions provided valuable feedback that will help competition organizers improve future installments. To continue the effort, a task force will compile a map of the now very diverse competition landscape in order to identify and promote useful evaluation techniques. It was agreed that evaluation involving both automatic and deductive verification tools would benefit both communities, as it would demonstrate the strengths and weaknesses of each approach. Both SV-COMP and VerifyThis will be associated with the ETAPS conference in 2015.
A call to the public: It has been reported that verification-competition challenges have been used as homework for students. The seminar organizers would appreciate feedback and experience reports from such exercises.
Seminar Structure
In contrast to typical workshops or conferences, the seminar was structured as a highly interactive event rather than a sequence of talks. The seminar opened with a session of lightning talks that gave every participant two minutes to introduce themselves and to outline their activities and interests in verification, and in comparative evaluation in particular.
In order to give participants insight into different verification techniques and into the practical capabilities of some existing verification tools, the seminar featured short overviews of the state of the art in deductive and in automatic verification, as well as several tutorials, hands-on sessions, and accompanying discussions. These included a longer tutorial on deductive verification with the Dafny system together with a hands-on session, as well as short mini-tutorials on the automatic verifiers CPAchecker and CBMC and the deductive verifiers KIV and VeriFast. Another hands-on session concluded the seminar.
Discussions on evaluation techniques and setups were initiated with presentations by competition organizers and benchmark collectors. The presented evaluation vehicles included: VerifyThis competition (deductive verification), SV-COMP (automatic verification), VSTTE competition (deductive verification), SMT-COMP (Satisfiability Modulo Theories), Run-time Verification competition, RERS challenge (Rigorous Examination of Reactive Systems), INTS benchmarks (Integer Numerical Transition Systems).
Since evaluation must be grounded in the requirements of current and prospective users of verification technology, the seminar incorporated contributions from industrial participants. Among them were a talk on the use of the SPARK deductive verification tool-set as a central tool for the development of high-integrity systems at Altran UK, a talk on the use of automatic verification tools in the Linux Driver Verification project with an accompanying discussion, and statements by other industry representatives (from GrammaTech, Galois, LLNL, and Microsoft Research).
Verification Communities: Remarks on Commonalities, Differences, and Terminology
Important goals of the seminar were to raise mutual awareness and to foster cross-fertilization between verification communities. We may habitually refer to communities as "automatic verification" or "deductive verification", but as time passes, these labels become less adequate.
An apparent trend is that different types of tools are slowly converging, both technically and pragmatically. Instances of both automatic and deductive verifiers may use symbolic execution or SMT solvers. Automatic verifiers can synthesize (potentially quantified) invariants and verify infinite-state or heap-centric systems.
The pace of development is high, and comprehensive surveys are costly (the last comprehensive survey on automatic verification appeared in 2008). As a consequence, community outsiders typically hold a view of verification technology that is outdated, sometimes by decades, and does not reflect the state of the art. We expect publications from competitions to fill the void between more comprehensive surveys.
One of the terminological pitfalls concerns the use of the attribute "automatic". For instance, model checking and data-flow analysis are typically advertised as "automatic". This is indeed true in the sense that the model-checking user does not have to supply proof hints such as loop invariants to the tool. On the other hand, using a model checker in practical situations may well require user interaction for the purpose of creating test drivers, choosing parameters, tuning abstractions, or interpreting error paths (which can be quite complex). These aspects are typically abstracted away during the evaluation of automatic verifiers, which allows better standardization but does not capture all aspects that are relevant in practice.
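To make the distinction concrete, here is a minimal C sketch (not taken from the seminar material; the constant N, the property, and the ACSL-like annotation are illustrative assumptions). The assertion states the property to be verified; an automatic verifier such as CBMC or CPAchecker is expected to check it without further hints, whereas a deductive verifier would typically require the loop invariant to be written down explicitly as a proof hint.

```c
#include <assert.h>

#define N 10  /* hypothetical bound chosen for illustration */

int main(void) {
    int sum = 0;
    /* A deductive verifier would typically need a proof hint here,
       e.g. an ACSL-like loop invariant (shown only as an illustration):
         loop invariant 0 <= i <= N + 1 && sum == i * (i - 1) / 2;
       An automatic verifier is expected to establish the assertion
       below without such hints. */
    for (int i = 0; i <= N; i++) {
        sum += i;
    }
    /* The property to be verified: the closed form of the sum. */
    assert(sum == N * (N + 1) / 2);
    return 0;
}
```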
The situation is further confused by the fact that some deductive verifiers are also advertised as "automatic", even though all established deductive verification systems require user interaction, and the amount of interaction needed by different tools does not differ radically. The main meaningful differences are rather:
- whether user interaction happens only at the beginning of a single proof attempt or whether the user can/has to intervene during proof construction, and
- whether user interaction happens in a purely textual manner or whether non-textual interaction is possible/required.
The seminar confirmed the need for improved terminology and made an attempt to eliminate misconceptions and communication pitfalls. Unfortunately, there is still no widely accepted and usable terminology to communicate these distinctions.
Participants
- Aws Albarghouthi (University of Toronto, CA) [dblp]
- Ezio Bartocci (TU Wien, AT) [dblp]
- Bernhard Beckert (KIT - Karlsruher Institut für Technologie, DE) [dblp]
- Dirk Beyer (Universität Passau, DE) [dblp]
- Marc Brockschmidt (Microsoft Research UK - Cambridge, GB) [dblp]
- David Cok (GrammaTech Inc. - Ithaca, US) [dblp]
- Gidon Ernst (Universität Augsburg, DE) [dblp]
- Marie Farrell (NUI Maynooth, IE)
- Jean-Christophe Filliâtre (University Paris-Sud, FR) [dblp]
- Bernd Fischer (University of Stellenbosch, ZA) [dblp]
- Alberto Griggio (Bruno Kessler Foundation - Trento, IT) [dblp]
- Radu Grigore (University of Oxford, GB) [dblp]
- Arie Gurfinkel (Carnegie Mellon University - Pittsburgh, US) [dblp]
- Matthias Heizmann (Universität Freiburg, DE) [dblp]
- Marieke Huisman (University of Twente, NL) [dblp]
- Bart Jacobs (KU Leuven, BE) [dblp]
- Alexey Khoroshilov (Russian Academy of Sciences - Moscow, RU) [dblp]
- Joseph Roland Kiniry (Galois - Portland, US) [dblp]
- Vladimir Klebanov (KIT - Karlsruher Institut für Technologie, DE) [dblp]
- K. Rustan M. Leino (Microsoft Corporation - Redmond, US) [dblp]
- Stefan Löwe (Universität Passau, DE) [dblp]
- Antoine Miné (ENS - Paris, FR) [dblp]
- Rosemary Monahan (NUI Maynooth, IE) [dblp]
- Wojciech Mostowski (University of Twente, NL) [dblp]
- Peter Müller (ETH Zürich, CH) [dblp]
- Petr Müller (Brno University of Technology, CZ) [dblp]
- Vadim Mutilin (Russian Academy of Sciences - Moscow, RU) [dblp]
- Andrei Paskevich (University Paris-Sud, FR) [dblp]
- Nadia Polikarpova (ETH Zürich, CH) [dblp]
- Arend Rensink (University of Twente, NL) [dblp]
- Philipp Rümmer (Uppsala University, SE) [dblp]
- Andrey Rybalchenko (Microsoft Research UK - Cambridge, GB) [dblp]
- Gerhard Schellhorn (Universität Augsburg, DE) [dblp]
- Markus Schordan (LLNL - Livermore, US) [dblp]
- Carsten Sinz (KIT - Karlsruher Institut für Technologie, DE) [dblp]
- Bernhard Steffen (TU Dortmund, DE) [dblp]
- Jan Strejcek (Masaryk University - Brno, CZ) [dblp]
- Michael Tautschnig (Queen Mary University of London, GB) [dblp]
- Mattias Ulbrich (KIT - Karlsruher Institut für Technologie, DE) [dblp]
- Jaco van de Pol (University of Twente, NL) [dblp]
- Angela Wallenburg (Altran UK - Bath, GB) [dblp]
- Philipp Wendler (Universität Passau, DE) [dblp]
- Thomas Wies (New York University, US) [dblp]
Classification
- semantics / formal methods
- verification / logic
Keywords
- Formal Verification
- Model Checking
- Deductive Verification
- Competitions and Benchmarks