Dagstuhl Seminar 23341
Functionally Safe Multi-Core Systems
( Aug 20 – Aug 25, 2023 )
Organizers
- Iain Bate (University of York, GB)
- Thidapat (Tam) Chantem (Virginia Polytechnic Institute & State University - Arlington, US)
- Louise Harney (Leonardo UK Ltd - Edinburgh, GB)
- Claire Maiza (University of Grenoble, FR)
- Georg von der Brüggen (TU Dortmund, DE)
Coordinator
- Ian Gray (University of York, GB)
Contact
- Michael Gerke (for scientific matters)
- Christina Schwarz (for administrative matters)
Impacts
- Work-in-Progress: Impacts of Critical-Section Granularity When Accessing Shared Resources : article in 2023 IEEE Real-Time Systems Symposium (RTSS) - Amert, Tanya; Nemitz, Catherine E. - Los Alamitos : IEEE, 2023. - pp. 439-442.
- Taking One for the Team : Trading Overhead and Blocking for Optimal Critical-Section Granularity with a Shared GPU : article in RTNS '24: Proceedings of the 32nd International Conference on Real-Time Networks and Systems - Amert, Tanya; Nemitz, Catherine E. - New York : ACM, 2024. - Pages 94 - 104.
There is a significant problem on the horizon in the field of embedded and real-time systems. The traditional approach for certifying high-integrity systems is no longer possible due to the increased complexity of modern applications, and the hardware platforms on which they execute are becoming heterogeneous multi-core systems on chip.
Considering how functional safety can be provided for multi-core systems, this Dagstuhl Seminar aimed to bring together experts from both academia and industry as well as participants from the three relevant layers, namely, application, middleware, and platform. The goal was to look at these topics from each individual perspective and to inspire interesting and fruitful discussion.
Motivation
High-integrity domains such as automotive and avionics represent an important success story for academia. There is a long and storied history of knowledge transfer from academia into these domains, taking theory, approaches, and tooling and using it to build and certify complex systems in which safety is paramount.
This seminar is organised at a time when the domain is in an unprecedented moment of flux. Application complexity due to increasing consumer demand has skyrocketed over the last decade, and so to meet these demands, hardware manufacturers have created increasingly complex hardware platforms. Where previously software would be used for basic control, engine management, and rudimentary driver assist systems, increasing amounts of automation and high-fidelity entertainment are being deployed. Current estimates suggest that a high-end vehicle runs over one hundred million lines of code, and that number will only increase. A corresponding increase in complexity has also been observed in the most traditionally conservative domains, such as avionics and space.
The traditional approach to such systems would use simple processing devices upon which small sets of tasks could be allocated. Existing certification of real-time systems could work well in these domains, using CPU models and scheduling theory in order to determine the worst-case response times of all tasks running on each device. The use of timing-aware interconnects, such as the CAN bus, could allow system integrators to ensure that safety requirements could be discharged at the top level.
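As an illustration of the kind of scheduling theory referred to above, the following is a minimal sketch of classical fixed-priority response-time analysis on a single core. It is not taken from the seminar report. It assumes independent, preemptively scheduled periodic tasks with deadlines no larger than their periods, no blocking, and no overheads; the function name `response_time` and the example task parameters are hypothetical.

```python
def response_time(tasks, i):
    """Worst-case response time of tasks[i] on one core, or None if its
    deadline is exceeded.

    tasks: list of (C, T, D) integer tuples, sorted highest priority first,
           where C = worst-case execution time, T = period, D = deadline.
    Assumes independent tasks, preemptive fixed priorities, and D <= T.
    """
    c_i, _, d_i = tasks[i]
    r = c_i  # start the fixed-point iteration from the task's own WCET
    while True:
        # Interference from every higher-priority task released in [0, r).
        # -(-r // t_j) is an integer ceiling of r / t_j.
        interference = sum(-(-r // t_j) * c_j for c_j, t_j, _ in tasks[:i])
        r_next = c_i + interference
        if r_next > d_i:
            return None  # deadline exceeded: the task is not schedulable
        if r_next == r:
            return r  # fixed point reached: this is the response-time bound
        r = r_next


# Hypothetical task set: (C, T, D) in arbitrary time units.
example = [(1, 4, 4), (2, 6, 6), (3, 12, 12)]
bounds = [response_time(example, i) for i in range(len(example))]
print(bounds)
```

The iteration only yields a valid bound because the single-core model makes interference a simple function of higher-priority releases. On a multicore platform, contention for shared caches and buses breaks this assumption, which is the difficulty the following paragraphs describe.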
Unfortunately, the theory has not kept pace with reality. The next generation of safety-critical systems will undoubtedly use multicore processors because there is an increasing need for performance and the availability of single-core processors is reducing. It simply is no longer possible to deliver the kinds of applications that are being demanded by consumers on simple, single-core devices over which we have a high degree of certainty.
The currently available techniques for analysing the timing of such multicore devices are limited. Equally, certification authorities are only just beginning to catch up with this new reality, meaning that the certification needs of systems using multi-core are not clear. Recently, the civil aviation industry has produced some guidance in the form of CAST-32A; however, this is raising more questions than it is answering.
Safety-critical systems need strong guarantees of their timing behaviors. This includes evidence of when the timing requirements are met, and then evidence for the loss of availability of certain functions in the other cases. It is crucially important that both the timing requirement and loss of service are commensurate with the system safety function that comes from the application. The challenge in providing such evidence comes from the platform’s shared resources, e.g., caches and buses. With the introduction of multicore, this has become more complex due to reduced predictability. The unpredictability can be managed through the middleware, where the resource management exists, but only with appropriate consideration across the three layers during their design and subsequent composition.
This Dagstuhl Seminar aimed to bring together practitioners from three disciplines which represent the three layers (application, middleware, and platform) relevant to safety-critical systems that use multicore to understand how the safety of a system using multicores may be argued; the achievable evidence that can be produced; and how said systems might then be developed.
Program and Structure
This seminar stands at the inflection point for high-integrity systems. There is significant debate and disagreement in both industry and academia about the manner in which future development should progress. It is unknown to what extent it is possible to maintain the same hard real-time guarantees that we have been used to in the past. Modern systems and modern devices simply do not provide the same level of certainty that we have relied upon. Simultaneously, systems are being asked to do even more things for which safety and certification are arguably more important than before.
This seminar was therefore structured to attempt to capture and advance this discussion. The main goal was to put academic leaders and industrial practitioners together in the same room, and to provide space for discussions and disagreements. The overarching goal was to agree as a room on what we believe the most important open questions were, and how research can be best structured over the coming decade to support the growing needs of the industry. Reciprocally, we hoped to capture what industry needs to provide academia, so that its needs can be met.
The first day of the seminar was arranged in advance, and then all subsequent sessions were arranged based on the discussions and the results that were obtained on the previous day. Sessions ran for 75 minutes with two sessions in the morning and two in the afternoon. These sessions usually started with a short talk or a controversial statement followed by extensive discussion. Therefore, in the full report, we report a brief summary of the discussion in the sections instead of a summary of invited talks. When two sections discussed the same topic they are summarized together. One topic that was of specific interest to the participants from academia was the industry perspective; thus, multiple sessions had an industry focus.
Takeaways
Most takeaways from this seminar came from the open-ended structure and heavy focus on collaboration. A key outcome was the codification of a range of the open research challenges in this area and the way that we might as a community work towards them; these are detailed in Section 14 of the full report. Other important points that were highlighted through the discussions were:
- Just because a problem is academically interesting to solve, doesn't mean that it is necessarily a key research question. Instead of focussing on specific application domains, it is important to remember the value in more generalised system models and approaches.
- A system designer doesn't specifically care about worst-case response time (WCRT); they care about the argumentation chain that will help them discharge their safety requirements. This often involves priorities, deadlines, criticality, and timing analysis, but they are means to an end.
- Industry is very eager to use academic results, but certification requirements mean that they are often unable to do so without mature tool support. Finding a way to solve this problem will massively increase the impact of research.
- The entire certification process is under-served by both academia and industry. Academics have little visibility into the system, and industry is incentivised to remain insular in its approach to the problem. There is a possibility for opening this process up through collaboration and public funding.
- Hardware vendors are unclear on what is actually required by the application layer and the theory. Important future work will attempt to answer just how much predictability we actually need in any given circumstance. This is important because current high-end hardware displays fundamentally unpredictable timing from the perspective of the end user, and so exact timing models cannot exist.
- There is a need to develop more architectural description technologies that can provide more appropriate guarantees that can support timing analysis without being unnecessarily detailed or overly specific. This would involve the standardisation of performance counters which are currently ad hoc.
Feedback from the seminar shows that this intent was captured well. Participant feedback said that "this seminar has had one of the most successful industry-academic collaborations I've experienced in a Dagstuhl Seminar" and others praised the networking opportunities and the "engaging environment for our industrial participants" with "eye-opening panels and discussions". The main weakness noted by participants was the less formalised structure, which suggests that a better balance could be struck between discussion and networking opportunities and more formalised talks.
The next generation of safety-critical systems will undoubtedly use multicore processors as there is an increasing need for performance and the availability of single-core processors is reducing. At the same time, practitioners are recognizing that the certification needs of systems using multi-core are not clear and the techniques available to meet any needs that are produced are limited. Recently, the civil aviation industry has produced some guidance in the form of CAST-32A; however, this is raising more questions than it is answering.
Safety-critical systems need strong guarantees of their timing behaviors. This includes evidence of when the timing requirements are met, and then evidence for the loss of availability of certain functions in the other cases. It is crucially important that both the timing requirement and loss of service are commensurate with the system safety function which comes from the application. The challenge in providing such evidence comes from the platform’s shared resources, e.g., caches and buses, and with the introduction of multicore, this has become more complex due to reduced predictability. The unpredictability can be managed through the middleware, where the resource management exists, but only with appropriate consideration across the three layers during their design and subsequent composition.
The aim of this Dagstuhl Seminar is to bring together practitioners from three disciplines which represent the three layers relevant to safety-critical systems that use multicore to understand: how the safety of a system using multicores may be argued; the achievable evidence that can be produced; and how said systems might then be developed. The seminar will be organized through three strands which represent the three key layers of systems: application; middleware; and the platform.
First, a common understanding among practitioners should be found, for each of the three layers determining a set of properties (describing, e.g., timing, performance, and predictability requirements) needed to provide functional safety and its verification and/or a set of functionalities that can be provided supporting functional safety and its verification. Afterward, it should be determined which of the necessary properties are already covered by the provided functionality and which others can realistically be achieved, e.g., by reducing performance to increase predictability, and what are the related costs, e.g., how much is the performance reduced. Furthermore, it should be discussed what solutions are possible to achieve properties that cannot be guaranteed by current hardware or middleware, and how functional requirements can be reduced without reducing the predictability too much.
It is envisaged that the seminar will have many outputs and benefits beyond the ability to informally network across companies, institutions, and domains. The key outputs we aim for will be: a report highlighting what the industry needs in terms of tools and techniques; the dependencies across the various layers; the establishment of some key research challenges; identification of suitable benchmarks for collaborative and comparative research; and finally a route map towards the efficient and effective achievement of assurance arguments.
- Sebastian Altmeyer (Universität Augsburg, DE)
- Tanya Amert (Carleton College - Northfield, US)
- Matteo Andreozzi (Arm - Cambridge, GB)
- Sanjoy Baruah (Washington University - St. Louis, US)
- Jan Micha Borrmann (Robert Bosch GmbH - Stuttgart, DE)
- Timothy Bourke (INRIA & ENS Paris, FR)
- Björn B. Brandenburg (MPI-SWS - Kaiserslautern, DE)
- Jian-Jia Chen (TU Dortmund, DE)
- Christian Ferdinand (AbsInt - Saarbrücken, DE)
- Julien Forget (University of Lille, FR)
- Anna Friebe (Mälardalen University - Västerås, SE)
- Chris Gill (Washington University - St. Louis, US)
- Ian Gray (University of York, GB)
- Arpan Gujarati (University of British Columbia - Vancouver, CA)
- Robin Hapka (TU Braunschweig, DE)
- Mathieu Jan (CEA LIST - Gif-sur-Yvette, FR)
- Victor Jegu (Airbus S.A.S. - Toulouse, FR)
- Eric Jenn (IRT Antoine de Saint Exupéry - Toulouse, FR)
- Mitra Nasri (TU Eindhoven, NL)
- Geoffrey Nelissen (TU Eindhoven, NL)
- Catherine Nemitz (Davidson College, US)
- Claire Pagetti (ONERA - Toulouse, FR)
- Sri Parameswaran (University of Sydney, AU)
- Rodolfo Pellizzoni (University of Waterloo, CA)
- Kevin Quinn (General Dynamics - St Leonards on Sea, GB)
- Jan Reineke (Universität des Saarlandes - Saarbrücken, DE)
- Benjamin Rouxel (University of Modena, IT)
- Selma Saidi (TU Dortmund, DE)
- Matheus Schuh (Kalray - Montbonnot-Saint-Martin, FR)
- Zoë Stephenson (Rapita Systems Ltd. - York, GB)
- Jürgen Teich (Universität Erlangen-Nürnberg, DE)
- Georg von der Brüggen (TU Dortmund, DE)
- Bryan Ward (Vanderbilt University - Nashville, US)
- Reinhard Wilhelm (Universität des Saarlandes - Saarbrücken, DE)
- Houssam-Eddine Zahaf (University of Nantes, FR)
Classification
- Hardware Architecture
- Other Computer Science
Keywords
- EDA and Micro-Architectures
- Safety-Critical Applications
- Middleware
- Multi-core