Dagstuhl Seminar 22342
Privacy in Speech and Language Technology
(Aug 21 – Aug 26, 2022)
Organizers
- Simone Fischer-Hübner (Karlstad University, SE)
- Dietrich Klakow (Universität des Saarlandes, DE)
- Peggy Valcke (KU Leuven, BE)
- Emmanuel Vincent (INRIA Nancy - Grand Est, FR)
Contact
- Michael Gerke (for scientific matters)
- Simone Schilke (for administrative matters)
Summary
In the last few years, voice assistants have become the preferred means of interacting with smart devices and services. Chatbots and related language technologies such as machine translation or typing prediction are also widely used. These technologies often rely on cloud-based machine learning systems trained on speech or text data collected from the users. The recording, storage and processing of users' speech or text data raise severe privacy threats. This data contains a wealth of personal information about, e.g., the personality, ethnicity and health state of the user, which may be (mis)used for targeted processing or advertisement. It also includes information about the user's identity, which could be exploited by an attacker to impersonate them. News articles exposing these threats to the general public have made national headlines.
A new generation of privacy-preserving speech and language technologies is needed that ensures user privacy while still providing users with the same benefits and companies with the training data needed to develop these technologies. Recent regulations such as the European General Data Protection Regulation (GDPR), which promotes the principle of privacy-by-design, have further fueled interest. Yet, efforts in this direction have suffered from the lack of collaboration across research communities. This Dagstuhl Seminar was the first event to bring 6 relevant disciplines and communities together: Speech Processing, Natural Language Processing, Privacy Enhancing Technologies, Machine Learning, Human Factors, and Law.
After 6 tutorials given from the perspective of each of these 6 disciplines, the attendees gathered into cross-disciplinary working groups on 6 topics. The first group analyzed the privacy threats and the level of user control for a few case studies. The second group focused on anonymization of unstructured speech data and discussed the legal validity of the success measures developed in the speech processing literature. The third group paid special attention to vulnerable groups of users with regard to the current laws in various countries. The fifth group tackled the design of privacy attacks against speech and text data. Finally, the sixth group explored the legal interpretation of emerging privacy enhancing technologies.
The reports of these 6 working groups, which are gathered in the following, constitute the major result of the seminar. We consider them a first step towards a full-fledged interdisciplinary roadmap for the development of private-by-design speech and language technologies addressing societal and industrial needs.
Motivation
In the last few years, voice assistants have become the preferred means of interacting with smart devices and services. Chatbots and related technologies such as automated translation or typing prediction are also widely used. These technologies often rely on cloud-based machine learning systems trained on speech or text data collected from the users.
The recording, storage and processing of users' speech or text data raise severe privacy threats. This data contains a wealth of personal information about, e.g., the personality, ethnicity and health state of the user, which may be (mis)used for targeted processing or advertisement. It also includes information about the user's identity, which could be exploited by an attacker to impersonate them. News articles exposing these threats to the general public have made national headlines.
A new generation of privacy-preserving speech and language technologies is needed that ensures user privacy while still providing users with the same benefits and companies with the training data needed to develop these technologies. Recent regulations such as the European General Data Protection Regulation (GDPR), which promotes the principle of privacy-by-design, have further fueled interest. Yet, efforts in this direction have suffered from the lack of collaboration across research communities. These efforts include the development of encryption tools such as homomorphic encryption and secure multiparty computation, machine learning frameworks such as federated or decentralized learning, and anonymization techniques targeting speech and language specifically. Privacy in speech and language technology has also recently attracted the interest of law researchers and data protection authorities.
To the best of our knowledge, this Dagstuhl Seminar will be the first event that aims to bring together academic researchers, industry representatives, and policy makers in the fields of speech processing, natural language processing, privacy-enhancing technologies (PETs), machine learning, and law and ethics, in order to devise cross-disciplinary solutions. The questions to be addressed include (but are not limited to) the following:
- What are the threats to privacy arising from the recording, storage and processing of user-generated speech and language data? What is their probability of occurrence and their impact?
- What are the related ethical and moral issues?
- How shall those threats be translated into actionable, formal privacy models? Do existing general-purpose privacy models apply or are new domain-specific models needed?
- Which existing PETs can be leveraged to address privacy requirements regarding raw speech and language data? How shall they be combined into holistic solutions?
- How should secondary data, e.g., models trained on raw data, be treated?
- Which new PETs are being developed? Can they benefit from cross-disciplinary collaboration?
- What privacy goals can these PETs achieve? Which metrics shall be used to assess their success?
- How shall these PETs be implemented in practice, so as to provide transparent information and management capabilities to the users? How can formal guarantees be made and explained?
- What are the expected limitations of these PETs? What is the research roadmap to address them?
- How will privacy laws affect these new developments? Conversely, how will they be impacted by these new developments?
The Dagstuhl Seminar will consist of a mix of plenary talks and subgroup discussions aiming to achieve a shared understanding of problems and solutions and to sketch a cross-disciplinary roadmap, which we hope to publish as a joint position paper. In addition, there will be multiple breaks for invitees to socialize and form new cross-disciplinary collaborations.
D. Klakow and E. Vincent acknowledge support from the European Union's Horizon 2020 Research and Innovation Program within project COMPRISE "Cost-effective, multilingual, privacy-driven voice-enabled services" (www.compriseh2020.eu).
Participants
- Lydia Belkadi (KU Leuven, BE)
- Zinaida Benenson (Universität Erlangen-Nürnberg, DE)
- Martine De Cock (University of Washington - Tacoma, US)
- Abdullah Elbi (KU Leuven, BE)
- Zekeriya Erkin (TU Delft, NL)
- Natasha Fernandes (Macquarie University - Sydney, AU)
- Simone Fischer-Hübner (Karlstad University, SE)
- Ivan Habernal (TU Darmstadt, DE)
- Meiko Jensen (Karlstad University, SE)
- Els Kindt (KU Leuven, BE)
- Dietrich Klakow (Universität des Saarlandes, DE)
- Katherine Lee (Google - New York, US)
- Anna Leschanowsky (Fraunhofer IIS - Erlangen, DE)
- Pierre Lison (Norwegian Computing Center, NO)
- Christina Lohr (Friedrich-Schiller-Universität Jena, DE)
- Emily Mower Provost (University of Michigan - Ann Arbor, US)
- Andreas Nautsch (Université d'Avignon, FR)
- Olya Ohrimenko (The University of Melbourne, AU)
- Jo Pierson (Hasselt University & VU Brussels, BE)
- Laurens Sion (KU Leuven, BE)
- David Stevens (Autorité de la protection des données - Brussels, BE)
- Francisco Teixeira (INESC-ID - Lisboa, PT)
- Natalia Tomashenko (Université d'Avignon, FR)
- Marc Tommasi (University of Lille, FR)
- Peggy Valcke (KU Leuven, BE)
- Emmanuel Vincent (INRIA Nancy - Grand Est, FR)
- Shomir Wilson (Pennsylvania State University, US)
Classification
- Computation and Language
- Computers and Society
- Cryptography and Security
Keywords
- Speech and language technology
- Privacy
- Data protection
- Privacy-enhancing technologies
- Law and policy