Dagstuhl Seminar 24242: Computational Analysis and Simulation of the Human Voice

Dagstuhl Seminar 24242

Computational Analysis and Simulation of the Human Voice

( Jun 09 – Jun 14, 2024 )

(Click in the middle of the image to enlarge)

Permalink

Please use the following short url to reference this page: https://www.dagstuhl.de/24242

Organizers

Peter Birkholz (TU Dresden, DE)
Oriol Guasch Fortuny (Ramon Llul University - Barcelona, ES)
Nathalie Henrich Bernardoni (University Grenoble Alpes, FR)
Sten Ternström (KTH Royal Institute of Technology - Stockholm, SE)

Contact

Marsha Kleinbauer (for scientific matters)
Susanne Bach-Bernhard (for administrative matters)

Publications

Sten Ternström, Nathalie Henrich Bernardoni, Peter Birkholz, Oriol Guasch, and Amelia Gully. Computational Analysis and Simulation of the Human Voice (Dagstuhl Seminar 24242). In Dagstuhl Reports, Volume 14, Issue 6, pp. 84-107, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Schedule

Schedule

Summary

Show Summary

The human voice is able to produce a very rich set of different sounds, making it the single most important channel for communication human-to-human, and also potentially for human-computer interaction. Spoken communication can be thought of as a stack of layered transport protocols that includes language, speech, voice, and sound. This Dagstuhl Seminar was concerned with the voice and its function as a transducer from neurally encoded speech patterns to sound. This very complex mechanism remains insufficiently explained both in terms of analysing voice sounds, as for example in medical assessment of vocal function, and of simulating them from first principles, as in talking or singing machines. There were four main themes to the seminar:

Voice Analysis. Measures derived from voice recordings are clinically attractive, being non-invasive and relatively inexpensive. For clinical voice assessment, however, quantitative objective measures of vocal status have been researched for some seven decades, yet perceptual assessment by listening is still the dominating method. Isolating the properties of a voice (the machine) from those of its owner’s speech or singing (the process) is far from trivial. Computational approaches are expected to facilitate a functional decomposition that can advance beyond conventional cut-off values of metrics and indices.

Voice Visualization. Trained listeners can deduce some of what is going on in the larynx and the vocal tract, but we cannot easily see it or document it. The multidimensionality of the voice poses interesting challenges to the making of effective visualizations. Most current visualizations are textbook transforms of the acoustic signal, but they are not as clinically or pedagogically relevant as they could be. Can functionally or perceptually informed visualizations improve on this situation?

Voice Simulation. Balancing low- and high-order models. A “complete” physics-based computational model of the voice organ would have to account for bidirectional energy exchange between fluids and moving structures at high temporal and spatial resolutions, in 3D. Computational brute force is still not an option to represents voice production in all its complexity, and a proper balance between high and low order approaches has to be found. We discussed strategies for choosing effective partitionings or hybrids of the simulation tasks that could be suitable for specific sub-problems.

Data science and voice research. With today’s machine learning and deep neural network methods, end-to-end systems for both text-to-speech and speech recognition have become remarkably successful, but they remain quite ignorant of the basics of vocal function. Yet machine learning and big data science approaches should be very useful for helping us deal with and account for the variability in voices. Rather than seeking for automated discrimination between normal and pathological voice, clinicians wish for objective assessments of the progress of an intervention, while researchers wish for ways to distil succinct models of voice production from multi-modal big-data observations. We have explored how techniques such as domain-specific feature selection and auto-encoding can make progress toward these goals.

This seminar has resulted in (1) shared knowledge and data about the science of voice from the perspectives of scientists in fields as diverse as computer science, voice pathology and therapy, clinicians, acoustics and audio engineering, electronics, musicology, speech and hearing sciences, physics and mathematics, (2) identifying areas of common interest where significant progress is being made and needs to be made, such as individual voice variability, physical replicas for modelling and validation, synthesis and computational modelling, motor control, and availability of data and resources, (3) sharing and discussing failures to learn lessons and ideas for future developments, and (4) envisioning the future of progress in human voice analysis and simulation in the medium to long term: what is needed to make a big leap forward in this field? These ideas will be captured by the publication of a collaborative article in a leading voice journal.

Creative Commons BY 4.0

Sten Ternström, Nathalie Henrich Bernardoni, Oriol Guasch, and Peter Birkholz

Motivation

Show Motivation

The human voice is able to produce a very rich set of different sounds, making it the single most important channel for communication human-to-human, and also potentially for human-computer interaction. Spoken communication can be thought of as a stack of layered transport protocols that includes language, speech, voice, and sound. In this Dagstuhl seminar, we will be concerned with the voice and its function as a transducer from neurally encoded speech patterns to sound. This very complex mechanism remains insufficiently explained both in terms of analysing voice sounds, as for example in medical assessment of vocal function, and of simulating them from first principles, as in talking or singing machines. There will be four main themes to the seminar:

Voice Analysis: Measures derived from voice recordings are clinically attractive, being non-invasive and relatively inexpensive. For clinical voice assessment, however, quantitative objective measures of vocal status have been researched for some seven decades, yet perceptual assessment by listening is still the dominating method. Isolating the properties of a voice (the machine) from those of its owner’s speech or singing (the process) is far from trivial. We will explore how computational approaches might facilitate a functional decomposition that can advance beyond conventional cut-off values of metrics and indices.

Voice Visualization: Trained listeners can deduce some of what is going on in the larynx and the vocal tract, but we cannot easily see it or document it. The multidimensionality of the voice poses interesting challenges to the making of effective visualizations. Most current visualizations are textbook transforms of the acoustic signal, but they are not as clinically or pedagogically relevant as they might be. Can functionally or perceptually informed visualizations improve on this situation?

Voice Simulation: balancing low- and high-order models. A “complete” physics-based computational model of the voice organ would have to account for bidirectional energy exchange between fluids and moving structures at high temporal and spatial resolutions, in 3D. Computational brute force is still not an option to represents voice production in all its complexity, and a proper balance between high and low order approaches has to be found. We will discuss strategies for choosing effective partitionings or hybrids of the simulation tasks that could be suitable for specific sub-problems.

Data science and voice research: With today’s machine learning and deep neural network methods, end-to-end systems for both text-to-speech and speech recognition have become remarkably successful, but they remain quite ignorant of the basics of vocal function. Yet machine learning and big data science approaches should be very useful for helping us deal with and account for the variability in voices. Rather than seeking for automated discrimination between normal and pathological voice, clinicians wish for objective assessments of the progress of an intervention, while researchers wish for ways to distil succinct models of voice production from multi-modal big-data observations. We will explore how techniques such as domain-specific feature selection and auto-encoding can make progress toward these goals.

We expect that this seminar will result in (1) leading researchers in the vocological community becoming up-to-date on recent computational advances, (2) seasoned computer scientists and data analysts becoming engaged in voice-related challenges, (3) a critical review of the potentials and limitations of deep learning and computational mechanics techniques, as applied to analysis and simulation of the voice, and (4) a week of creative brainstorming, leading to a roadmap for pursuing outstanding issues in computational voice research.

Creative Commons BY 4.0

Peter Birkholz, Oriol Guasch Fortuny, Nathalie Henrich Bernardoni, and Sten Ternström

Participants

Show Participants

Philipp Aichinger (Medizinische Universität Wien, AT) [dblp]
Marc Arnela (Ramon Llul University, ES) [dblp]
Lucie Bailly (Université Grenoble Alpes - Saint Martin d'Hères, FR)
Peter Birkholz (TU Dresden, DE) [dblp]
Meike Brockmann-Bauser (Universitätsspital Zürich, CH)
Helena Daffern (University of York, GB) [dblp]
Michael Döllinger (Universitätsk-Klinikum Erlangen, DE) [dblp]
Mennatallah El-Assady (ETH Zürich, CH) [dblp]
Sidney Fels (University of British Columbia - Vancouver, CA) [dblp]
Mario Fleischer (Charité - Berlin, DE) [dblp]
Andrés Goméz-Rodellar (NeuSpeLab - Las Rozas de Madrid, ES) [dblp]
Pedro Gomez-Vilda (NeuSpeLab - Las Rozas de Madrid, ES) [dblp]
Oriol Guasch Fortuny (Ramon Llul University - Barcelona, ES) [dblp]
Amelia Gully (University of York, GB) [dblp]
Nathalie Henrich Bernardoni (University Grenoble Alpes, FR) [dblp]
Eric Hunter (University of Iowa, US)
Filipa M.B. Lã (UNED - Madrid, ES) [dblp]
Yves Laprie (LORIA - Nancy, FR) [dblp]
Sarah Lehoux (UCLA, US)
Matthias Miller (ETH Zürich, CH) [dblp]
Scott Reid Moisik (Nanyang TU - Singapore, SG) [dblp]
Peter Pabon (Utrecht, NL) [dblp]
Jean Schoentgen (Free University of Brussels, BE) [dblp]
Brad Story (University of Arizona - Tucson, US) [dblp]
Johan Sundberg (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
Sten Ternström (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
Tino Weinkauf (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
Qian Xue (Rochester Institute of Technology, US)
Zhaoyan Zhang (UCLA, US) [dblp]

Classification

Machine Learning
Sound

Keywords

voice analysis
voice simulation
health care
visualization

Seminar 24242

Search the Dagstuhl Website

Schloss Dagstuhl Services

Seminars

Within this website:

External resources:

Publishing

Within this website:

External resources:

dblp

Within this website:

External resources:

Dagstuhl Seminar 24242

Computational Analysis and Simulation of the Human Voice

( Jun 09 – Jun 14, 2024 )

Permalink

Organizers

Contact

Publications

Schedule

Summary

Motivation

Participants

Classification

Keywords