Dagstuhl Seminar 19021
Joint Processing of Language and Visual Data for Better Automated Understanding
(Jan 06 – Jan 11, 2019)
Organizers
- Yun Fu (Northeastern University - Boston, US)
- Marie-Francine Moens (KU Leuven, BE)
- Lucia Specia (Imperial College London, GB)
- Tinne Tuytelaars (KU Leuven, BE)
Contact
- Shida Kunz (for scientific matters)
- Jutka Gasiorowski (for administrative matters)
Motivation
The joint processing of language and visual data has recently received considerable attention. This emerging research field is stimulated by the active development of deep learning algorithms: deep neural networks (DNNs) offer numerous opportunities to learn mappings between the visual and language modalities and to learn multimodal representations of content. Deep learning has also become a standard approach for automated image and video captioning and for visual question answering, the former referring to the automated description of images or video in natural language sentences, the latter to the automated formulation of a natural language answer to a natural language question about an image.
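As one concrete illustration of the kind of cross-modal mapping mentioned above, the following sketch embeds precomputed image features and caption word ids into one shared space, where matching pairs can be trained to lie close together (e.g. with a contrastive loss). It is not taken from the seminar material; the module names, dimensions and the PyTorch dependency are assumptions made for illustration only.

```python
# Minimal illustrative sketch (assumes PyTorch): map images and captions into
# one shared "multimodal" embedding space. All names and sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTextEmbedder(nn.Module):
    def __init__(self, img_feat_dim=2048, vocab_size=10000, embed_dim=256):
        super().__init__()
        # Visual branch: project precomputed CNN features into the shared space.
        self.img_proj = nn.Linear(img_feat_dim, embed_dim)
        # Language branch: embed word ids and average them (a deliberately simple encoder).
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.txt_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, img_feats, word_ids):
        img_vec = F.normalize(self.img_proj(img_feats), dim=-1)
        txt_vec = self.word_embed(word_ids).mean(dim=1)        # average word embeddings
        txt_vec = F.normalize(self.txt_proj(txt_vec), dim=-1)
        return img_vec, txt_vec

model = ImageTextEmbedder()
img = torch.randn(4, 2048)               # a batch of (hypothetical) image features
txt = torch.randint(0, 10000, (4, 12))   # a batch of caption word ids
iv, tv = model(img, txt)
similarity = iv @ tv.t()                 # 4x4 image-caption similarity matrix
```

Trained with matching image-caption pairs, such a shared space supports both captioning-style retrieval and cross-modal search, which is one reason dual-encoder set-ups of this kind are a common starting point.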
Apart from aiding image understanding and the indexing and search of image and video data through natural language descriptions, the field of jointly processing language and visual data builds algorithms for grounded language processing, in which the meaning of natural language is grounded in perception and/or actions in the world. Grounded language processing contributes to automated language understanding and to machine translation. It has recently been shown that visual data provide world and common-sense knowledge that is needed for automated language understanding.
Joint processing of language and visual data is also interesting from a theoretical point of view: for developing theories on the complementarity of such data in human(-machine) communication, for developing suitable algorithms for learning statistical knowledge representations informed by visual and language data, and for performing inference with these representations.
Given the current trend and results of multimodal (language and vision) research, it can safely be assumed that the joint processing of language and visual data will only gain in importance in the future. During the seminar we will discuss theories, methodologies and real-world technologies for the joint processing of language and vision, particularly in the following research areas:
- Theories of integrated modelling and representation learning of language and vision for computer vision and natural language processing tasks;
- Fusion and inference based on visual, language and multimodal representations;
- Generation of image and video descriptions and understanding of imagery;
- Human language grounding and language understanding;
- Search and retrieval aided by multimodal processing and representations;
- Machine translation aided by multimodal processing and representations.
These areas raise the following research questions, to be discussed during the seminar (a non-exhaustive list):
- Which machine learning architectures will be best suited for the above tasks?
- How to learn multimodal representations that are relational and structured, so as to support structured understanding?
- How to generalize to categories for which few or zero training examples are available (see the sketch after this list)?
- How to learn from limited paired data while exploiting monomodal models trained on visual or language data alone?
- How to explain neural networks trained for image or language understanding?
- How to disentangle representations, that is, factorize them into their separate factors of variation and discover the meaning of each factor?
- How to learn continuous representations that describe semantics and that integrate world and common-sense knowledge?
- How to reason with the continuous representations?
- How to translate content from one modality to another?
- What would be effective novel evaluation metrics?
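To make the few/zero-example question above more concrete, here is a minimal, purely illustrative sketch of one common zero-shot strategy: map image features into the same space as class descriptions (attribute vectors or word embeddings of class names), so that classes unseen during training can still be scored. It is not seminar material; the names, dimensions and the PyTorch dependency are assumptions.

```python
# Illustrative zero-shot scoring sketch (assumes PyTorch); all names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroShotScorer(nn.Module):
    def __init__(self, img_feat_dim=2048, sem_dim=300):
        super().__init__()
        # Learned mapping from image features to the semantic (class-description) space.
        self.to_semantic = nn.Linear(img_feat_dim, sem_dim)

    def forward(self, img_feats, class_embeddings):
        # img_feats: (B, img_feat_dim); class_embeddings: (C, sem_dim), one row per class,
        # e.g. attribute vectors or word embeddings of the class names.
        img_sem = F.normalize(self.to_semantic(img_feats), dim=-1)
        cls_sem = F.normalize(class_embeddings, dim=-1)
        return img_sem @ cls_sem.t()   # (B, C) compatibility scores; argmax = predicted class

scorer = ZeroShotScorer()
scores = scorer(torch.randn(2, 2048), torch.randn(5, 300))  # 5 classes, possibly unseen
predicted = scores.argmax(dim=-1)
```

Because the classifier only needs a description of each class rather than labelled images, new classes can be added at test time simply by adding rows to the class-embedding matrix.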
This Dagstuhl Seminar will bring together an interdisciplinary group of researchers from computer vision, natural language processing, machine learning and artificial intelligence to discuss the latest scientific advances and to develop a roadmap and research agenda.
Summary
During the seminar we discussed theories, methodologies and real-world technologies for the joint processing of language and vision, particularly in the following research areas:
- Theories of integrated modelling and representation learning of language and vision for computer vision and natural language processing tasks;
- Explainability and interpretability of the learned representations;
- Fusion and inference based on visual, language and multimodal representations (a minimal illustrative sketch follows this list);
- Understanding human language and visual content;
- Generation of language and visual content;
- Relation to human learning;
- Datasets and tasks.
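As a pointer for the fusion item above, the following sketch combines an image vector and a question vector by elementwise product and classifies over a closed answer set, in the spirit of simple visual question answering baselines. It is illustrative only; the module names, sizes and the fixed answer vocabulary are assumptions, and PyTorch is assumed to be available.

```python
# Illustrative multimodal fusion sketch (assumes PyTorch); all names are hypothetical.
import torch
import torch.nn as nn

class SimpleFusionVQA(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=512, hidden=512, num_answers=1000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(hidden, num_answers),   # scores over a closed answer vocabulary
        )

    def forward(self, img_vec, txt_vec):
        # Elementwise-product fusion: one of the simplest ways to combine the modalities.
        fused = self.img_proj(img_vec) * self.txt_proj(txt_vec)
        return self.classifier(fused)

model = SimpleFusionVQA()
logits = model(torch.randn(3, 2048), torch.randn(3, 512))   # (3, 1000) answer scores
```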
The discussions sought answers to the (non-exhaustive) list of research questions posed in the motivation above.
This Dagstuhl Seminar brought together an interdisciplinary group of researchers from computer vision, natural language processing, machine learning and artificial intelligence to discuss the latest scientific advances and to develop a roadmap and research agenda for the field.
Participants
- Zeynep Akata (University of Amsterdam, NL) [dblp]
- Andrei Barbu (MIT - Cambridge, US) [dblp]
- Loïc Barrault (Université du Mans, FR) [dblp]
- Raffaella Bernardi (University of Trento, IT) [dblp]
- Thales Bertaglia (University of Sheffield, GB) [dblp]
- Ozan Caglayan (Université du Mans, FR) [dblp]
- Stephen Clark (Google DeepMind - London, GB) [dblp]
- Luísa Coheur (INESC-ID - Lisbon, PT) [dblp]
- Guillem Collell (KU Leuven, BE) [dblp]
- Vera Demberg (Universität des Saarlandes, DE) [dblp]
- Desmond Elliott (University of Copenhagen, DK) [dblp]
- Aykut Erdem (Hacettepe University - Ankara, TR) [dblp]
- Erkut Erdem (Hacettepe University - Ankara, TR) [dblp]
- Raquel Fernández (University of Amsterdam, NL) [dblp]
- Orhan Firat (Google Inc. - Mountain View, US) [dblp]
- Anette Frank (Universität Heidelberg, DE) [dblp]
- Stella Frank (University of Edinburgh, GB) [dblp]
- Lisa Anne Hendricks (University of California - Berkeley, US) [dblp]
- David C. Hogg (University of Leeds, GB) [dblp]
- Frank Keller (University of Edinburgh, GB) [dblp]
- Douwe Kiela (Facebook - New York, US) [dblp]
- Dietrich Klakow (Universität des Saarlandes, DE) [dblp]
- Chiraag Lala (University of Sheffield, GB) [dblp]
- Marius Leordeanu (University Politehnica of Bucharest, RO) [dblp]
- Jindrich Libovický (Charles University - Prague, CZ) [dblp]
- Pranava Madhyastha (Imperial College London, GB) [dblp]
- Florian Metze (Carnegie Mellon University - Pittsburgh, US) [dblp]
- Marie-Francine Moens (KU Leuven, BE) [dblp]
- Siddharth Narayanaswamy (University of Oxford, GB) [dblp]
- Jean Oh (Carnegie Mellon University - Pittsburgh, US) [dblp]
- Pavel Pecina (Charles University - Prague, CZ) [dblp]
- Bernt Schiele (MPI für Informatik - Saarbrücken, DE) [dblp]
- Carina Silberer (UPF - Barcelona, ES) [dblp]
- Lucia Specia (Imperial College London, GB) [dblp]
- Tinne Tuytelaars (KU Leuven, BE) [dblp]
- Jakob Verbeek (INRIA - Grenoble, FR) [dblp]
- David Vernon (CMU Africa - Kigali, RW) [dblp]
- Josiah Wang (University of Sheffield, GB) [dblp]
Classification
- artificial intelligence / robotics
- computer graphics / computer vision
- multimedia
Keywords
- Image and video captioning
- human language grounding
- visual understanding
- language understanding
- language generation
- generation of visuals
- world and common sense knowledge
- deep learning