Dagstuhl Seminar 15081
Holistic Scene Understanding
(Feb 15 – Feb 20, 2015)
Organizers
- Jiri Matas (Czech Technical University, CZ)
- Vittorio Murino (Italian Institute of Technology - Genova, IT)
- Bodo Rosenhahn (Leibniz Universität Hannover, DE)
Coordinator
- Laura Leal-Taixé (ETH Zürich, CH)
Contact
- Susanne Bach-Bernhard (for administrative matters)
Understanding the scene in an image or video requires much more than recording and storing it, extracting some features, and eventually recognizing objects. Ultimately, the goal is to find a mapping that derives semantic information from sensor data. Moreover, purposive scene understanding may require different representations for different tasks, and the task itself can drive the subsequent data processing. Nevertheless, the local, global and dynamic aspects of the acquired observations still need to be captured in order to understand what is occurring in a scene. For example, one might want to determine from an image whether a person is present and where, and beyond that, to estimate the person's specific pose, e.g., whether the person is sitting, walking or raising a hand. When people move in a scene, the specific time (e.g., 7:30 in the morning, workdays, weekend), the weather (e.g., rain), objects (e.g., cars, a bus approaching a bus stop, crossing bikes) and surrounding people (crowded, fast-moving people) yield a mixture of low-level, high-level and abstract cues, which need to be analyzed jointly to gain a profound understanding of the scene. In other words, all information that can be extracted from a scene must be considered in context to reach a comprehensive scene understanding; this information, while easily captured by humans, is still difficult for a machine to obtain.
Next-generation recognition systems require a full, holistic understanding of the scene components and their dynamics in order to cope more effectively with real applications such as driver assistance, urban design, surveillance, and many others.
With such topics in mind, the aim of this seminar is to discuss which elements are necessary and sufficient for a complete scene understanding, i.e., what it really means to understand a scene. Specifically, in this seminar, we want to explore methods that are capable of representing a scene at different levels of semantic granularity and of modeling various degrees of interaction between objects, humans and 3D space. For instance, a scene-object interaction describes the way a scene type (e.g., a dining room or a bedroom) influences the probability of an object's presence, and vice versa. The 3D layout of the environment (e.g., walls, floors, etc.) biases the placement of objects and humans in the scene, and also affects the way they interact. An object-object or object-human interaction describes the way objects, humans and their poses affect each other (e.g., a dining table suggests that a set of chairs is to be found around it). In other words, the 3D configuration of the environment and the relative placements and poses of the objects and humans therein, the associated dynamics (relative distance, human body posture and gesture, gazing, etc.), as well as other contextual information (e.g., weather, temperature, etc.) support the holistic understanding of the observed scene. Since many scenes involve humans, we are also interested in discussing novel methods for analyzing group activities and human interactions at different levels of spatial and semantic resolution.
In this sense, understanding a visual scene requires multidisciplinary discussions between scientists in Computer Vision and Machine Learning, but also in Robotics, Computer Graphics, Mathematics, Natural Language Processing and Cognitive Sciences. Additionally, disciplines like Psychology, Anthropology, Sociology, Linguistics and Neuroscience touch upon this problem, which is inherent in the human comprehension of the environment and our social lives. These communities rarely get the opportunity to share their views on this common topic.
We will gather not only researchers well known in Computer Vision areas such as object detection, classification, motion segmentation, crowd and group behavior analysis or 3D scene reconstruction, but also people affiliated with Computer Vision from the aforementioned communities, in order to share each other's points of view on the common topic of scene understanding.
Motivations
Understanding a scene in a given image or video involves much more than simply recording and storing it, extracting some features and eventually recognizing an object. The overall goal is to find a mapping that derives semantic information from sensor data. Purposive scene understanding may require a different representation for each specific task. The task itself can be used as a prior, but we still require an in-depth understanding of, and a balance between, the local, global and dynamic aspects that can occur within a scene. For example, an observer might want to determine from an image whether a person is present and, beyond that, whether more information can be extracted, e.g. whether the person is sitting, walking or raising a hand.
When people move in a scene, the specific time (e.g. 7:30 in the morning, workdays, weekend), the weather (e.g. rain), objects (cars, a bus approaching a bus stop, crossing bikes, etc.) and surrounding people (crowded, fast-moving people) yield a mixture of low-level, high-level and abstract cues, which need to be analyzed jointly to gain an in-depth understanding of a scene. In other words, the so-called context has to be considered for a comprehensive scene understanding; this information, while easily captured by human beings, is still difficult for a machine to obtain.
Holistic scene interpretation is crucial for designing the next generation of recognition systems, which are important for several applications, e.g. driver assistance, city modeling and reconstruction, outdoor motion capture and surveillance.
With such topics in mind, the aim of this workshop was to discuss which elements are necessary and sufficient for a complete scene understanding, i.e. what it really means to understand a scene. Specifically, in this workshop, we wanted to explore methods that are capable of representing a scene at different levels of semantic granularity and of modeling various degrees of interaction between objects, humans and 3D space. For instance, a scene-object interaction describes the way a scene type (e.g. a dining room or a bedroom) influences the presence of objects, and vice versa. An object-3D-layout or human-3D-layout interaction describes the way the 3D layout (e.g. the 3D configuration of walls, floor and observer's pose) biases the placement of objects or humans in the image, and vice versa. An object-object or object-human interaction describes the way objects, humans and their poses affect each other (e.g. a dining table suggests that a set of chairs is to be found around it). In other words, the 3D configuration of the environment and the relative placements and poses of the objects and humans therein, the associated dynamics (relative distance, human body posture and gesture, gazing, etc.), as well as other contextual information (e.g. weather, temperature, etc.) support the holistic understanding of the observed scene.
As part of a larger system, understanding a scene semantically and functionally makes it possible to predict the presence and locations of unseen objects within the space, and thus to predict behaviors and activities that are yet to be observed. Combining predictions at multiple levels into a global estimate can improve each individual prediction.
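The scene-object interaction described above can be illustrated with a minimal Python sketch. It is not a method presented at the seminar: it assumes a naive-Bayes model in which objects are conditionally independent given the scene type, and every name and probability below (`SCENE_PRIOR`, `OBJECT_GIVEN_SCENE`, `scene_posterior`, `object_presence`) is hypothetical and chosen only for illustration.

```python
from math import prod

# Hypothetical prior over scene types (illustrative numbers only).
SCENE_PRIOR = {"dining room": 0.5, "bedroom": 0.5}

# Hypothetical P(object present | scene type).
OBJECT_GIVEN_SCENE = {
    "dining room": {"table": 0.9, "chair": 0.9, "bed": 0.05},
    "bedroom": {"table": 0.3, "chair": 0.4, "bed": 0.95},
}


def scene_posterior(detected):
    """P(scene | detected objects), assuming objects are independent given the scene."""
    scores = {
        scene: SCENE_PRIOR[scene]
        * prod(OBJECT_GIVEN_SCENE[scene][obj] for obj in detected)
        for scene in SCENE_PRIOR
    }
    total = sum(scores.values())
    return {scene: score / total for scene, score in scores.items()}


def object_presence(obj, detected):
    """P(obj present | detected objects), marginalising over the scene type."""
    posterior = scene_posterior(detected)
    return sum(posterior[s] * OBJECT_GIVEN_SCENE[s][obj] for s in posterior)


if __name__ == "__main__":
    seen = ["table", "chair"]
    print(scene_posterior(seen))         # objects inform the scene type
    print(object_presence("bed", seen))  # scene type informs an unseen object
```

Under these assumed numbers, detecting a table and a chair shifts the posterior towards the dining room, which in turn lowers the estimated probability that an unseen bed is present. This is the two-way influence between scene type and object presence in its simplest form.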
Since most scenes involve humans, we were also interested in discussing novel methods for analyzing group activities and human interactions at different levels of spatial and semantic resolution. As advocated in recent literature, it is beneficial to solve the problems of tracking individuals and understanding their activities jointly, by combining bottom-up evidence with top-down reasoning, rather than attacking these two problems in isolation.
Top-down constraints can provide critical contextual information for establishing accurate associations between detections across frames and, thus, for obtaining more robust tracking results. Bottom-up evidence can percolate upwards so as to automatically infer action labels for determining the activities of individual actors, interactions among individuals and complex group activities. Beyond this, it is the cooperation of both data flows that makes inference more manageable and reliable, and thereby improves the comprehension of a scene.
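As a rough sketch of how a top-down constraint can change a frame-to-frame association, the Python example below adds a penalty for deviating from an assumed group velocity to a purely distance-based (bottom-up) cost. This is an illustrative toy under stated assumptions, not an algorithm from the seminar: the cost function, the `weight` parameter, the shared `group_velocity`, and the brute-force search over assignments are all hypothetical simplifications.

```python
from itertools import permutations
from math import hypot


def association_cost(track, detection, group_velocity, weight):
    """Bottom-up distance plus a weighted top-down penalty (hypothetical cost)."""
    dx = detection[0] - track[0]
    dy = detection[1] - track[1]
    # Bottom-up evidence: how far the detection is from the track's last position.
    bottom_up = hypot(dx, dy)
    # Top-down constraint: the displacement should agree with the group's motion.
    top_down = hypot(dx - group_velocity[0], dy - group_velocity[1])
    return bottom_up + weight * top_down


def associate(tracks, detections, group_velocity, weight=1.0):
    """Return, for each track, the index of its detection in the cheapest assignment.

    Brute force over all assignments; only sensible for a handful of targets.
    """
    return min(
        permutations(range(len(detections)), len(tracks)),
        key=lambda perm: sum(
            association_cost(tracks[i], detections[j], group_velocity, weight)
            for i, j in enumerate(perm)
        ),
    )


if __name__ == "__main__":
    tracks = [(0.0, 0.0), (1.0, 0.0)]      # last known positions
    detections = [(1.0, 0.0), (2.0, 0.6)]  # second detection is noisy
    velocity = (1.0, 0.0)                  # assumed common group motion per frame
    print(associate(tracks, detections, velocity, weight=0.0))  # distance only
    print(associate(tracks, detections, velocity, weight=1.0))  # with group prior
```

In this constructed example, the distance-only cost swaps the two identities because one detection is noisy, while the cost that includes the group-motion term recovers the assignment in which both people move with the group.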
We gathered not only researchers who are well known in Computer Vision areas such as object detection, classification, motion segmentation, crowd and group behavior analysis or 3D scene reconstruction, but also people affiliated with Computer Vision from other communities, in order to share each other's points of view on the common topic of scene understanding.
Goals
The main goals of the seminar can be summarized as follows:
- Address holistic scene understanding, a topic that has not been discussed in detail at previous seminars, with a special focus on a multidisciplinary perspective for sharing and contrasting the different views.
- Gather well-known researchers from the Computer Vision, Machine Learning, Social Sciences (e.g. Cognitive Psychology), Neuroscience, Robotics and Computer Graphics communities to compare approaches to representing scene geometry, dynamics and constraints, as well as the problem and task formulations adopted in these fields. The interdisciplinary scientific exchange is likely to enrich the communities involved.
- Create a platform for discussing and bridging topics like perception, detection, tracking, activity recognition, multi-people multi-object interaction and human motion analysis, which, surprisingly, are treated independently in these communities.
- Publish LNCS post-proceedings, as previously done for the 2006, 2008 and 2010 seminars. These will include the scientific contributions of the seminar participants, focusing especially on the topics discussed at the seminar.
Organization of the seminar
During the workshop we discussed different modeling techniques and the experience researchers have gathered with them. We discussed sensitivity, runtime performance and, for example, the number of parameters required by particular algorithms, as well as the possibilities for context-aware, adaptive and interacting algorithms. Furthermore, we had extensive discussions on open questions in these fields.
On the first day, the organizers provided general information about Dagstuhl seminars, the philosophy behind Dagstuhl and what is expected of the participants. We also clarified the kitchen rules and organized a running group for the early mornings (5 people participated frequently!).
Social event.
On Wednesday afternoon we organized two events: one group took a trip to Trier, and another group went on a three-hour hike in the surrounding area.
Working Groups.
To strongly encourage discussions during the seminar, we organized a set of working groups on the first day (each with 8 to 12 people). We selected the following topics:
- What does "Scene Understanding" mean?
- Dynamic Scene: Humans.
- Recognition in static scenes (in 3D).
Two afternoon slots were reserved for these working groups, and the outcome of each working group was presented in the Friday morning session.
LNCS Post-Proceedings.
We will edit post-proceedings and invite participants to submit articles. In contrast to standard conference articles, we allow for more space (typically 25 single-column pages) and allow authors to integrate open questions, preliminary results, ideas, etc. from the seminar into the proceedings. Additionally, we will encourage joint publications by participants who started to collaborate after the seminar. All articles will be reviewed by at least two reviewers, and accepted papers will be published based on this evaluation. We will publish the proceedings in the Lecture Notes in Computer Science (LNCS) series by Springer. The papers will be collected during the summer months.
Overall, it was a great seminar and we received very positive feedback from the participants. We would like to thank Schloss Dagstuhl for hosting the event and are looking forward to revisiting Dagstuhl whenever possible.
Participants
- Gabriel Brostow (University College London, GB)
- Marco Cristani (University of Verona, IT)
- Alessio Del Bue (Italian Institute of Technology - Genova, IT)
- Markus Enzweiler (Daimler AG - Böblingen, DE)
- Michele Fenzi (Leibniz Universität Hannover, DE)
- Sanja Fidler (University of Toronto, CA)
- Robert Fisher (University of Edinburgh, GB)
- Jan-Michael Frahm (University of North Carolina at Chapel Hill, US)
- Jürgen Gall (Universität Bonn, DE)
- Shaogang Gong (Queen Mary University of London, GB)
- Abhinav Gupta (Carnegie Mellon University - Pittsburgh, US)
- Michal Havlena (ETH Zürich, CH)
- Roberto Henschel (Leibniz Universität Hannover, DE)
- Esther Horbert (RWTH Aachen, DE)
- Jörn Jachalsky (Technicolor - Hannover, DE)
- Ron Kimmel (Technion - Haifa, IL)
- Reinhard Klette (University of Auckland, NZ)
- Laura Leal-Taixé (ETH Zürich, CH)
- Oisin Mac Aodha (University College London, GB)
- Jiri Matas (Czech Technical University, CZ)
- Greg Mori (Simon Fraser University - Burnaby, CA)
- Vittorio Murino (Italian Institute of Technology - Genova, IT)
- Caroline Pantofaru (Google Inc. - Mountain View, US)
- Matthias Reso (Leibniz Universität Hannover, DE)
- Anna Rohrbach (MPI für Informatik - Saarbrücken, DE)
- Bodo Rosenhahn (Leibniz Universität Hannover, DE)
- Bernt Schiele (MPI für Informatik - Saarbrücken, DE)
- Konrad Schindler (ETH Zürich, CH)
- Min Sun (National Tsing Hua University - Hsinchu, TW)
- Raquel Urtasun (University of Toronto, CA)
- Sebastiano Vascon (Italian Institute of Technology - Genova, IT)
- Stefan Walk (Qualcomm Austria Research Center GmbH, AT)
- Jan Dirk Wegner (ETH Zürich, CH)
- Michael Yang (Leibniz Universität Hannover, DE)
- Angela Yao (Universität Bonn, DE)
Related Seminars
- Dagstuhl Seminar 9411: Theoretical Foundations of Computer Vision (1994-03-14 - 1994-03-18)
- Dagstuhl Seminar 9612: Theoretical Foundations of Computer Vision (1996-03-18 - 1996-03-22)
- Dagstuhl Seminar 98111: Evaluation and Validation of Computer Vision Algorithms (1998-03-16 - 1998-03-20)
- Dagstuhl Seminar 00111: Multi-Image Search, Filtering, Reasoning and Visualisation (2000-03-12 - 2000-03-17)
- Dagstuhl Seminar 02151: Theoretical Foundations of Computer Vision -- Geometry, Morphology, and Computational Imaging (2002-04-07 - 2002-04-12)
- Dagstuhl Seminar 04251: Imaging Beyond the Pin-hole Camera. 12th Seminar on Theoretical Foundations of Computer Vision (2004-06-13 - 2004-06-18)
- Dagstuhl Seminar 06241: Human Motion - Understanding, Modeling, Capture and Animation. 13th Workshop on Theoretical Foundations of Computer Vision (2006-06-11 - 2006-06-16)
- Dagstuhl Seminar 08291: Statistical and Geometrical Approaches to Visual Motion Analysis. 14th Workshop "Theoretic Foundations of Computer Vision" (2008-07-13 - 2008-07-18)
- Dagstuhl Seminar 11261: Outdoor and Large-Scale Real-World Scene Analysis. 15th Workshop "Theoretic Foundations of Computer Vision" (2011-06-26 - 2011-07-01)
Classification
- artificial intelligence / robotics
- computer graphics / computer vision
- modelling / simulation
Keywords
- Scene Analysis
- Image Understanding
- Crowd Analysis
- People and Object Recognition