Dagstuhl Seminar 17301
User-Generated Content in Social Media
(Jul 23 – Jul 28, 2017)
Organizers
- Tat-Seng Chua (National University of Singapore, SG)
- Norbert Fuhr (Universität Duisburg-Essen, DE)
- Gregory Grefenstette (IHMC - Paris, FR)
- Kalervo Järvelin (University of Tampere, FI)
- Jaakko Peltonen (Aalto University, FI)
Contact
- Jutka Gasiorowski (for administrative matters)
Social media play a central role in many people’s lives, and they also have a profound impact on businesses and society. Users post vast amounts of content (text, photos, audio, video) every minute. This user-generated content (UGC) has become increasingly multimedia in nature. It documents users’ lives, revealing in real time their interests, concerns, and activities in society. The analysis of UGC can offer insights into individual and societal concerns and could benefit a wide range of applications, for example, tracking mobility in cities, identifying citizens’ issues, opinion mining, and much more.
In contrast to classical media, social media thrive by allowing anyone to publish content with few constraints and no oversight. Social media posts thus vary greatly in length, content, quality, language, speech and other aspects. This heterogeneity poses new challenges for standard content access and analysis methods. On the other hand, UGC is often related to other public information (e.g. product reviews or discussions of news articles), and there is often rich contextual information linking the two, which allows for new types of analyses.
In this seminar, we aim to discuss the specific properties of UGC and the general research tasks currently operating on this type of content, to identify their limitations and lacunae, and to imagine new types of applications made possible by the availability of vast amounts of UGC.
We will identify specific properties of UGC such as presentation quality and style, bias and subjectivity of content, credibility of sources, contradictory statements, and heterogeneity of language and media.
We will discuss current applications exploiting UGC, such as sentiment analysis, noise removal, indexing and retrieving UGC, recommendation and selection methods, summarization methods, credibility and reliability estimation, topic detection and tracking, topic development analysis and prediction, community detection, modeling of content and user interest trends, collaborative content creation, cross-media and cross-lingual analysis, multi-source and multi-task analysis, social media sites, live and real-time analysis of streaming data, and machine learning for big data analytics of UGC. These applications and methods involve contributions from several data analysis and machine learning research directions.
We will imagine new applications exploiting UGC in areas such as marketing, political campaigns, crisis management, eHealth, customer reviews for shopping, socio-political analyses, smart cities, user mobility analysis, and user profiling.
The general goal of this seminar is to collect, discuss and understand the state of the art in research on UGC, to identify unresolved problems and to define a research agenda for further work in this area. We are especially interested in properties, tasks, and applications that span multiple content types or different social media systems.
About three months before the seminar, we will ask invited participants to fill out an online questionnaire, describing their specific interest in the topic, naming crucial and urgent research issues, stating their willingness to give a survey talk or give a more specific presentation, as well as proposing working group topics. Based on this input, we will schedule about 4-5 longer survey talks, each followed by a few short, specialized presentations from the respective areas. In addition, we will reserve at least two half days for working group sessions on topics chosen by the participants.
In this seminar, we aimed to discuss the specific properties of UGC and the general research tasks currently operating on this type of content, to identify their limitations and lacunae, and to imagine new types of applications made possible by the availability of vast amounts of UGC. This type of content has specific properties such as presentation quality and style, bias and subjectivity of content, credibility of sources, contradictory statements, and heterogeneity of language and media. Current applications exploiting UGC include sentiment analysis, noise removal, indexing and retrieving UGC, recommendation and selection methods, summarization methods, credibility and reliability estimation, topic detection and tracking, topic development analysis and prediction, community detection, modeling of content and user interest trends, collaborative content creation, cross-media and cross-lingual analysis, multi-source and multi-task analysis, social media sites, live and real-time analysis of streaming data, and machine learning for big data analytics of UGC. These applications and methods involve contributions from several data analysis and machine learning research directions.
This seminar brought together researchers from different subfields of computer science, such as information retrieval, multimedia, natural language processing, machine learning and social media analytics. After participants gave presentations of their current research orientations concerning UGC, we decided to split into two Working Groups: (WG-1) Fake News and Credibility, and (WG-2) Summarizing and Storytelling from UGC.
WG-1: Fake News and Credibility
WG-1 began by discussing the concept of Fake News and concluded that the topic has much nuance, and that a hard and fast definition of what is fake and what is real news would be difficult to formulate. We then concentrated on deciding which elements of Fake (or Real) News could be calculated or quantified by computer. This led us to construct a list of text quality measures that have been or are being studied in the Natural Language Processing community: Factuality, Reading Level, Virality, Emotion, Opinion, Controversy, Authority, Technicality, and Topicality. During this discussion, WG-1 invented and mocked up what we called an Information Nutritional Label, modeled after the nutritional labels found on most food products nowadays. We feel that it would be possible to produce some indication of the "objective" value of a text using the above nine measures. Users could use these measures to judge for themselves whether a given text was "fake" or "real". For example, a text highly charged in Emotion, Opinion, Controversy, and Topicality might be Fake News for a given reader. Just as with a food nutritional label, a reader might use the Information Nutritional Label to judge whether a given news story was "healthy" or not.
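As a rough illustration of the label idea, the nine measures could be held in a small record and printed as a text label. This is only a minimal sketch, not the working group's design: the class name, the assumption that every measure is normalized to the range [0, 1], the bar rendering, and the example scores are all hypothetical. In line with the text above, the sketch only displays the measures and leaves the judgment of "fake" or "real" to the reader.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class InformationNutritionalLabel:
    """Nine per-text measures; normalization to [0, 1] is an assumption."""
    factuality: float
    reading_level: float
    virality: float
    emotion: float
    opinion: float
    controversy: float
    authority: float
    technicality: float
    topicality: float

    def __post_init__(self):
        # Reject scores outside the assumed [0, 1] range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def render(self, width: int = 20) -> str:
        """Return a plain-text label with one bar per measure."""
        lines = ["Information Nutritional Label"]
        for f in fields(self):
            value = getattr(self, f.name)
            bar = "#" * round(value * width)
            name = f.name.replace("_", " ").title()
            lines.append(f"{name:<14} {bar:<{width}} {value:.2f}")
        return "\n".join(lines)


# Hypothetical scores for a single text; real values would come from
# separate analysis components, one per measure.
label = InformationNutritionalLabel(
    factuality=0.3, reading_level=0.5, virality=0.8, emotion=0.9,
    opinion=0.85, controversy=0.7, authority=0.2, technicality=0.1,
    topicality=0.9,
)
print(label.render())
```

Keeping the measures as separate fields, rather than collapsing them into one score, mirrors the food-label analogy: the reader sees each component and weighs it for themselves.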
WG-1 split into further subgroups to explore the current status of research in the nine areas: Factuality, Reading Level, etc. For each topic, the subgroups sketched out the NLP task involved, found current packages, testbeds and datasets for the task, and provided a recent bibliography for the topic. After re-uniting in one larger group, each subgroup reported on its findings, and we discussed next steps, envisaging the following options: a patent covering the idea, a startup that would implement all nine measures and produce a time-sensitive Information Nutritional Label for any text submitted to it, a hackathon that would ask programmers to create packages for any or all of the measurements, a further workshop around the Information Nutritional Label, integration of the label into the teaching of journalists, and a joint article describing the idea. We opted for the final idea, and we produced a submission (also attached to this report) for the Winter issue of the SIGIR (Special Interest Group on Information Retrieval) Forum.
WG-2: Summarizing and Storytelling from UGC
WG-2 set out to re-examine the topic of summarization. Although this is an old topic, in the era of user-generated content, with its accelerated rates of information creation and dissemination, there is a strong need to re-examine it from the new perspectives of timeliness, huge volume, multiple sources and multimodality. The temporal nature of this problem also brings it into the realm of storytelling, which has so far been treated separately from summarization. We thus need to move away from traditional single-source, document-based summarization by integrating summarization and storytelling and refocusing the problem space to meet the new challenges.
We first split the group into two sub-groups to discuss separately: (a) the motivations and scope, and (b) the framework of summarization. The first sub-group discussed the sources of information for summarization, including user-generated content, various authoritative information sources such as the news and Wikipedia, sensor data, open data and proprietary data. The data is multilingual and multimodal, and often arrives in real time. The sub-group then discussed storytelling as a form of dynamic summarization. The second sub-group examined the framework for summarization. It identified the key pipeline processes: data ingestion, extraction, reification, and knowledge representation, followed by story generation. In particular, the sub-group discussed the roles of time and location in data, knowledge and story representation.
Finally, the group identified key challenges and applications of the summarization framework. The key challenges include multi-source data fusion, multilinguality and multimodality, the handling of time/temporality/history, data quality assessment and explainability, knowledge update and renewal, as well as focused story/summary generation. The applications that can be used to focus the research include event detection, business intelligence, entertainment and wellness. The discussions have been summarized in a paper entitled "Rethinking Summarization and Storytelling for Modern Social Multimedia". The paper is attached to this report and has been submitted to a conference for publication.
Participants
- Tat-Seng Chua (National University of Singapore, SG)
- Nicolas Diaz-Ferreyra (Universität Duisburg-Essen, DE)
- Gerald Friedland (University of California - Berkeley, US)
- Norbert Fuhr (Universität Duisburg-Essen, DE)
- Anastasia Giachanou (University of Lugano, CH)
- Tatjana Gornostaja (tilde - Riga, LV)
- Gregory Grefenstette (IHMC - Paris, FR)
- Iryna Gurevych (TU Darmstadt, DE)
- Andreas Hanselowski (TU Darmstadt, DE)
- Xiangnan He (National University of Singapore, SG)
- Benoit Huet (EURECOM - Sophia Antipolis, FR)
- Kalervo Järvelin (University of Tampere, FI)
- Rosie Jones (Microsoft New England R&D Center - Cambridge, US)
- Rianne Kaptein (Crunchr - Amsterdam, NL)
- Krister Lindén (University of Helsinki, FI)
- Yiqun Liu (Tsinghua University - Beijing, CN)
- Marie-Francine Moens (KU Leuven, BE)
- Josiane Mothe (University of Toulouse, FR)
- Wolfgang Nejdl (Leibniz Universität Hannover, DE)
- Jaakko Peltonen (Aalto University, FI)
- Isabella Peters (ZBW – Dt. Zentralbib. Wirtschaftswissenschaften, DE)
- Miriam Redi (NOKIA Bell Labs - Cambridge, GB)
- Stevan Rudinac (University of Amsterdam, NL)
- Markus Schedl (Universität Linz, AT)
- David Ayman Shamma (CWI - Amsterdam, NL)
- Alan Smeaton (Dublin City University, IE)
- Benno Stein (Bauhaus-Universität Weimar, DE)
- Lexing Xie (Australian National University - Canberra, AU)
Classification
- data bases / information retrieval
- multimedia
- world wide web / internet
Keywords
- Social media
- information extraction
- multimedia retrieval and annotation
- trend detection
- e-reputation