Dagstuhl Seminar 23191
Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics
(May 07 – May 12, 2023)
Organizers
- Timothy Baldwin (MBZUAI - Abu Dhabi, AE)
- William Croft (University of New Mexico - Albuquerque, US)
- Joakim Nivre (Uppsala University, SE)
- Agata Savary (University Paris-Saclay, CNRS - Orsay, FR)
Contact
- Michael Gerke (for scientific matters)
- Christina Schwarz (for administrative matters)
The Dagstuhl Seminar 23191, entitled "Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics", was the culmination of long-standing efforts initiated as early as October 2018. At that time we submitted a Dagstuhl Seminar proposal, which was selected to take place in Dagstuhl on June 21-26, 2020. Due to the COVID-19 pandemic, the event was first rescheduled to August 29 to September 3, 2021, and finally transformed into a reduced online seminar under the same title on August 30-31, 2021 [1]. Despite its much reduced format, the seminar achieved part of its objectives and provided a proof of concept of the initial proposal. Encouraged by the participants, we resubmitted roughly the same proposal in November 2021 for a full-fledged on-site event, which was then selected to take place in Dagstuhl on May 7-12, 2023.
The objectives, following the initial 2018 proposal, were threefold:
- Theoretical: To deepen the understanding of language universals, and of linguistic idiosyncrasy in particular, so as to further promote unified modelling while preserving diversity.
- Practical: To harness idiosyncrasy in treebanking frameworks, in computationally tractable ways and, thus, to foster high quality NLP tools for very many languages.
- Networking: To promote a higher degree of convergence to universalism-driven initiatives, while focusing on three main aspects of language modelling: morphology, syntax, and semantics.
The program of the event followed the Dagstuhl model:
- A list of recommended readings was published prior to the event.
- Recordings from the introductory talks, given by the 4 organizers at the 2021 online seminar, ensured common understanding of the terminology, scope and challenges to address.
- Personal introductions of all participants helped achieve a community building effect.
- Six outstanding speakers were invited to give plenary inspirational talks.
- Working groups (WGs) were built in a bottom-up manner on the basis of discussion issues submitted by the participants. WGs ran in parallel, were each coordinated and minuted by two co-leaders, and were organized as follows.
  - For days 1 and 2 (Monday-Tuesday) the discussion issues were submitted by the participants prior to the event. On this basis 5 WGs were formed:
    - WG1 - Below and beyond word boundaries (co-leaders: Daniel Zeman and Reut Tsarfaty)
    - WG2 - Annotation of particular kinds of constructions (co-leaders: Manfred Sailer and Nathan Schneider)
    - WG3 - Representing the semantics of MWEs (co-leaders: Dag Haug and Nianwen Xue)
    - WG4 - Finding idiosyncrasy in corpora (co-leaders: Francis Bond and Nurit Melnik)
    - WG5 - Methodological Issues and community interactions (co-leaders: Amir Zeldes and Gosse Bouma)
  - Day 3 was dedicated to reporting, collecting new issues and re-designing the WGs.
  - As a result, 5 other WGs were formed for days 4 and 5 and presented their reports on day 5:
    - WG6 - Below and beyond word boundaries (co-leaders: David Yarowsky and Omer Goldman), continuation of WG1
    - WG7 - Construction grammar meets Universal Dependencies (co-leaders: Lori Levin and Peter Ljunglöf), continuation of WG2 and WG4
    - WG8 - To semantics and beyond! (co-leaders: Archna Bhatia and Kilian Evang)
    - WG9 - Cross community/formalism discussions (big, hairy problems) (co-leaders: Chris Manning and Laura Kallmeyer)
    - WG10 - Large language models (and other NLP tools) (co-leaders: Francis Tyers and Mathieu Constant). No report was provided for this group, which only met for a short session before splitting into other groups.
- Wednesday afternoon featured a hike in the surrounding countryside.
- The evenings were dedicated to socializing. This included a piano-violin duet, a guitar duet, a jazz improvisation, a swing dancing duet, and a choir singing songs suggested by the participants, in English, Georgian, German, and Latin (for the sake of language diversity!).
All the inputs and the outcomes produced during the event (minutes, slides, useful links) can be downloaded from our Wiki space.
The event attracted 37 participants. Their feedback during and after the event was mostly enthusiastic. At least one group formed at the event continues online meetings to further discuss the scientific challenges (representation of constructions in the Universal Dependencies framework).
Based on the reports submitted by the WG co-leaders and by individual proposers of discussion issues, we can estimate the extent to which the event achieved its initial objectives.
- On the networking side, the seminar brought together several pre-existing communities and allowed them to achieve synergies:
  - Linguistic experts specialized in analyzing constructions and collecting them in so-called constructicons collaborated intensely with NLP experts, notably on the problem of how to represent constructions formally and query them in corpora.
  - While the community of typology experts was unfortunately under-represented (despite the best efforts of the organizers), the few attending experts were frequently consulted, which yielded several enlightening discussions.
  - The communities of Universal Dependencies (UD) and Universal Morphology (UniMorph) converged, even further than initially expected, around the problems of annotating subword units.
  - The communities of UD and PARSEME, which had started aligning objectives prior to the seminar, further strengthened coordination.
  - New links were established between PARSEME and the Universal Meaning Representation (UMR) community. This effect is important since the former models lexical and morpho-syntactic properties of MWEs, while the latter offers a framework for representing their semantics.
  - An unplanned networking effect also occurred between our seminar and Dagstuhl Seminar 23192 on "Topological Data Analysis and Applications", running in parallel on the same site. Bei Wang Phillips (University of Utah - Salt Lake City, US) gave an evening invited talk to our invitees on the applications of topological methods to interpretability of word embeddings in distributional semantics.
- On the theoretical side, the seminar focused even more than expected on the notion of construction, which is broader and harder to capture than multiword expressions, and has been defined in widely divergent ways across different communities. The confluence of different communities led to theoretical results including the following:
  - Steps were taken towards a formal definition of construction, as an expression in a formal graph language (similar to the one supported by the Grew-match corpus browser).
  - Advances in formalizing the notion of an "interesting" construction, which relates to the notion of idiosyncrasy, a core concept (in a narrower guise) in the multiword expression community.
  - Formalizing the task of searching for "a similar but different construction" as an instance of the theoretical problem of approximate tree/graph matching.
  - Progress towards understanding the notion of idiosyncrasy as an instance of rule breaking which is "creative" and "has a purpose", as opposed to, for instance, plain grammar/spelling errors (rule breaking with no purpose).
  - Understanding idiosyncrasy via cross-linguistic triangulation: what is seen as idiosyncratic in one language can be systematic across many languages/language families (e.g. kinship terms).
  - Progress towards formalizing the annotation of semantics of UD and multiword expressions, especially for temporal and negation expressions.
- On the practical side, discussions at the seminar led to a number of proposals for tools, procedures or practices to support interdisciplinary research. Some of these were tested out already in Dagstuhl, while others are being realized in follow-up activities to the seminar. The following is a non-exhaustive list of examples:
  - Practical steps were taken towards improved UD guidelines for multiword expressions, which will facilitate interfacing UD and PARSEME in the future.
  - Concrete guidelines were drafted for representing subword units in UD, which will facilitate the integration of resources from UD and UniMorph.
  - Discussions of construction-oriented UD guidelines (based on "a Swadesh list for morphosyntax") resulted in a prototype implementation with links to annotation examples in different languages.
  - Discussions of future extensions of UD explored concrete proposals for new feature mechanisms to incorporate notions of constituency.
  - Practical exercises demonstrated how the Grew-match system (and other search tools) can be used to search for constructions in linguistic corpora.
  - Participants discussed concrete proposals for automatically identifying idiosyncratic phenomena in corpora.
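To make the idea of a construction as a graph pattern queried against a dependency-annotated corpus more concrete, the following is a minimal Python sketch. It is not Grew-match syntax, nor any tool or formalism produced at the seminar: the names `Token` and `match_pattern`, the pattern encoding, and the example sentence with its UD-style annotation are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Token:
    """One word of a dependency tree (hypothetical, UD-like fields)."""
    id: int
    form: str
    lemma: str
    upos: str
    head: int      # id of the governing token, 0 for the root
    deprel: str    # relation of this token to its head


def match_pattern(tokens, nodes, edges):
    """Return every binding of pattern nodes to distinct tokens.

    nodes: dict mapping a node name to required token attributes
    edges: list of (head_name, deprel, dependent_name) constraints
    """
    names = list(nodes)
    results = []
    # Brute-force search over ordered selections of distinct tokens.
    for combo in permutations(tokens, len(names)):
        binding = dict(zip(names, combo))
        nodes_ok = all(
            getattr(binding[name], attr) == value
            for name, constraints in nodes.items()
            for attr, value in constraints.items()
        )
        edges_ok = all(
            binding[dep].head == binding[head].id
            and binding[dep].deprel == rel
            for head, rel, dep in edges
        )
        if nodes_ok and edges_ok:
            results.append(binding)
    return results


# Illustrative annotation of "She paid him a visit" (an assumption,
# not taken from any treebank).
SENTENCE = [
    Token(1, "She", "she", "PRON", 2, "nsubj"),
    Token(2, "paid", "pay", "VERB", 0, "root"),
    Token(3, "him", "he", "PRON", 2, "iobj"),
    Token(4, "a", "a", "DET", 5, "det"),
    Token(5, "visit", "visit", "NOUN", 2, "obj"),
]

# Hypothetical pattern for the light-verb construction "to pay a visit":
# a verb with lemma "pay" governing a noun via the obj relation.
PATTERN_NODES = {
    "V": {"upos": "VERB", "lemma": "pay"},
    "N": {"upos": "NOUN"},
}
PATTERN_EDGES = [("V", "obj", "N")]

matches = match_pattern(SENTENCE, PATTERN_NODES, PATTERN_EDGES)
for m in matches:
    print(m["V"].form, m["N"].form)
```

The exhaustive search is only practical for very small patterns; it is used here to keep the sketch short. An approximate variant, in the spirit of searching for "a similar but different construction", could tolerate a bounded number of violated constraints instead of requiring all of them to hold.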
The survey organized by the Dagstuhl Officers shortly after the event shows very encouraging results (in most categories the seminar was ranked higher than the average of the Dagstuhl Seminars from the past 60 days). The major drawbacks noted by the participants were the insufficient numbers of experts in typology (less than 5%) and of young researchers (about 32%). The former was notably due to the last-minute cancellation, for personal reasons, of William Croft, one of the 4 co-organizers of the event.
References
[1] Timothy Baldwin, William Croft, Joakim Nivre, and Agata Savary. 2021. Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics (Dagstuhl Seminar 21351). Dagstuhl Reports, 11(7), pages 89-138.
Computational linguistics builds models that can usefully process and produce language and that can increase our understanding of linguistic phenomena. From a computational perspective, language is particularly challenging notably due to its variable degree of idiosyncrasy (unexpected properties shared by few peer objects), and the pervasiveness of non-compositional phenomena such as multiword expressions (whose meaning cannot be straightforwardly deduced from the meanings of their components, e.g. red tape, by and large, to pay a visit and to pull one’s leg) and constructions (conventional associations of forms and meanings). Additionally, if models and methods are to be consistent and valid across languages, they have to face specificities inherent either to particular languages, or to various linguistic traditions.
A few existing initiatives, such as Universal Dependencies¹, PARSEME² and UniMorph³, have been addressing these challenges with the aim of revealing the universals of idiosyncrasy in language, proposing cross-lingually applicable typologies and methodologies for language modelling, and creating highly multilingual language resources and tools. These efforts have been carried out relatively independently, resulting in partly diverging terminologies and methods.
The objectives of this Dagstuhl Seminar are threefold:
- Theoretical: To deepen the understanding of language universals, and of how they apply to linguistic idiosyncrasy, so as to further promote unified modelling while preserving diversity.
- Practical: To improve the treatment of idiosyncrasy in treebanking frameworks, in computationally tractable ways and, thus, to foster high quality NLP tools for more languages with greater typological diversity.
- Networking: To promote a higher degree of convergence across typology-driven initiatives, while focusing on three main aspects of language modelling: morphology, syntax, and semantics.
In order to pursue these objectives, we propose a list of research questions grouped into thematic categories:
- Atomic units of language: Identifying words across languages. Relation of syntactic words to lexical units. Morphological universals in words.
- Syntactic annotation in the presence of idiosyncrasies: Annotating expressions which are partly regular and partly irregular. Capturing syntactic idiosyncrasies of MWEs in a way that expresses generalisations at the level of types rather than tokens. The interplay between lexicon and treebanking.
- Syntax-semantics interface in treebanking: Division of labor between syntactic and semantic annotation. Modeling expressions whose regular vs. idiosyncratic nature is particularly hard to capture: serial verbs, light-verb constructions (to pay a visit) and verb-particle constructions (to bring about), functional MWEs (in spite of, because of, not only).
- Universals of idiosyncrasy: Universals of linguistic idiosyncrasy established so far. Cross-lingual characterization of idiomaticity and syntactic irregularity. Relations between syntactic irregularity and semantic non-compositionality.
- Semantics of MWEs: Defining and testing semantic non-compositionality for rigorous and reproducible MWE annotation. Semantic calculus in MWEs.
- Exploratory issues: Long-term objectives to consider for universal-driven initiatives. Extension of the existing models and methods to syntactic constructions.
The expected outcomes of the seminar include: (i) enhanced unified versions of the already existing annotation guidelines put forward by UD, PARSEME and UniMorph, (ii) criteria for applying unified guidelines to specific languages, (iii) recommendations on syntactic and semantic representation of MWEs in lexicons, and (iv) recommendations on how to cover grammatical constructions within treebanking frameworks and NLP tools.
The list of invitees includes researchers in NLP, linguistics and typology, with expertise in morphology, syntax, semantics, MWEs, constructions, annotation, parsing, and dozens of languages from diverse language families. They are based in 22 countries, spread across 5 continents.
This Dagstuhl Seminar is a follow-up event to the 2-day online seminar held on 30-31 August 2021 (21351) under the same title. That seminar met part of the initial objectives and provided a proof of concept for the project behind the current seminar. We would be very happy to have you on board!
¹ http://universaldependencies.org/
² http://www.parseme.eu; https://gitlab.com/parseme/corpora/-/wikis/
³ https://unimorph.github.io/
Participants
- Timothy Baldwin (MBZUAI - Abu Dhabi, AE)
- Emily M. Bender (University of Washington - Seattle, US)
- Archna Bhatia (Florida IHMC - Ocala, US)
- Nina Böbel (Universität Düsseldorf, DE)
- Francis Bond (Palacký University Olomouc, CZ)
- Gosse Bouma (University of Groningen, NL)
- Jörg Bücker (Universität Düsseldorf, DE)
- Mathieu Constant (ATILF - Nancy, FR)
- Marie-Catherine de Marneffe (FNRS - UC Louvain, BE & Ohio State University - Columbus, US)
- Kilian Evang (Universität Düsseldorf, DE)
- Daniel Flickinger (North Newton, US)
- Omer Goldman (Bar-Ilan University - Ramat Gan, IL)
- Jan Hajic (Charles University - Prague, CZ)
- Dag Haug (University of Oslo, NO)
- Sylvain Kahane (University Paris Nanterre, FR)
- Laura Kallmeyer (Universität Düsseldorf, DE)
- Maria Koptjevskaja-Tamm (Stockholm University, SE)
- Lori Levin (Carnegie Mellon University - Pittsburgh, US)
- Peter Ljunglöf (University of Gothenburg, SE)
- Teresa Lynn (MBZUAI - Abu Dhabi, AE)
- Christopher Manning (Stanford University, US)
- Nurit Melnik (The Open University of Israel - Raanana, IL)
- Joakim Nivre (Uppsala University, SE)
- Alexandre Rademaker (IBM Research - Sao Paulo, BR)
- Carlos Ramisch (Aix-Marseille University, FR)
- Manfred Sailer (Goethe-Universität Frankfurt am Main, DE)
- Agata Savary (University Paris-Saclay, CNRS - Orsay, FR)
- Nathan Schneider (Georgetown University - Washington, DC, US)
- Sara Stymne (Uppsala University, SE)
- Reut Tsarfaty (Bar-Ilan University - Ramat Gan, IL)
- Francis M. Tyers (Indiana University - Bloomington, US)
- Ekaterina Vylomova (The University of Melbourne, AU)
- Leonie Weissweiler (LMU München, DE)
- Nianwen Xue (Brandeis University - Waltham, US)
- David Yarowsky (Johns Hopkins University - Baltimore, US)
- Amir Zeldes (Georgetown University - Washington, DC, US)
- Daniel Zeman (Charles University - Prague, CZ)
Related Seminars
- Dagstuhl Seminar 21351: Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics (2021-08-30 - 2021-08-31)
Classification
- Artificial Intelligence
- Computation and Language
Keywords
- computational linguistics
- morphosyntax
- multiword expressions
- language universals
- idiosyncrasy