Task 0: Typologically Diverse Morphological Inflection

SIGMORPHON’s fifth installment of its inflection generation shared task focuses on languages that are typologically distinct from the languages in our previous tasks. Many of these languages are extremely low-resource. In this edition, we are specifically interested in inflection generation systems’ ability to generalize to new languages, including typologically distinct ones. For example, if you have a neural network architecture that works well for a sample of Indo-European languages, should you expect the same architecture to also work well for Tupi–Guarani languages (where nouns are “declined” for tense)? The organizers suspect not, but you could prove us wrong!

Shared Task Description

In this shared task, participants will design a model that learns to generate morphological inflections from a lemma and a set of morphosyntactic features of the target form. Each language in the task has its own training, development, and test splits. Training and development splits contain triples, each consisting of a lemma, a target form, and a set of morphological features, provided in the UniMorph format (the “Data” section below provides an example of the input format). Test splits provide only lemmas and morphological tags: your model will need to predict the missing target form.

The model should be general enough to work for natural languages of any typological patterning.1 For example, Tagalog verbs exhibit circumfixation; thus, a model with a strong inductive bias towards suffixing will likely not work well for Tagalog. The task will proceed in three phases: a Development Phase, a Generalization Phase, and an Evaluation Phase. As the task progresses, more data and more languages will be released.

In the Development Phase, we will provide training and development splits from the Austronesian, Niger-Congo, Oto-Manguean, Uralic, and Indo-European language families that should be used to develop your system. We will refer to them as the development languages. See the table below for a complete list.

In the Generalization Phase, we will provide training and development splits for new languages where approximately half are genetically related (belong to the same family) and half are genetically unrelated (are isolates or belong to a different family) to the development languages. We will keep the languages in the Generalization Phase a surprise until April 2020 (see timeline). We will also keep the genetically unrelated language families a surprise, though some languages will come from the same families as those in the Development Phase.

In the Evaluation Phase, the participants’ models will be evaluated on held-out forms from all of the languages from the previous phases. The languages from the Development Phase and the Generalization Phase are evaluated simultaneously; the only difference is that participants will have had more time to construct models for the languages released in the Development Phase. It follows that a model could easily overfit to, or favor, phenomena that are more frequent in the Development Phase languages, especially if parameters are shared across languages. For instance, a model based on the morphological patterning of the Indo-European languages may end up with a bias towards suffixing and struggle to learn prefixing or circumfixation; the degree of such bias only becomes apparent when experimenting on languages whose inflectional morphology patterns differ. Of course, the model architecture itself could explicitly or implicitly favor certain word-formation types (suffixing, prefixing, etc.).

1 See References

Glossary

The shared task features both held-out morphological inflection triples and surprise languages. For clarity, the organizers have created a short glossary of task-specific terminology, and we use its terms consistently throughout the task description.

  • Development language: A language for which the participants will have an extended period of time (about two months) to construct a machine learning model for morphological inflection generation.
  • Surprise language: A language for which the participants will have a short period of time (about one week) to construct or adapt a machine learning model for morphological inflection generation. The idea is that the participants apply the knowledge accrued from the Development Phase to choose a good model and good hyperparameters. Most of the surprise languages will be typologically distinct from the development languages, i.e. the majority of the languages will be taken from language families other than the ones used during the Development Phase.
  • Training split: A selection of lemma–form–tag triples for a language (either development or surprise) that the participants may train their machine learning model on.
  • Development split: A selection of lemma–form–tag triples for a language (either development or surprise) that the participants may tune the hyperparameters of their machine learning model on.
  • Test split: A selection of lemma–tag pairs for a language (either development or surprise) for which the participants will predict target forms. The organizers will evaluate the models based on these predictions.

Timeline

Stage 1: Development Phase

  • February 24th, 2020: Training and development splits for development languages released; we invite participants to report errors.
  • February 24th, 2020: Neural and non-neural baselines for development languages released.
  • March 1st, 2020: Development language data are frozen.

Stage 2: Generalization Phase

  • April 20th, 2020: Training and development splits for surprise languages released.
    (This is not a zero-shot learning task. Participants will be given training data for all languages.)

Stage 3: Evaluation Phase

  • April 27th, 2020: Test splits for all languages (both development and surprise) released.
  • May 4th, 2020: Participants submit test predictions on all languages.

Stage 4: Write-up Phase

  • May 17th, 2020: Participants’ system description papers due.
  • May 24th, 2020: Camera-ready versions of system description papers due.

Data

The training and development data are provided in a simple UTF-8 encoded text format for both the development and surprise languages. Each line in a file is an example consisting of a word form and its corresponding morphosyntactic description (MSD), provided as a set of features separated by semicolons. We refer to MSDs as (morphological) tags for simplicity. The fields on a line are TAB-separated, in the order: lemma, target form, tag. Here is an example from the Akan training data (the Akan verb “bisa” means “to ask” in English):

bisa     mmbisa     V;PRS;HAB;NEG

In the training data, we give all three fields. In the test phase, we omit field 2.
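
To make the format concrete, here is a minimal Python sketch (not an official tool) for reading a split. It assumes, following the description above, that test files simply omit the middle field:

def read_split(path, has_form=True):
    """Yield (lemma, form, tag) triples; form is None for test splits."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            fields = line.split("\t")
            if has_form:
                lemma, form, tag = fields
            else:
                lemma, tag = fields
                form = None
            yield lemma, form, tag

The Akan line above parses to ("bisa", "mmbisa", "V;PRS;HAB;NEG"), and tag.split(";") recovers the individual features.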

We will provide varying amounts of labeled training data, depending on the language, to assess models’ ability to generalize to novel forms. We will also provide information about each language’s family and sub-family, along with its WALS features, which participants may optionally use. For each language, the possible inflections are taken from a finite set of morphological tags, presented in the UniMorph schema.

Development Languages

The task features 90 languages in total.2 45 of these 90 languages are development languages. The development languages come from five language families: Austronesian, Niger-Congo, Uralic, Oto-Manguean, and Indo-European. We list each of the development languages with its family and genus (subfamily) below.

Language ISO 639-3 Family Genus # Train # Dev
Malagasy mlg Austronesian Barito 447 62
Cebuano ceb Austronesian Greater Central Philippine 420 58
Hiligaynon hil Austronesian Greater Central Philippine 859 116
Tagalog tgl Austronesian Greater Central Philippine 1870 236
Maori mao Austronesian Oceanic 145 21
Danish dan Indo-European North Germanic 17852 2550
Icelandic isl Indo-European North Germanic 53841 7690
Norwegian Bokmål nob Indo-European North Germanic 13263 1929
Swedish swe Indo-European North Germanic 54888 7840
Dutch nld Indo-European West Germanic 38826 5547
English eng Indo-European West Germanic 80865 11553
German deu Indo-European West Germanic 99405 14201
Middle High German gmh Indo-European West Germanic 496 71
North Frisian frr Indo-European West Germanic 1902 224
Old English ang Indo-European West Germanic 29270 4122
Chewa nya Niger-Congo Bantu 3059 429
Kongo kon Niger-Congo Bantu 568 76
Lingala lin Niger-Congo Bantu 159 23
Luganda lug Niger-Congo Bantu 3420 489
Sotho sot Niger-Congo Bantu 345 50
Swahili swa Niger-Congo Bantu 3374 469
Zulu zul Niger-Congo Bantu 322 42
Akan aka Niger-Congo Kwa 2793 380
Ga gaa Niger-Congo Kwa 607 79
Tlatepuzco Chinantec cpa Oto-Manguean Chinantecan 5298 727
San Pedro Amuzgos Amuzgo azg Oto-Manguean Amuzgo-Mixtecan 8482 1188
Yoloxóchitl Mixtec xty Oto-Manguean Amuzgo-Mixtecan 2110 299
Chichicapan Zapotec zpv Oto-Manguean Popolocan-Zapotecan 805 113
Yaitepec Chatino ctp Oto-Manguean Popolocan-Zapotecan 2397 313
Zenzontepec Chatino czn Oto-Manguean Popolocan-Zapotecan 1088 154
Eastern Highland Chatino cly Oto-Manguean Popolocan-Zapotecan 3301 471
Eastern Highland Otomi otm Oto-Manguean Oto-Pamean 21533 3020
Mezquital Otomi ote Oto-Manguean Oto-Pamean 22962 3231
Chichimec pei Oto-Manguean Oto-Pamean 10017 1349
Estonian est Uralic Finnic 26728 3820
Finnish fin Uralic Finnic 99403 14201
Ingrian izh Uralic Finnic 763 112
Karelian krl Uralic Finnic 80216 11225
Livonian liv Uralic Finnic 2787 398
Veps vep Uralic Finnic 94395 13320
Votic vot Uralic Finnic 1003 146
Meadow Mari mhr Uralic Mari 71143 10081
Erzya myv Uralic Mordvinic 74929 10738
Moksha mdf Uralic Mordvinic 46362 6633
Northern Sami sme Uralic Sami 43877 6273

2 The organizers may increase the number of total languages, if annotation efforts allow.

Surprise Languages

The remaining 45 of these 90 languages will be surprise languages. The shared task organizers will provide the participants with enough time (about a week, according to the current timeline) to train, on the surprise languages, a model they previously selected using the development languages. However, there will not be enough time to choose a new model or perform extensive hyperparameter tuning.

Which languages? They were a surprise; surprise no longer! The full list is below.

Language ISO 639-3 Family # Train # Dev
Maltese mlt Afro-Asiatic 1233 176
Oromo orm Afro-Asiatic 1424 203
Classical Syriac syc Afro-Asiatic 1917 275
Cree cre Algic 4571 584
Murrinh-Patha mwf Australian 777 111
Kannada kan Dravidian 3670 524
Telugu tel Dravidian 952 136
Middle Low German gml Germanic 890 127
Swiss German gsw Germanic 1345 192
Norwegian Nynorsk nno Germanic 10101 1443
Bengali ben Indo-Aryan 2816 402
Hindi hin Indo-Aryan 36300 5186
Sanskrit san Indo-Aryan 22968 3188
Urdu urd Indo-Aryan 8486 1213
Persian fas Iranian 25225 3603
Pushto; Pashto pus Iranian 4861 695
Tajik tgk Iranian 53 8
Shona sna Niger-Congo 1897 246
Zarma dje Nilo-Saharan 56 9
Asturian ast Romance 5096 728
Catalan cat Romance 51944 7421
Middle French frm Romance 24612 3516
Friulian fur Romance 5408 772
Galician glg Romance 24087 3441
Ladin lld Romance 5073 725
Venetian vec Romance 12203 1743
Anglo-Norman xno Romance 178 26
Tibetan bod Sino-Tibetan 3428 466
Dakota dak Siouan 2636 376
Evenki evn Tungusic 5413 774
Azerbaijani aze Turkic 5602 801
Bashkir bak Turkic 8517 1217
Crimean Tatar; Crimean Turkish crh Turkic 5215 745
Kazakh kaz Turkic 7852 1063
Kyrgyz kir Turkic 3855 547
Khakas kjh Turkic 840 120
Turkmen tuk Turkic 20963 2992
Uighur; Uyghur uig Turkic 5372 750
Uzbek uzb Turkic 25199 3596
Komi-Zyrian kpv Uralic 57919 8263
Ludian lud Uralic 294 41
Livvi olo Uralic 43936 6260
Udmurt udm Uralic 88774 12665
Võro vro Uralic 357 51
Papago (O'odham) ood Uto-Aztecan 1123 160

Multilingual Modeling [Recommendation]

Many of the languages in the shared task have only a few annotated morphological forms. In most cases, this is not because we have a larger stash that we are withholding, but rather because no further resource for such data is known to the organizers. To model these low-resource languages well, the organizers recommend a multilingual approach that exploits the genetic similarity within the development and surprise languages provided. For instance, since we only give the participants a handful of lemma–form–tag triples for, say, Kongo, generalization will be difficult without using data from related Niger-Congo languages.
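
As an illustrative sketch only (not a required recipe), one common way to pool data across related languages is to train a single model on their concatenated training sets, prepending a language token to each source sequence so that the model can still condition on the language. The sketch below reuses read_split from the Data section; the file names are hypothetical.

def make_source(lang, lemma, tag):
    """Prefix a language token; tags become tokens, the lemma becomes characters."""
    # e.g. make_source("aka", "bisa", "V;PRS;HAB;NEG") -> "<aka> V PRS HAB NEG b i s a"
    return " ".join([f"<{lang}>"] + tag.split(";") + list(lemma))

# Pool several related Niger-Congo languages into one training set.
pooled = []
for lang in ["kon", "lin", "lug", "nya"]:
    for lemma, form, tag in read_split(f"{lang}.trn"):
        pooled.append((make_source(lang, lemma, tag), " ".join(form)))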

Restrictions

Additional UniMorph (and ICGI) data beyond what is provided is not allowed for model training. There are no other restrictions on what sort of data you may use for this task. For example, if you would like to use a large, unlabeled corpus, such as Wikipedia, that is acceptable. You may also use a pre-trained language model, e.g. BERT (Devlin et al., 2019). However, we will evaluate models in two different categories: (1) those that use external resources (beyond what is provided by the task), and (2) those that do not. The constrained data category (2) will be restricted to monolingual models, while category (1) may include multilingual models – we encourage you to be creative! Participants are asked to clearly specify the submission category.

Evaluation

Our shared task also comes with a somewhat novel experimental design. We will simultaneously evaluate models for both the Development languages, whose training and development sets will be available for an extended period of time, and the Surprise languages, whose training and development sets will only be available for a short time prior to submission, which precludes extensive tuning. To be officially ranked, you must submit results for all evaluation languages. Thus, to succeed, your class of models (e.g. neural sequence-to-sequence models or weighted finite-state transducers with hand-crafted features) must generalize well to the group of Surprise languages, many of which are typologically distinct from the Development languages you performed model selection on. To repeat: this is not a zero-shot learning task; rather, our evaluation set-up is designed to test the inherent inductive bias in the participants’ chosen model class. We attribute the inspiration for this experimental design to Emily Bender, who often advocates for evaluations of this kind.

We will evaluate the accuracy on held-out forms separately for three categories of languages: 1) the Development languages, 2) genetically related Surprise languages, and 3) genetically unrelated Surprise languages. This tripartite split should give the field insight into how reliable the performance of certain classes of models is on typologically distinct languages. It should also help answer the following question: if my model class works well when trained on English and many other languages, will the same model class work well on languages whose linguistic characteristics differ from those of English?

As mentioned in Restrictions above, we will evaluate submissions in two categories: monolingual, constrained data models, and unconstrained – the world is your oyster!

Evaluation Script

We will distribute an evaluation script for your use on the development data. The script will report:

  • Accuracy = fraction of correctly predicted forms
  • Average Levenshtein distance between the prediction and the truth across all predictions
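
For development purposes, both metrics are straightforward to reproduce. The following Python sketch is unofficial; the distributed script remains the authoritative implementation.

def levenshtein(a, b):
    """Standard edit distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def evaluate(predicted, gold):
    """Accuracy and average Levenshtein distance over parallel lists of forms."""
    n = len(gold)
    accuracy = sum(p == g for p, g in zip(predicted, gold)) / n
    avg_distance = sum(levenshtein(p, g) for p, g in zip(predicted, gold)) / n
    return accuracy, avg_distance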

The official evaluation script that we will use for our internal evaluation will also be released. We encourage ablation studies to measure the advantage gained from particular innovations. You should perform these studies on the development data and report the findings in your system description paper.

Averaging

We will evaluate each language separately. A per-family aggregate evaluation will weight all languages in a family (see above) equally, i.e., macro-averaging, including the languages released later during the Generalization phase.
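
Concretely, macro-averaging means every language in a family contributes equally to the family score, regardless of its test set size. A short sketch, where scores and family_of are hypothetical inputs mapping language to accuracy and to family, respectively:

from collections import defaultdict

def macro_average(scores, family_of):
    """Unweighted mean of per-language scores within each family."""
    by_family = defaultdict(list)
    for lang, accuracy in scores.items():
        by_family[family_of[lang]].append(accuracy)
    return {fam: sum(accs) / len(accs) for fam, accs in by_family.items()}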

Overview paper

In the overview paper for the shared task, we will compare the performance of submitted systems in detail. We will evaluate:

  • which systems are significantly different in performance, especially in low-resource scenarios
  • which examples were hard, and which types of systems succeeded on them
  • which systems would provide complementary benefit in an ensemble system

Baselines

The organizers will provide two pre-trained baselines for the participants. Their use is optional; they are intended to help the participants develop their own models faster.

Non-Neural Baseline

The first baseline is a non-neural system that has been used as a baseline in earlier shared tasks on morphological reinflection (Cotterell et al., 2017; Cotterell et al., 2018). The system first heuristically extracts lemma-to-form transformations; it assumes that these transformations are suffix- or prefix-based. A simple majority classifier is used to apply the most frequent suitable transformation to an input lemma, given the morphological tag, yielding the output form. Please see Cotterell et al. (2017) for further details.
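
The following Python sketch is a simplified reconstruction of that idea (suffix rules only, with a majority vote per tag); the actual baseline also handles prefix-based transformations and is more careful, so see Cotterell et al. (2017) for the real system. Here `train` is a hypothetical list of (lemma, form, tag) training triples.

from collections import Counter, defaultdict

def suffix_rule(lemma, form):
    """The longest common prefix determines a suffix rewrite rule."""
    i = 0
    while i < min(len(lemma), len(form)) and lemma[i] == form[i]:
        i += 1
    return lemma[i:], form[i:]   # e.g. ("walk", "walked") -> ("", "ed")

rules = defaultdict(Counter)     # tag -> counts of observed rewrite rules
for lemma, form, tag in train:
    rules[tag][suffix_rule(lemma, form)] += 1

def inflect(lemma, tag):
    # Apply the most frequent rule for this tag that fits the lemma.
    for (old, new), _ in rules[tag].most_common():
        if lemma.endswith(old):
            return lemma[:len(lemma) - len(old)] + new
    return lemma                 # back off to returning the lemma unchanged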

Neural Baseline

The second baseline is a multilingual transformer (Vaswani et al., 2017). The version of this model adapted for character-level tasks currently holds the state of the art on the 2017 SIGMORPHON shared task data. The transformer takes the lemma and morphological tags as input and outputs the target inflection. Given the low-resource setup, a single model will be trained on all languages. Additionally, we consider the data augmentation technique used by Anastasopoulos and Neubig (2019) as another baseline.
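
To give a sense of the input/output contract, the sketch below shows one plausible character-level serialization of a training example (hypothetical, not the official baseline preprocessing): tags become single tokens, while the lemma and form are split into characters.

def serialize(lemma, tag, form=None):
    """Build source/target token sequences for a character-level seq2seq model."""
    src = tag.split(";") + list(lemma)              # source token sequence
    tgt = list(form) if form is not None else None  # target character sequence
    return src, tgt

# The Akan example from the Data section:
# serialize("bisa", "V;PRS;HAB;NEG", "mmbisa")
# -> (["V", "PRS", "HAB", "NEG", "b", "i", "s", "a"],
#     ["m", "m", "b", "i", "s", "a"])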

Submission

Participants will be asked to submit a tarball of their models’ predictions to sigmorphon2020sharedtask@gmail.com at the end of the task on May 4, 2020 (see the timeline above). The exact file format will be clarified one week in advance on the shared task’s mailing list. A team (a group of participants) may submit predictions from as many models as they would like; each submission will be scored separately. Submissions must specify whether they are (1) unconstrained (use external resources) or (2) constrained (use only the data from our released splits). For ranking purposes, we will aggregate over all test languages, so participants are encouraged to submit predictions for all languages. Results will be announced in a public Google sheet a few days after submission.

Participants’ system description papers will be handled through softconf. Papers can be submitted at https://www.softconf.com/acl2020/SIGMORPHON/

References

Anastasopoulos and Neubig. “Pushing the Limits of Low-Resource Morphological Inflection.” Proceedings of EMNLP 2019.

Cotterell et al. “CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages.” Proceedings of the CoNLL-SIGMORPHON 2017 Shared Task.

Cotterell et al. “The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection.” Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection.

Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Proceedings of NAACL 2019.

Vaswani et al. “Attention is All You Need.” Proceedings of NeurIPS 2017.

Introduction into morphology:
    Haspelmath, M. (2002). Understanding Morphology. Oxford University Press,
       USA.
    Aronoff, M., & Fudeman, K. (2011). What is Morphology? (Vol. 8). John Wiley & Sons.

More detailed studies on morphological typology:
    Baerman, M. (Ed.). (2015). The Oxford Handbook of Inflection. Oxford University Press,
       USA.
    Song, J. J. (2014). Linguistic Typology: Morphology and Syntax. Routledge.
    Song, J. J. (Ed.). (2010). The Oxford Handbook of Linguistic Typology. Oxford University
       Press, USA.
    Malchukov, A., & Spencer, A. (Eds.). (2008). The Oxford Handbook of Case. Oxford University
       Press, USA.

Language-specific descriptions:
    You may find more detailed information on some languages here: https://langsci-press.org/series

Contact

Point of Contact: Ekaterina Vylomova
Discussion: Task 0 Google Group
Submission: sigmorphon2020sharedtask@gmail.com

Task Organization

Logistics

Adina Williams (Facebook AI Research NYC, USA)
Christo Kirov (Google Research NYC, USA)
Ekaterina Vylomova (University of Melbourne, Australia)
Eleanor Chodroff (University of York, UK)
Elizabeth Salesky (Johns Hopkins University, USA)
Mans Hulden (University of Colorado Boulder, USA)
Miikka Silfverberg (University of British Columbia, Canada)
Ryan Cotterell (ETH Zürich, Switzerland)
Sabrina Mielke (Johns Hopkins University, USA)
Shijie Wu (Johns Hopkins University, USA)

Data Annotation

Andrej Krizhanovsky (Karelian Research Centre, Russia)
Antonios Anastasopoulos (Carnegie Mellon University, USA)
Edoardo Ponti (University of Cambridge, UK)
Elena Klyachko (National Research University Higher School of Economics, Russia)
Ilya Yegorov (Lomonosov Moscow State University, Russia)
Irene Nikkarinen (University of Cambridge, UK)
Jennifer White (University of Cambridge, UK)
Josef Valvoda (University of Cambridge, UK)
Kyle Estment (University of Cambridge, UK)
Lucas Torroba Hennigen (University of Cambridge, UK)
Natalia Krizhanovsky (Karelian Research Centre, Russia)
Paula Czarnowska (University of Cambridge, UK)
Ran Zmigrod (University of Cambridge, UK)
Rowan Hall Maudslay (University of Cambridge, UK)
Svetlana Toldova (National Research University Higher School of Economics, Russia)
Tiago Pimentel (University of Cambridge, UK)