Task 0: Typologically Diverse Morphological Inflection

SIGMORPHON’s fifth installment of its inflection generation shared task focuses on languages that are typologically distinct from the languages in our previous tasks. Many of these languages are extremely low-resource. In this edition, we are specifically interested in the ability of inflection generation systems to generalize to new languages, including typologically distinct ones. For example, if you have a neural network architecture that works well for a sample of Indo-European languages, should you expect the same architecture to also work well for Tupi–Guarani languages (where nouns are “declined” for tense)? The organizers suspect not, but you could prove us wrong!

Important Links

Register for the shared task using our Google form.
Please join our Google group to stay up to date.
Download the Data!
Download a Baseline System!
Baseline Numbers.

Shared Task Description

In this shared task, participants will design a model that learns to generate morphological inflections from a lemma and a set of morphosyntactic features of the target form. Each language in the task has its own training, development, and test splits. Training and development splits contain triples, each consisting of a lemma, a target form, and a set of morphological features, provided in the UniMorph format (the “Data” section below provides an example of input format). Test splits only provide lemmas and morphological tags: your model will need to predict the missing target form.

The model should be general enough to work for natural languages of any typological patterning.¹ For example, Tagalog verbs exhibit circumfixation; thus, a model with a strong inductive bias towards suffixing will likely not work well for Tagalog. The task will proceed in three phases: a Development Phase, a Generalization Phase, and an Evaluation Phase. As the task progresses, more data and more languages will be released.

In the Development Phase, we will provide training and development splits from the Austronesian, Niger-Congo, Oto-Manguean, Uralic and Indo-European language families that should be used to develop your system. We will refer to them as the development languages. See the table below for a complete list.

In the Generalization Phase, we will provide training and development splits for new languages where approximately half are genetically related (belong to the same family) and half are genetically unrelated (are isolates or belong to a different family) to the development languages. We will keep the languages in the Generalization Phase a surprise until April 2020 (see timeline). We will also keep the genetically unrelated language families a surprise, though some languages will come from the same families as those in the Development Phase.

In the Evaluation Phase, the participants’ models will be evaluated on held-out forms from all of the languages from the previous phases. The languages from the Development Phase and the Generalization Phase are evaluated simultaneously. The only difference is that there has been more time to construct a model for those languages released in the Development Phase. It follows that a model could easily overfit to or favor phenomena that are more frequent in languages presented in the Development Phase, especially if parameters are shared across languages. For instance, a model based on the morphological patterning of the Indo-European languages may end up with a bias towards suffixing and struggle to learn prefixing or circumfixation; the extent of this bias only becomes apparent when experimenting on other languages whose inflectional morphology patterns differently. Of course, the model architecture itself could explicitly or implicitly favor certain word formation types (suffixing, prefixing, etc.).

¹ See References

Glossary

The shared task features both held-out morphological inflection triples and surprise languages. The organizers created a short glossary of task-specific terminology for clarity. We use the language described in this glossary unambiguously throughout the task description.

Development language: A language for which the participants will have an elongated period of time (about two months) to construct a machine learning model for morphological inflection generation.
Surprise language: A language for which the participants will have a short period of time (about one week) to construct or adapt a machine learning model for morphological inflection generation. The idea is that the participants apply the knowledge accrued from the Development Phase to choose a good model and good hyperparameters. Most of the surprise languages will be typologically distinct from the development languages, i.e. the majority of the languages will be taken from language families other than the ones used during the Development Phase.
Training split: A selection of lemma–form–tag triples for a language (either development or surprise) that the participants may train their machine learning model on.
Development split: A selection of lemma–form–tag triples for a language (either development or surprise) that the participants may tune the hyperparameters of their machine learning model on.
Test split: A selection of lemma–tag pairs for a language (either development or surprise) for which the participants will predict target forms. The organizers will evaluate the models based on these predictions.

Timeline

Stage 1: Development Phase

February 24th, 2020: Training and development splits for development languages released; we invite participants to report errors.
February 24th, 2020: Neural and non-neural baselines for development languages released.
March 1st, 2020: Development language data are frozen.

Stage 2: Generalization Phase

~~April 13th~~ April 20, 2020: Training and development splits for surprise languages released.
(This is not a zero-shot learning task. Participants will be given training data for all languages.)

Stage 3: Evaluation Phase

~~April 20th~~ April 27, 2020: Test splits for all languages (both development and surprise) released.
~~April 27th~~ May 4, 2020: Participants submit test predictions on all languages.

Stage 4: Write-up Phase

~~May 11th~~ May 17, 2020: Participants’ system description papers due.
~~May 18th~~ May 24, 2020: Participants’ system description papers camera ready due.

Data

The training and development data are provided in a simple UTF-8 encoded text format for both the development and surprise languages. Each line in a file is one example, consisting of word forms and the corresponding morphosyntactic description (MSD), given as a set of features separated by semicolons. We refer to the MSDs as (morphological) tags for simplicity. The fields on a line are TAB-separated. The fields are: lemma, target form, tag. Here we present an example from the Akan training data (the Akan verb “bisa” means “to ask” in English):

bisa     mmbisa     V;PRS;HAB;NEG

In the training data, we give all three fields. In the test phase, we omit field 2.
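
As an illustration of the format described above, here is a minimal Python sketch of a line parser. The names `Example` and `parse_line` are hypothetical and not part of any released tooling, and the sketch assumes that test lines simply contain the two remaining TAB-separated fields (lemma and tag).

```python
from typing import List, NamedTuple, Optional


class Example(NamedTuple):
    lemma: str
    form: Optional[str]  # None for test lines, where the target form is omitted
    tags: List[str]      # individual features from the semicolon-separated tag


def parse_line(line: str) -> Example:
    """Parse one TAB-separated line: lemma, [target form,] tag."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:
        lemma, form, tag = fields
    elif len(fields) == 2:
        # Assumed test layout: field 2 (the target form) is left out.
        lemma, tag = fields
        form = None
    else:
        raise ValueError(f"expected 2 or 3 TAB-separated fields, got {len(fields)}")
    return Example(lemma, form, tag.split(";"))


train_example = parse_line("bisa\tmmbisa\tV;PRS;HAB;NEG")
test_example = parse_line("bisa\tV;PRS;HAB;NEG")
```

Splitting the tag on semicolons makes it easy to treat each feature (V, PRS, HAB, NEG) as a separate input symbol.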

We will provide varying amounts of labeled training data, depending on the language, to assess models’ ability to generalize to novel forms. We will also provide information about each language’s family and sub-family, as well as WALS features, which participants may optionally use. For each language, the possible inflections are taken from a finite set of morphological tags, presented in the UniMorph schema.

Development Languages

The task features 90 languages in total.² 45 of these 90 languages are development languages. The development languages will come from five language families: Austronesian, Niger–Congo, Uralic, Oto-Manguean, and Indo-European. We list each of the development languages with its family and genus (subfamily) below.

Language	ISO 639-3	Family	Genus	# Train	# Dev
Malagasy	mlg	Austronesian	Barito	447	62
Cebuano	ceb	Austronesian	Greater Central Philippine	420	58
Hiligaynon	hil	Austronesian	Greater Central Philippine	859	116
Tagalog	tgl	Austronesian	Greater Central Philippine	1870	236
Maori	mao	Austronesian	Oceanic	145	21
Danish	dan	Indo-European	North Germanic	17852	2550
Icelandic	isl	Indo-European	North Germanic	53841	7690
Norwegian Bokmål	nob	Indo-European	North Germanic	13263	1929
Swedish	swe	Indo-European	North Germanic	54888	7840
Dutch	nld	Indo-European	West Germanic	38826	5547
English	eng	Indo-European	West Germanic	80865	11553
German	deu	Indo-European	West Germanic	99405	14201
Middle High German	gmh	Indo-European	West Germanic	496	71
North Frisian	frr	Indo-European	West Germanic	1902	224
Old English	ang	Indo-European	West Germanic	29270	4122
Chewa	nya	Niger-Congo	Bantu	3059	429
Kongo	kon	Niger-Congo	Bantu	568	76
Lingala	lin	Niger-Congo	Bantu	159	23
Luganda	lug	Niger-Congo	Bantu	3420	489
Sotho	sot	Niger-Congo	Bantu	345	50
Swahili	swa	Niger-Congo	Bantu	3374	469
Zulu	zul	Niger-Congo	Bantu	322	42
Akan	aka	Niger-Congo	Kwa	2793	380
Gã	gaa	Niger-Congo	Kwa	607	79
Tlatepuzco Chinantec	cpa	Oto-Manguean	Chinantecan	5298	727
San Pedro Amuzgo Amuzgos	azg	Oto-Manguean	Amuzgo-Mixtecan	8482	1188
Yoloxóchitl Mixtec	xty	Oto-Manguean	Amuzgo-Mixtecan	2110	299
Chichicapan Zapotec	zpv	Oto-Manguean	Popolocan-Zapotecan	805	113
Yaitepec Chatino	ctp	Oto-Manguean	Popolocan-Zapotecan	2397	313
Zenzontepec Chatino	czn	Oto-Manguean	Popolocan-Zapotecan	1088	154
Eastern Highland Chatino	cly	Oto-Manguean	Popolocan-Zapotecan	3301	471
Eastern Highland Otomi	otm	Oto-Manguean	Oto-Pamean	21533	3020
Mezquital Otomi	ote	Oto-Manguean	Oto-Pamean	22962	3231
Chichimec	pei	Oto-Manguean	Oto-Pamean	10017	1349
Estonian	est	Uralic	Finnic	26728	3820
Finnish	fin	Uralic	Finnic	99403	14201
Ingrian	izh	Uralic	Finnic	763	112
Karelian	krl	Uralic	Finnic	80216	11225
Livonian	liv	Uralic	Finnic	2787	398
Veps	vep	Uralic	Finnic	94395	13320
Votic	vot	Uralic	Finnic	1003	146
Meadow Mari	mhr	Uralic	Mari	71143	10081
Erzya	myv	Uralic	Mordvinic	74929	10738
Moksha	mdf	Uralic	Mordvinic	46362	6633
Northern Sami	sme	Uralic	Sami	43877	6273

²The organizers may increase the number of total languages, if annotation efforts allow.

Surprise Languages

The remaining 45 of these 90 languages will be surprise languages. The shared task organizers will provide the participants with enough time (about a week according to the current timeline) to train a model that they have previously selected on the development languages. However, there will not be enough time for choosing a new model or extensive hyperparameter tuning.

Which languages? ~~They’re a surprise!~~ Surprise no longer!

Language	ISO 639-3	Family	# Train	# Dev
Maltese	mlt	Afro-Asiatic	1233	176
Oromo	orm	Afro-Asiatic	1424	203
Classical Syriac	syc	Afro-Asiatic	1917	275
Cree	cre	Algic	4571	584
Murrinh-Patha	mwf	Australian	777	111
Kannada	kan	Dravidian	3670	524
Telugu	tel	Dravidian	952	136
Middle Low German	gml	Germanic	890	127
Swiss German	gsw	Germanic	1345	192
Norwegian Nynorsk	nno	Germanic	10101	1443
Bengali	ben	Indo-Aryan	2816	402
Hindi	hin	Indo-Aryan	36300	5186
Sanskrit	san	Indo-Aryan	22968	3188
Urdu	urd	Indo-Aryan	8486	1213
Persian	fas	Iranian	25225	3603
Pushto; Pashto	pus	Iranian	4861	695
Tajik	tgk	Iranian	53	8
Shona	sna	Niger-Congo	1897	246
Zarma	dje	Nilo-Saharan	56	9
Asturian	ast	Romance	5096	728
Catalan	cat	Romance	51944	7421
Middle French	frm	Romance	24612	3516
Friulian	fur	Romance	5408	772
Galician	glg	Romance	24087	3441
Ladin	lld	Romance	5073	725
Venetian	vec	Romance	12203	1743
Anglo-Norman	xno	Romance	178	26
Tibetan	bod	Sino-Tibetan	3428	466
Dakota	dak	Siouan	2636	376
Evenki	evn	Tungusic	5413	774
Azerbaijani	aze	Turkic	5602	801
Bashkir	bak	Turkic	8517	1217
Crimean Tatar; Crimean Turkish	crh	Turkic	5215	745
Kazakh	kaz	Turkic	7852	1063
Kyrgyz	kir	Turkic	3855	547
Khakas	kjh	Turkic	840	120
Turkmen	tuk	Turkic	20963	2992
Uighur; Uyghur	uig	Turkic	5372	750
Uzbek	uzb	Turkic	25199	3596
Komi-Zyrian	kpv	Uralic	57919	8263
Ludian	lud	Uralic	294	41
Livvi	olo	Uralic	43936	6260
Udmurt	udm	Uralic	88774	12665
Võro	vro	Uralic	357	51
Papago (O'odham)	ood	Uto-Aztecan	1123	160

Multilingual Modeling [Recommendation]

Many of the languages in the shared task have only a few morphological forms annotated. In most cases, this is not because we have a larger stash that we are withholding, but rather because there is no resource known to the organizers for such data. To model these low-resource languages well, the organizers recommend a multilingual approach that exploits the genetic similarity within the development and surprise languages provided. For instance, if we only give the participants a handful of lemma–form–tag triples for, say, Kongo, generalization will be difficult without using data from related Niger-Congo languages.

Restrictions

Additional UniMorph (and ICGI) data beyond what is provided is not allowed for model training. There are no other restrictions about what sort of data you can use for this task. For example, if you would like to use a large, unlabeled corpus, such as Wikipedia, that is acceptable. You may also use a pre-trained language model, e.g. BERT (Devlin et al. 2019). However, we will evaluate models in two different categories: (1) those that use external resources (beyond what is provided by the task), and (2) those that do not. The constrained data category (2) will be restricted to monolingual models, while category (1) may include multilingual models – we encourage you to be creative! Participants are asked to clearly specify the submission category.

Evaluation

Our shared task also comes with a somewhat novel experimental design. We will simultaneously evaluate models for both the Development languages, whose training and development sets will be available for an elongated period of time, and the Surprise languages, whose training and development sets will only be available for a short time prior to submission, which precludes extensive tuning. To be officially ranked, you must submit results for all evaluation languages. Thus, to succeed, your class of models (e.g. neural sequence-to-sequence models or weighted finite-state transducers with hand-crafted features) must generalize well to the group of Surprise languages that are typologically distinct from the Development languages you performed model selection on. To repeat: This is not a zero-shot learning task, but rather our evaluation set-up is designed to test the inherent inductive bias in the participants’ chosen model class. We attribute the inspiration for this experimental design to Emily Bender, who often advocates for such positions.

We will simultaneously evaluate the accuracy on held-out forms for the following three categories of languages separately: 1) held-out forms from the Development languages, 2) held-out forms from genetically related Surprise languages, and 3) held-out forms from genetically unrelated Surprise languages. This tripartite split should give the field insight into how reliable the performance of certain classes of models is on typologically distinct languages. It should also help answer the following question: If my model class works well when trained on English and many others, will the same model class work well on languages which exhibit linguistic characteristics distinct from English?

As mentioned in Restrictions above, we will evaluate submissions in two categories: monolingual, constrained data models, and unconstrained – the world is your oyster!

Evaluation Script

We will distribute an evaluation script for your use on the development data. The script will report:

Accuracy = fraction of correctly predicted forms
Average Levenshtein distance between the prediction and the truth across all predictions

The official evaluation script that we will use for our internal evaluation will be provided here. We encourage ablation studies to measure the advantage gained from particular innovations. You should perform these studies on the development data and report the findings in your system description paper.
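
For orientation, the two reported metrics can be sketched in a few lines of Python. This is not the official script; the function names `levenshtein` and `evaluate` are illustrative, and the sketch assumes unit costs for insertions, deletions, and substitutions at the character level.

```python
from typing import Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance with unit costs (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def evaluate(gold: Sequence[str], pred: Sequence[str]) -> Tuple[float, float]:
    """Return (accuracy, average Levenshtein distance) over all predictions."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    correct = sum(g == p for g, p in zip(gold, pred))
    total_distance = sum(levenshtein(g, p) for g, p in zip(gold, pred))
    return correct / len(gold), total_distance / len(gold)


accuracy, avg_distance = evaluate(["mmbisa", "bisa"], ["mmbisa", "bisaa"])
```

In this toy call, one of two predictions is exact and the other is one edit away, so both metrics come out to 0.5.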

Averaging

We will evaluate each language separately. A per-family aggregate evaluation will weight all languages in a family (see above) equally, i.e., macro-averaging, including the languages released later during the Generalization phase.
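
A minimal sketch of the per-family macro-average follows. The function name `family_macro_average` is hypothetical, and the accuracy values in the example are made up purely to show the arithmetic.

```python
from collections import defaultdict
from typing import Dict


def family_macro_average(accuracy: Dict[str, float],
                         family: Dict[str, str]) -> Dict[str, float]:
    """Average per-language accuracies within each family, weighting
    every language equally regardless of its number of test examples."""
    per_family = defaultdict(list)
    for lang, acc in accuracy.items():
        per_family[family[lang]].append(acc)
    return {fam: sum(accs) / len(accs) for fam, accs in per_family.items()}


# Made-up accuracies, for illustration only.
example_accuracy = {"tgl": 0.5, "ceb": 1.0, "fin": 0.75}
example_family = {"tgl": "Austronesian", "ceb": "Austronesian", "fin": "Uralic"}
result = family_macro_average(example_accuracy, example_family)
```

Because each language counts once, a small language such as Cebuano influences the family score as much as a large one such as Tagalog.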

Overview paper

In the overview paper for the shared task, we will compare the performance of submitted systems in detail. We will evaluate:

which systems are significantly different in performance, especially in low-resource scenarios
which examples were hard, and which types of systems succeeded on them
which systems would provide complementary benefit in an ensemble system

Baselines

The organizers will provide two pre-trained baselines for the participants’ consumption. Their use is optional and provided to help the participants develop their own models faster.

Non-Neural Baseline

The first baseline is a non-neural system that has been used as a baseline in earlier shared tasks on morphological reinflection (Cotterell et al., 2017; Cotterell et al., 2018). The system first heuristically extracts lemma-to-form transformations; it assumes that these transformations are suffix- or prefix-based. A simple majority classifier is used to apply the most frequent suitable transformation to an input lemma, given the morphological tag, yielding the output form. Please see Cotterell et al. (2017) for further details.
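
To give a feel for this kind of system, here is a deliberately simplified Python sketch. It is not the baseline from Cotterell et al. (2017): it extracts suffix rules only (the described baseline also considers prefixes), and the names `extract_rule`, `train`, and `predict` are hypothetical.

```python
from collections import Counter, defaultdict
from os.path import commonprefix
from typing import Dict, Iterable, Tuple

Rule = Tuple[str, str]  # (suffix removed from the lemma, suffix added)


def extract_rule(lemma: str, form: str) -> Rule:
    """Split off the longest common prefix and keep the differing suffixes."""
    stem = commonprefix([lemma, form])
    return lemma[len(stem):], form[len(stem):]


def train(triples: Iterable[Tuple[str, str, str]]) -> Dict[str, Counter]:
    """Count how often each suffix rule occurs for each morphological tag."""
    rules: Dict[str, Counter] = defaultdict(Counter)
    for lemma, form, tag in triples:
        rules[tag][extract_rule(lemma, form)] += 1
    return rules


def predict(rules: Dict[str, Counter], lemma: str, tag: str) -> str:
    """Apply the most frequent rule for this tag that fits the lemma."""
    for (old, new), _count in rules.get(tag, Counter()).most_common():
        if lemma.endswith(old):
            return lemma[:len(lemma) - len(old)] + new
    return lemma  # no applicable rule: fall back to copying the lemma


model = train([("walk", "walked", "V;PST"),
               ("talk", "talked", "V;PST"),
               ("bake", "baked", "V;PST")])
prediction = predict(model, "jump", "V;PST")
```

A suffix-only rule set like this cannot represent prefixation or circumfixation at all, which is one concrete way an inductive bias towards suffixing shows up.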

Neural Baseline

The second baseline is a multilingual transformer (Vaswani et al., 2017). The version of this model adapted for character-level tasks currently holds the state of the art on the 2017 SIGMORPHON shared task data. The transformer takes the lemma and morphological tags as input and outputs the target inflection. Given the low-resource setup, a single model will be trained on all languages. Additionally, we consider the data augmentation technique used by Anastasopoulos and Neubig (2019) as another baseline.
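
One common way to feed a lemma and its tags to a character-level sequence-to-sequence model is to flatten them into a single token sequence. The sketch below shows that idea only; the token layout, the language token, and the names `encode_source` and `encode_target` are assumptions for illustration, not the baseline's actual input format.

```python
from typing import List


def encode_source(lemma: str, tag: str, lang: str) -> List[str]:
    """Build the source sequence: a language token, one token per
    morphological feature, then the lemma's characters."""
    lang_token = f"<{lang}>"  # assumed marker so one model can serve all languages
    tag_tokens = [f"<{feature}>" for feature in tag.split(";")]
    return [lang_token] + tag_tokens + list(lemma)


def encode_target(form: str) -> List[str]:
    """The target sequence is simply the characters of the inflected form."""
    return list(form)


source = encode_source("bisa", "V;PRS;HAB;NEG", "aka")
target = encode_target("mmbisa")
```

Wrapping features in angle brackets keeps them distinct from ordinary characters in a shared vocabulary.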

Submission

Participants will be asked to submit a tarball of their models’ predictions to sigmorphon2020sharedtask@gmail.com at the end of the task on May 4, 2020 (see Timeline). The exact file format will be clarified one week in advance on the shared task’s mailing list. A team (group of participants) may submit predictions from as many models as they would like. Each submission will be scored separately. Submissions must specify whether they are (1) unconstrained (use external resources) or (2) constrained (use only the data from our released splits). For evaluation purposes, we will rank using an aggregate of all test languages, so participants are encouraged to submit for all languages. Results will be announced in a public Google sheet a few days after submission.

Participants’ system description papers will be handled through softconf. Papers can be submitted at https://www.softconf.com/acl2020/SIGMORPHON/

References

Anastasopoulos and Neubig. “Pushing the Limits of Low-Resource Morphological Inflection.” Proceedings of EMNLP 2019.

Cotterell et al. “CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages.” Proceedings of the CoNLL-SIGMORPHON 2017 Shared Task.

Cotterell et al. “The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection.” Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection.

Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Proceedings of NAACL 2019.

Vaswani et al. “Attention is All You Need.” Proceedings of NeurIPS 2017.

Introductions to morphology:
    Haspelmath, M. (2002). Understanding Morphology. Oxford University Press, USA.
    Aronoff, M., & Fudeman, K. (2011). What is Morphology? (Vol. 8). John Wiley & Sons.

More detailed studies on morphological typology:
    Baerman, M. (Ed.). (2015). The Oxford Handbook of Inflection. Oxford University Press, USA.
    Song, J. J. (2014). Linguistic Typology: Morphology and Syntax. Routledge.
    Song, J. J. (2010). The Oxford Handbook of Linguistic Typology. Oxford University Press, USA.
    Malchukov, A. and Spencer, A. (2008). The Handbook of Case. Oxford University Press, USA.

Language-specific descriptions:
You may find more detailed information on some languages here: https://langsci-press.org/series

Contact

Point of Contact: Ekaterina Vylomova
Discussion: Task 0 Google Group
Submission: sigmorphon2020sharedtask@gmail.com

Task Organization

Logistics

Adina Williams (Facebook AI Research NYC, USA)
Christo Kirov (Google Research NYC, USA)
Ekaterina Vylomova (University of Melbourne, Australia)
Eleanor Chodroff (University of York, UK)
Elizabeth Salesky (Johns Hopkins University, USA)
Mans Hulden (University of Colorado Boulder, USA)
Miikka Silfverberg (University of British Columbia, Canada)
Ryan Cotterell (ETH Zürich, Switzerland)
Sabrina Mielke (Johns Hopkins University, USA)
Shijie Wu (Johns Hopkins University, USA)

Data Annotation

Andrej Krizhanovsky (Karelian Research Centre, Russia)
Antonios Anastasopoulos (Carnegie Mellon University, USA)
Edoardo Ponti (University of Cambridge, UK)
Elena Klyachko (National Research University Higher School of Economics, Russia)
Ilya Yegorov (Lomonosov Moscow State University, Russia)
Irene Nikkarinen (University of Cambridge, UK)
Jennifer White (University of Cambridge, UK)
Josef Valvoda (University of Cambridge, UK)
Kyle Estment (University of Cambridge, UK)
Lucas Torroba Hennigen (University of Cambridge, UK)
Natalia Krizhanovsky (Karelian Research Centre, Russia)
Paula Czarnowska (University of Cambridge, UK)
Ran Zmigrod (University of Cambridge, UK)
Rowan Hall Maudslay (University of Cambridge, UK)
Svetlana Toldova (National Research University Higher School of Economics, Russia)
Tiago Pimentel (University of Cambridge, UK)