Task 2: Morphological Analysis and Lemmatization in Context

The second task offered this year is contextual morphological analysis and lemmatization. Given a sentence, your system must produce the lemma and morphosyntactic description (MSD) of each word.

For instance, when given this entry:

# sent-id = 1
# text = They buy and sell books.
They   _	_	_	_	_	_	_	_
buy    _	_	_	_	_	_	_	_
and    _	_	_	_	_	_	_	_
sell   _	_	_	_	_	_	_	_
books  _	_	_	_	_	_	_	_
.      _	_	_	_	_	_	_	_

Your system must produce the following:

# sent-id = 1
# text = They buy and sell books.
They   they	_	_	N;NOM;PL    _	_	_	_
buy    buy	_	_	V;SG;1;PRS	_	_	_	_
and    and	_	_	CONJ        _	_	_	_
sell   sell	_	_	V;PL;3;PRS  _	_	_	_
books  book	_	_	N;PL        _	_	_	_
.      .	_	_	PUNCT       _	_	_	_

Your system need not reproduce comment lines (those that begin with a hash mark, ‘#’), and it may also emit as many comment lines as you like; neither choice affects the evaluation. However, each sentence, including the last, must be followed by exactly one blank line, in accordance with the CoNLL-U standard. Blank lines must not appear within a sentence.
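The blank-line rules can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the official tooling: the function names read_sentences and format_sentences are hypothetical, and the code assumes each token occupies one line of the input.

```python
from typing import Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each sentence as a list of token lines, skipping '#' comments."""
    sentence: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            continue  # comment lines are optional, so they are dropped here
        if not line.strip():
            if sentence:  # a blank line closes the current sentence
                yield sentence
                sentence = []
            continue
        sentence.append(line)
    if sentence:  # tolerate input that lacks the final blank line
        yield sentence


def format_sentences(sentences: Iterable[List[str]]) -> str:
    """Serialize sentences so that each one, including the last,
    is followed by exactly one blank line."""
    return "".join("\n".join(tokens) + "\n\n" for tokens in sentences)
```

Writing the result of format_sentences to a file gives output in which no blank line falls inside a sentence and the final sentence is still terminated by one.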

Data

The data come from the Universal Dependencies project, and the MSDs have been converted to the UniMorph schema.

Sentences are annotated in the ten-column CoNLL-U format. All columns except ID, FORM, LEMMA, and FEATS are nulled out (i.e., replaced with the underscore ‘_’). In the test data used at inference time, the LEMMA and FEATS columns are nulled out as well. Your system must print the data in CoNLL-U format, filling in these two columns.

The ID column gives each word a unique ID within the sentence.
The FORM column gives the word as it appears in the sentence.
The LEMMA column contains the form’s lemma.
The FEATS column contains morphosyntactic features in the UniMorph schema.
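As a rough illustration of how these four columns sit within a token line, the sketch below fills LEMMA and FEATS in a tab-separated ten-column line. It assumes the standard CoNLL-U column order (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the helper name fill_token is hypothetical, and the lemma and MSD would come from your own system.

```python
# Zero-based positions of the four relevant columns, assuming the
# standard CoNLL-U order of the ten tab-separated fields.
ID, FORM, LEMMA, FEATS = 0, 1, 2, 5
NUM_COLUMNS = 10


def fill_token(line: str, lemma: str, feats: str) -> str:
    """Return the token line with its LEMMA and FEATS columns filled in."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != NUM_COLUMNS:
        raise ValueError(f"expected {NUM_COLUMNS} columns, got {len(cols)}")
    cols[LEMMA] = lemma
    cols[FEATS] = feats
    return "\t".join(cols)


# Example: a nulled-out test line for the word "books".
blank = "\t".join(["5", "books"] + ["_"] * 8)
filled = fill_token(blank, "book", "N;PL")
```

All other columns are passed through unchanged, so the ID and FORM values from the input are preserved in the output.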

Data are available here.

Evaluation

We will score each system on two measures for each of the two subtasks:

Contextual Morphological Analysis
- 0/1 accuracy of MSD (the FEATS column)
- Micro-averaged F1 score for MSDs (the FEATS column)
Contextual Lemmatization
- 0/1 accuracy of lemmata (the LEMMA column)
- Average Levenshtein distance of lemmata (the LEMMA column)
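
For intuition, the four measures can be approximated as follows. This is a minimal sketch and not the official evaluation script, which remains authoritative. In particular, it assumes that the micro-averaged F1 is computed over the individual ‘;’-separated features of each MSD, pooled across all tokens; the function names are hypothetical.

```python
from typing import List, Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """0/1 accuracy: the fraction of tokens whose prediction matches exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def micro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Micro-averaged F1 over MSD features, pooling counts across tokens.

    Assumption: each MSD is treated as a set of ';'-separated features.
    """
    true_pos = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_set, p_set = set(g.split(";")), set(p.split(";"))
        true_pos += len(g_set & p_set)
        n_gold += len(g_set)
        n_pred += len(p_set)
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / n_pred, true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev: List[int] = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def mean_levenshtein(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Average edit distance between gold and predicted lemmata."""
    return sum(levenshtein(g, p) for g, p in zip(gold, pred)) / len(gold)
```

Under these assumptions, a prediction of N;SG for a gold MSD of N;PL scores zero on accuracy but still earns partial credit on F1, since the N feature is shared. Likewise, a lemma that is one character off counts as an error for accuracy but adds only 1 to the Levenshtein total.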

We will distribute our evaluation script so you can test your systems. If you find errors in the script, contact us at sigmorphon+sharedtask2019@gmail.com.

Pretrained Systems

Non-neural models using Lemming are available here. Pretrained models for a strong, neural baseline are also available here, and scores will be given in the README.