In this task, participants will create computational models that map a sequence of “graphemes”—characters—representing a word to a transcription of that word’s pronunciation.
This task is an important part of speech technologies including recognition and synthesis.
Data
Source
The data is primarily extracted from Wiktionary using the wikipron library (Lee et al. in press).
Languages
We initially provide data for 10 languages:
- Armenian (arm)
- Bulgarian (bul)
- French (fre)
- Georgian (geo)
- Hindi (hin)
- Hungarian (hun)
- Icelandic (ice)
- Korean (kor)
- Lithuanian (lit)
- Modern Greek (gre)
Update (2020-04-20): the surprise languages are now announced. They are:
- Adyghe (ady)
- Dutch (dut)
- Japanese hiragana (jpn)
- Romanian (rum)
- Vietnamese (vie)
Baseline results
Results for the three baselines will be posted here as they become available.
Official submission results
Results for submitted systems are available here.
Size
For each language, there are 3,600 training examples, 450 development examples, and 450 test examples.
Format
Training and development data are UTF-8-encoded tab-separated values files. Each
example occupies a single line and consists of a grapheme sequence (a sequence
of NFC Unicode codepoints), a tab character, and the corresponding phone
sequence, a roughly phonemic IPA transcription tokenized using the segments
library (Moran & Cysouw 2018). The following shows three lines of Romanian data:
antonim a n t o n i m
ploaie p lʷ a j e
pornește p o r n e ʃ t e
Test files consist of a single column, containing grapheme sequences.
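As an illustration of the format described above, the following is a minimal sketch of a reader for the two-column files. The function name `read_examples` and the inline sample are hypothetical and are not part of the provided tooling; the sketch assumes phone tokens are separated by spaces, as in the Romanian example.

```python
import unicodedata
from typing import Iterable, Iterator, List, Tuple


def read_examples(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yields (grapheme string, list of phone tokens) pairs from TSV lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # Skips blank lines.
        graphemes, phones = line.split("\t", 1)
        # The data is described as NFC already; normalizing again is harmless.
        graphemes = unicodedata.normalize("NFC", graphemes)
        yield graphemes, phones.split()


# Hypothetical in-memory sample; real use would iterate over a file opened
# with encoding="utf-8".
sample = ["antonim\ta n t o n i m\n", "pornește\tp o r n e ʃ t e\n"]
examples = list(read_examples(sample))
```

Keeping the phones as a list of tokens, rather than a single string, makes it straightforward to compute token-level edit distances later.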
Please provide your results in the two-column (grapheme sequence, tab-character,
tokenized phone sequence) TSV format, the same one used for the training and
development data. If your system only provides the predicted phone sequences,
use the UNIX command-line tool paste to combine the columns.
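For systems driven from Python rather than the shell, the same column-joining step can be sketched as follows. The helper name `paste_columns` and the sample values are hypothetical; the sketch assumes the grapheme sequences and predicted phone sequences are already in matching order, one per test example.

```python
from typing import Iterable, List


def paste_columns(graphemes: Iterable[str], phones: Iterable[str]) -> List[str]:
    """Joins two parallel columns with a tab, much as `paste` does by default."""
    g = list(graphemes)
    p = list(phones)
    if len(g) != len(p):
        # A length mismatch means predictions are missing or misaligned.
        raise ValueError(f"Column lengths differ: {len(g)} vs. {len(p)}")
    return [f"{word}\t{pron}" for word, pron in zip(g, p)]


# Hypothetical sample: test-file graphemes and a system's predicted phones.
rows = paste_columns(["antonim", "ploaie"], ["a n t o n i m", "p lʷ a j e"])
```

Unlike `paste`, this sketch fails loudly when the two columns have different lengths, which may help catch dropped predictions before submission.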
Data can be obtained here.
Exclusions
We exclude from the provided data any words which:
- have multiple pronunciations in the source data,
- consist of fewer than 3 graphemes, or
- consist of fewer than 3 phonemes.
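The three exclusion criteria can be sketched as a filter over (word, pronunciation) pairs. This is an illustration only, not the organizers' actual preprocessing: the function name `filter_entries` is hypothetical, graphemes are counted as Unicode codepoints (following the Format section), and phonemes are counted as space-separated tokens.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def filter_entries(entries: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keeps words with one pronunciation, >= 3 graphemes, and >= 3 phonemes."""
    prons: Dict[str, Set[str]] = defaultdict(set)
    for word, pron in entries:
        prons[word].add(pron)
    kept = []
    for word, variants in prons.items():
        if len(variants) != 1:
            continue  # Multiple pronunciations in the source data.
        (pron,) = variants
        if len(word) < 3:
            continue  # Fewer than 3 graphemes (counted as codepoints).
        if len(pron.split()) < 3:
            continue  # Fewer than 3 phonemes (counted as tokens).
        kept.append((word, pron))
    return kept


# Hypothetical entries, invented for illustration.
entries = [
    ("ab", "a b"),
    ("read", "r iː d"),
    ("read", "r ɛ d"),
    ("antonim", "a n t o n i m"),
    ("aaaa", "a a"),
]
kept = filter_entries(entries)
```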
External data
Participants are permitted to use:
- open-source databases of phoneme inventories and features such as Phoible (Moran & McCloy 2019),
- open-source pronunciation data for languages not targeted in this challenge, and
- open-source morphological analyzers and lexicons such as UDLexicons (Sagot 2018).
Participants who use such data must disclose their use of it at time of submission.
Participants are not permitted to use any form of pronunciation data derived from Wiktionary, except for the provided training and test data; they are also not permitted to use external pronunciation dictionaries for any of the targeted languages.
Evaluation
Systems should predict a single phone sequence for each test example.
Metrics
The primary measure will be the word error rate (WER), which is the percentage
of words for which the hypothesized transcription sequence does not match the
gold transcription. We also report phone error rate (PER), the micro-averaged
edit distance between hypotheses and gold transcriptions, computed by summing
the minimum edit distance between each hypothesis and its gold transcription
and then dividing by the summed length of the gold transcriptions. As is common
practice, we multiply both numbers by 100. Both metrics will be computed using
the provided Python script evaluate.py, available here.
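The two definitions above can be sketched in a few lines of Python. This is an illustrative reimplementation based only on the prose description, not the contents of evaluate.py; the function names `edit_distance` and `wer_per` and the sample data are hypothetical, and the edit distance is assumed to be unit-cost Levenshtein distance over phone tokens.

```python
from typing import List, Sequence, Tuple


def edit_distance(x: Sequence[str], y: Sequence[str]) -> int:
    """Computes unit-cost Levenshtein distance between two token sequences."""
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, 1):
        curr = [i]
        for j, yj in enumerate(y, 1):
            curr.append(
                min(
                    prev[j] + 1,  # Deletion.
                    curr[j - 1] + 1,  # Insertion.
                    prev[j - 1] + (xi != yj),  # Substitution or match.
                )
            )
        prev = curr
    return prev[-1]


def wer_per(gold: List[List[str]], hypo: List[List[str]]) -> Tuple[float, float]:
    """Returns (WER, PER), both multiplied by 100."""
    if len(gold) != len(hypo):
        raise ValueError("Gold and hypothesis lists differ in length")
    word_errors = 0
    edits = 0
    gold_length = 0
    for g, h in zip(gold, hypo):
        word_errors += g != h
        edits += edit_distance(g, h)
        gold_length += len(g)
    return 100 * word_errors / len(gold), 100 * edits / gold_length


# Hypothetical two-word test set: the second hypothesis drops the final phone.
gold = ["a n t o n i m".split(), "p l a j e".split()]
hypo = ["a n t o n i m".split(), "p l a j".split()]
wer, per = wer_per(gold, hypo)
```

In this sample one of two words is wrong, so WER is 50, and one edit over 12 gold phones gives a PER of 100/12, roughly 8.33. Because PER is micro-averaged, longer words contribute more to it than shorter ones.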
System comparison
We will evaluate on each language separately. The final system ranking will be produced by macro-averaging the per-language WERs. We will also employ statistical analysis for system comparison.
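Macro-averaging here means that each language's WER counts equally, regardless of how many errors it contributes. A minimal sketch follows; the system names and WER values are invented for illustration and are not actual results.

```python
from statistics import mean
from typing import Dict, List


def macro_wer(per_language_wer: Dict[str, float]) -> float:
    """Averages per-language WERs, weighting every language equally."""
    return mean(per_language_wer.values())


def rank_systems(results: Dict[str, Dict[str, float]]) -> List[str]:
    """Orders system names from lowest (best) to highest macro-averaged WER."""
    return sorted(results, key=lambda name: macro_wer(results[name]))


# Invented numbers for two hypothetical systems on two languages.
results = {
    "sys_a": {"arm": 10.0, "bul": 30.0},
    "sys_b": {"arm": 15.0, "bul": 20.0},
}
ranking = rank_systems(results)
```

With these invented numbers, sys_a wins on one language but sys_b has the lower macro-average (17.5 versus 20.0) and so ranks first.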
Baselines
We provide implementations of two baseline systems for the task:
- a pair n-gram model (Novak et al. 2016) implemented using the OpenGrm toolkit (Roark et al. 2012, Gorman 2016), and
- a bidirectional LSTM encoder-decoder sequence model implemented using the Fairseq toolkit (Ott et al. 2019).
The baselines are available here.
Participants are welcome to adapt these baselines for their purposes.
Submission
Participants will submit to this task by sending their models’ predictions to sigmorphon2020.task1@gmail.com by May 5th, 2020 (the original deadline was April 27th; see the revised timeline). Participants may submit predictions from as many models as they wish; each submission will be scored separately. Participants must submit predictions for all languages to be scored. Participants must specify any external resources used at time of submission.
System description papers will be submitted using softconf; links will be provided at a later date.
Timeline
- February 24th, 2020: Training and development splits for development languages released; we invite participants to report errors.
- February 24th, 2020: Neural and non-neural baselines for development languages released.
- April 20th, 2020 (originally April 13th): Training and development splits for surprise languages released.
- April 27th, 2020 (originally April 20th): Test splits for all languages (both development and surprise) released.
- May 5th, 2020 (originally April 27th): Participants submit test predictions on all languages.
- May 17th, 2020 (originally May 11th): Participants’ system description papers due.
- May 24th, 2020 (originally May 18th): Camera-ready versions of participants’ system description papers due.
Overview paper
In an overview paper for the shared task, we will compare the performance of submitted systems in detail. We will assess:
- which systems are significantly different in performance,
- which languages were challenging and which types of systems succeeded on them, and
- which systems would provide complementary benefit in an ensemble system.
Included in the paper will be a summary of scores for all participants who produce outputs for all targeted languages.
Organizers
This task is organized by Lucas Ashby and Kyle Gorman at the Graduate Center, City University of New York, with help from other members of the WikiPron team.
Contact: Kyle Gorman.
References
Gorman, K. (2016). Pynini: a Python library for weighted finite-state grammar compilation. In Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, pages 75–80, Berlin. Association for Computational Linguistics.
Lee, J. L., Ashby, L. F. E., Garza, M. E., Lee-Sikka, Y., Miller, S., Wong, A., McCarthy, A. D., and Gorman, K. (in press). Massively multilingual pronunciation mining with WikiPron. To appear in the proceedings of LREC 2020.
Moran, S. and Cysouw, M. (2018). The Unicode cookbook for linguists: managing writing systems using orthography profiles. Berlin: Language Science Press.
Moran, S. and McCloy, D. (2019). PHOIBLE 2.0. Jena: Max Planck Institute for the Science of Human History.
Novak, J. R., Minematsu, N., and Hirose, K. (2016). Phonetisaurus: exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Natural Language Engineering, 22(6):907–938.
Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. (2019). fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis. Association for Computational Linguistics.
Roark, B., Sproat, R., Allauzen, C., Riley, M., Sorensen, J., and Tai, T. (2012). The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61–66, Jeju Island, Korea. Association for Computational Linguistics.
Sagot, B. (2018). A multilingual collection of CoNLL-U-compatible morphological lexicons. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC), pages 1861–1867, Miyazaki, Japan. European Language Resources Association.