Universal Dependencies - Morphosyntactically Tagged Corpora

UDMorph is a gathering point for morphosyntactically annotated corpora and taggers covering many languages and writing styles, all following the annotation guidelines of Universal Dependencies (UD). It has three primary objectives:

  1. A point of entry for taggers: to provide a list of taggers with an online interface for as many languages as possible, making it possible to run NLP tasks beyond the major languages. Taggers can be existing online REST services, existing taggers installed locally, or taggers trained locally on collected training data. The list will be updated whenever new or better taggers become available.

  2. A collection of annotated data: to provide annotated training data for as many languages and writing styles as possible, so that more taggers can be trained on those data. All data contain at least a UD part-of-speech (POS) tag, possibly also an original non-UD POS tag, and a lemma. The collection is intended as a supplement to the main UD data, covering data that have no dependency relations.

  3. An annotation tool: to provide an easy-to-use online tool for creating new annotated data, with interactive guidance on following the UD guidelines and computational tools that provide pre-tagging for manual correction. Locally created datasets will be used to add new taggers once they reach a sufficient size, and will be made available as a git repository and a HuggingFace dataset. We intend to organize courses and hackathon sessions to get people started with new annotated datasets.
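
To make the second objective more concrete, the sketch below shows what a sentence annotated with a lemma, a UD POS tag, and an original POS tag but no dependency relations might look like. It assumes the ten-column CoNLL-U layout used by the main UD data (the text above does not name a file format), and the sample sentence, the Penn-style original tags, and the helper name read_tokens are all hypothetical illustrations rather than part of UDMorph.

```python
# Hypothetical sample in CoNLL-U layout (an assumption, not taken from UDMorph).
# Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
# HEAD and DEPREL are left as "_" because the data have no dependency relations.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\t_\t_\t_\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t_\t_\t_\t_\n"
    "3\t.\t.\tPUNCT\t.\t_\t_\t_\t_\t_\n"
)


def read_tokens(text):
    """Return form, lemma, UD POS and original POS for each token line."""
    tokens = []
    for line in text.splitlines():
        # Skip blank lines (sentence breaks) and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tokens.append({
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            # "_" marks an empty field, e.g. when no original tag exists.
            "xpos": None if cols[4] == "_" else cols[4],
        })
    return tokens


if __name__ == "__main__":
    for token in read_tokens(SAMPLE):
        print(token["form"], token["lemma"], token["upos"], token["xpos"])
```

Keeping the unused dependency columns as "_" would let such files be read by tools written for the main UD data, which is one reason this layout is a plausible choice here.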

More information about UDMorph can be found in the mission statement. Information about the current size of the data in UDMorph can be found on the statistics page.

We warmly invite people to contribute to UDMorph, since it can only grow through the participation of researchers around the world. Read more on how to contribute.