UDMorph Mission Statement
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. The UD project hosts treebanks that conform to the UD standard, with regular releases and consistency checks. The UD treebanks are freely available as Git repositories, and are used by the community to train a range of NLP tools, including UDPipe and Stanza.
To be hosted in UD, however, a corpus needs to be a treebank, annotated with dependency relations. Despite the impressive collection of languages in UD, there are many more languages for which there are annotated corpora with part-of-speech tags and lemmas, but without dependency relations. The same holds not only for languages, but also for dialects, historical variants, specialized domains, etc. The motivation behind the current project is to provide a framework for hosting such tagged corpora, so that they can serve as training data for tokenizers, segmentation tools, taggers, and lemmatizers.
The repository can also host annotated data for languages for which there is a treebank, but for which there is also a much larger gold-standard corpus in the UD style that has POS tags and lemmas, but no dependencies. Those larger gold-standard datasets can be used to train more accurate taggers.
The data in this project is meant to serve two purposes: as training data for NLP tools, and as a starting point for building a full treebank.
The corpora in this project should move toward fully adhering to the UD standard, but can start out using non-UD POS tags (xpos). They can also store additional information apart from the UD fields, specifically a tag according to Universal Morphology as well as an Interlinear Glossed Text providing a morphological analysis.
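As an illustration of what a tagged-but-unparsed token could look like, here is a minimal sketch in Python that formats one token as a ten-column CoNLL-U line, leaving the dependency columns (HEAD, DEPREL, DEPS) as underscores. The function name format_token is hypothetical, and so is the choice to put the extra annotations in the MISC column under the keys UniMorph and Gloss; this project may store them differently. The English example token is invented for illustration only.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def format_token(idx, form, lemma="_", upos="_", xpos="_", feats="_", misc=None):
    """Format one token as a CoNLL-U line without dependency annotation.

    HEAD, DEPREL and DEPS are always written as "_" (unspecified).
    `misc` is an optional dict written as Key=Value pairs joined by "|".
    The MISC keys used by callers are an assumption of this sketch.
    """
    misc_str = "|".join(f"{k}={v}" for k, v in misc.items()) if misc else "_"
    columns = [str(idx), form, lemma, upos, xpos, feats, "_", "_", "_", misc_str]
    return "\t".join(columns)


# Invented example: an English plural noun with a non-UD tag in XPOS
# and hypothetical MISC keys for the additional annotations.
line = format_token(
    1, "houses", lemma="house", upos="NOUN", xpos="NNS",
    feats="Number=Plur",
    misc={"UniMorph": "N;PL", "Gloss": "house-PL"},
)
print(line)
```

A corpus that starts out with only its own tagset could call the same function with upos left at its default, so that only the XPOS column is filled.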
Each corpus in the UDMorph project should use an ISO 639-3 code, but can deviate from the name of the language provided by the ISO code. It can also provide a sub-classification for the ISO code, marking the corpus as a different language, dialect, or variant. For instance, Papiamento is a language used both on Aruba and Curaçao, but there are differences between the two variants, and they each use their own orthography. To reflect that, a corpus from Aruba could be tagged as pap-aru, while a corpus from Curaçao could be marked as pap-cur.
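The naming scheme above can be sketched as a small parser. This is a minimal Python sketch, assuming a corpus code is a three-letter lowercase ISO 639-3 code optionally followed by a hyphen and a lowercase alphanumeric sub-classification, as in pap-aru. The exact pattern, the CorpusCode type, and the parse_corpus_code function are assumptions made for illustration, not a specification of this project.

```python
import re
from typing import NamedTuple, Optional


class CorpusCode(NamedTuple):
    iso: str                # ISO 639-3 language code, e.g. "pap"
    variant: Optional[str]  # optional sub-classification, e.g. "aru"


# Assumed shape: three lowercase letters, then optionally "-" plus a sub-tag.
_CODE_RE = re.compile(r"([a-z]{3})(?:-([a-z0-9]+))?")


def parse_corpus_code(code: str) -> CorpusCode:
    """Split a corpus code such as "pap-aru" into language and variant."""
    match = _CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid corpus code: {code!r}")
    return CorpusCode(iso=match.group(1), variant=match.group(2))


print(parse_corpus_code("pap-aru"))
print(parse_corpus_code("pap-cur"))
print(parse_corpus_code("pap"))
```

The sketch only checks the shape of the code; it does not verify that the three-letter part is actually a registered ISO 639-3 code.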