This is a new version of the repository. Do let us know (lindat-help at ufal.mff.cuni.cz) if you encounter any issues.

Coreference in Universal Dependencies 1.4 (CorefUD 1.4)

Please use the following text to cite this item or export to a predefined format:
Novák, Michal; et al., 2026, Coreference in Universal Dependencies 1.4 (CorefUD 1.4), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-6108.
Date issued
2026-02-18
Size
396281 sentences,
6956975 words,
7073892 tokens
Description
CorefUD is a collection of previously existing coreference-annotated datasets that have been converted to a unified annotation scheme. In its current version (1.4), CorefUD comprises 33 datasets covering 19 languages. The datasets are enriched with automatically assigned morphological and syntactic annotations, fully compliant with the standards of the Universal Dependencies project, in cases where manual morphosyntactic annotation is not available or cannot be reliably converted. The data are stored in the CoNLL-U format, with coreference- and bridging-specific information encoded as attribute–value pairs in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The public edition is distributed via LINDAT-CLARIAH-CZ and contains 29 datasets for 19 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 3 for Czech, 1 for Dutch, 4 for English, 3 for French, 2 for German, 1 for Hindi, 2 for Hungarian, 1 for Korean, 1 for Latin, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding test portions. The non-public edition is available internally to ÚFAL members and includes an additional 4 datasets for 2 languages (1 for Dutch and 3 for English) that cannot be redistributed due to licensing restrictions. It also contains the test portions for all datasets. When using any of the harmonized datasets, please review the respective license (available in the same directory as the data) and cite the original resource. Compared to version 1.3, version 1.4 introduces new languages and corpora: Czech-PDTSC, Latin-CorefLat, Dutch-OpenBoek, English-FantasyCoref, and French-LitBankFr. The last three consist of long literary documents. In addition, English-GUM, Czech-PCEDT, and Czech-PDT have been updated to newer releases. A detailed list of changes for each dataset is provided in the corresponding README file.
Acknowledgement
This item isPublicly Available
and licensed under:
 Files in this item
Name
CorefUD-1.4-public.zip
Size
117.65 MB
Format
application/zip
Description
MD5
7cc52acc2e6e79a35a6e65e71ad67d2e
Preview
  File Preview