Coreference in Universal Dependencies 1.4 (CorefUD 1.4)
Please use the following text to cite this item:
Novák, Michal; et al., 2026,
Coreference in Universal Dependencies 1.4 (CorefUD 1.4), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-6108.
Authors
Novák, Michal ; et al.
Item identifier
http://hdl.handle.net/11234/1-6108
Date issued
2026-02-18
Size
396281 sentences,
6956975 words,
7073892 tokens
Description
CorefUD is a collection of existing coreference-annotated datasets that have been converted to a unified annotation scheme. The current version (1.4) comprises 33 datasets covering 19 languages. Where manual morphosyntactic annotation is not available or cannot be reliably converted, the datasets are enriched with automatically assigned morphological and syntactic annotations that fully comply with the standards of the Universal Dependencies project. The data are stored in the CoNLL-U format, with coreference- and bridging-specific information encoded as attribute–value pairs in the MISC column.

The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The public edition is distributed via LINDAT/CLARIAH-CZ and contains 29 datasets for 19 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 3 for Czech, 1 for Dutch, 4 for English, 3 for French, 2 for German, 1 for Hindi, 2 for Hungarian, 1 for Korean, 1 for Latin, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding test portions. The non-public edition is available only to ÚFAL members. It adds 4 datasets for 2 languages (1 for Dutch and 3 for English) that cannot be redistributed due to licensing restrictions, and it also contains the test portions of all datasets.

When using any of the harmonized datasets, please review the respective license (available in the same directory as the data) and cite the original resource.

Compared to version 1.3, version 1.4 introduces new languages and corpora: Czech-PDTSC, Latin-CorefLat, Dutch-OpenBoek, English-FantasyCoref, and French-LitBankFr. The last three consist of long literary documents. In addition, English-GUM, Czech-PCEDT, and Czech-PDT have been updated to newer releases. A detailed list of changes for each dataset is provided in the corresponding README file.
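As an illustration of how the attribute–value pairs in the MISC column can be read, here is a minimal Python sketch. It relies only on the general CoNLL-U conventions (ten tab-separated columns, MISC as the last one, pairs separated by `|`, `_` for an empty field). The function names, the sample token line, and the attribute names shown in it are illustrative assumptions; the actual attributes used by CorefUD are described in the format documentation shipped with the data.

```python
def parse_misc(misc):
    """Split a CoNLL-U MISC field into a dict of attribute-value pairs."""
    if not misc or misc == "_":
        return {}
    pairs = {}
    for item in misc.split("|"):
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else None
    return pairs


def misc_of_token_line(line):
    """Return the parsed MISC column of one CoNLL-U token line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return parse_misc(columns[9])


# Hypothetical token line, made up for this example (not taken from the data).
sample = "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\tEntity=(e1)|SpaceAfter=No"
print(misc_of_token_line(sample))  # {'Entity': '(e1)', 'SpaceAfter': 'No'}
```

Comment lines (starting with `#`) and blank lines separating sentences are not token lines and should be skipped before calling `misc_of_token_line`.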
Acknowledgement
Ministry of Education, Youth and Sports of the Czech Republic (Ministerstvo školství, mládeže a tělovýchovy České republiky)
Project code: LM2023062
Project name: LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities
Czech Science Foundation (Grantová agentura České republiky)
Project code: GX20-16819X
Project name: LUSyD – Language Understanding: from Syntax to Discourse
UK
Project code: UNCE/24/SSH/009
Project name: Multilingual Lens: Investigating Large Text Corpora from Different Methodological Perspectives
Ministry of Education, Youth and Sports of the Czech Republic (Ministerstvo školství, mládeže a tělovýchovy České republiky)
Project code: CZ.02.01.01/00/23_020/0008518
Project name: Linguistics, Artificial Intelligence and Language and Speech Technologies: from Research to Applications
Files in this item
- Name: CorefUD-1.4-public.zip
- Size: 117.65 MB
- Format: application/zip
- MD5: 7cc52acc2e6e79a35a6e65e71ad67d2e
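The MD5 value listed above can be used to check that the archive was downloaded intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names and the assumption that the archive has been saved locally under its listed name are illustrative.

```python
import hashlib

# MD5 checksum as listed for CorefUD-1.4-public.zip.
EXPECTED_MD5 = "7cc52acc2e6e79a35a6e65e71ad67d2e"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_checksum("CorefUD-1.4-public.zip")` should return `True` for an undamaged download, assuming the file sits in the current working directory.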

- CorefUD-1.4-public
  - data
    - CorefUD_Latin-CorefLat
      - la_coreflat-corefud-train.conllu (1 MB)
      - README.md (2 kB)
      - LICENSE.txt (19 kB)
      - la_coreflat-corefud-dev.conllu (204 kB)
    - CorefUD_Czech-PDT
      - README.md (5 kB)
      - cs_pdt-corefud-train.conllu (83 MB)
      - cs_pdt-corefud-dev.conllu (11 MB)
      - LICENSE.txt (20 kB)
    - CorefUD_Catalan-AnCora
      - ca_ancora-corefud-train.conllu (26 MB)
      - README.md (3 kB)
      - ca_ancora-corefud-dev.conllu (3 MB)
      - LICENSE.txt (189 B)
    - CorefUD_Old_Church_Slavonic-PROIEL
      - cu_proiel-corefud-train.conllu (6 MB)
      - README.md (2 kB)
      - LICENSE.txt (20 kB)
      - cu_proiel-corefud-dev.conllu (1 MB)
    - CorefUD_Hungarian-KorKor
      - hu_korkor-corefud-dev.conllu (215 kB)
      - README.md (3 kB)
      - LICENSE.txt (12 kB)
      - hu_korkor-corefud-train.conllu (1 MB)
    - CorefUD_Czech-PDTSC
      - cs_pdtsc-corefud-train.conllu (80 MB)
      - README.md (2 kB)
      - LICENSE.txt (20 kB)
      - cs_pdtsc-corefud-dev.conllu (7 MB)
    - CorefUD_English-ParCorFull
      - README.md (2 kB)
      - en_parcorfull-corefud-dev.conllu (71 kB)
      - en_parcorfull-corefud-train.conllu (563 kB)
      - LICENSE.txt (18 kB)
    - CorefUD_Norwegian-NynorskNARC
      - README.md (2 kB)
      - no_nynorsknarc-corefud-train.conllu (11 MB)
      - no_nynorsknarc-corefud-dev.conllu (1 MB)
      - LICENSE.txt (19 kB)
    - CorefUD_Hungarian-SzegedKoref
      - hu_szegedkoref-corefud-train.conllu (7 MB)
      - README.md (2 kB)
      - LICENSE.txt (18 kB)
      - hu_szegedkoref-corefud-dev.conllu (952 kB)
    - CorefUD_Czech-PCEDT
      - cs_pcedt-corefud-train.conllu (114 MB)
      - cs_pcedt-corefud-dev.conllu (20 MB)
      - README.md (3 kB)
      - LICENSE.txt (20 kB)
    - CorefUD_Spanish-AnCora
      - README.md (3 kB)
      - es_ancora-corefud-train.conllu (29 MB)
      - es_ancora-corefud-dev.conllu (3 MB)
      - LICENSE.txt (189 B)
    - CorefUD_Ancient_Greek-PROIEL
      - README.md (2 kB)
      - LICENSE.txt (20 kB)
      - grc_proiel-corefud-train.conllu (7 MB)
      - grc_proiel-corefud-dev.conllu (499 kB)
    - CorefUD_Polish-PCC
      - README.md (3 kB)
      - pl_pcc-corefud-dev.conllu (6 MB)
      - LICENSE.txt (19 kB)
      - pl_pcc-corefud-train.conllu (49 MB)
    - CorefUD_Russian-RuCor
      - ru_rucor-corefud-train.conllu (10 MB)
      - README.md (1 kB)
      - LICENSE.txt (19 kB)
      - ru_rucor-corefud-dev.conllu (1 MB)
    - CorefUD_Ancient_Hebrew-PTNK
      - hbo_ptnk-corefud-dev.conllu (1 MB)
      - README.md (2 kB)
      - hbo_ptnk-corefud-train.conllu (884 kB)
      - LICENSE.txt (18 kB)
    - CorefUD_German-PotsdamCC
      - de_potsdamcc-corefud-train.conllu (2 MB)
      - README.md (2 kB)
      - de_potsdamcc-corefud-dev.conllu (365 kB)
      - LICENSE.txt (20 kB)
    - CorefUD_English-FantasyCoref
      - README.md (2 kB)
      - LICENSE.txt (19 kB)
      - en_fantasycoref-corefud-dev.conllu (1 MB)
      - en_fantasycoref-corefud-train.conllu (16 MB)
    - CorefUD_English-GUM
      - en_gum-corefud-train.conllu (18 MB)
      - README.md (1 kB)
      - en_gum-corefud-dev.conllu (2 MB)
      - LICENSE.txt (1 kB)
    - CorefUD_English-LitBank
      - en_litbank-corefud-dev.conllu (1 MB)
      - en_litbank-corefud-train.conllu (9 MB)
      - README.md (2 kB)
      - LICENSE.txt (18 kB)
    - CorefUD_Hindi-HDTB
      - README.md (3 kB)
      - hi_hdtb-corefud-dev.conllu (996 kB)
      - LICENSE.txt (20 kB)
      - hi_hdtb-corefud-train.conllu (3 MB)
    - CorefUD_Lithuanian-LCC
      - README.md (2 kB)
      - lt_lcc-corefud-train.conllu (2 MB)
      - LICENSE.txt (1 kB)
      - lt_lcc-corefud-dev.conllu (297 kB)
    - CorefUD_French-Democrat
      - README.md (2 kB)
      - fr_democrat-corefud-dev.conllu (1 MB)
      - fr_democrat-corefud-train.conllu (14 MB)
      - LICENSE.txt (19 kB)
    - CorefUD_French-LitBankFr
      - README.md (2 kB)
      - fr_litbankfr-corefud-dev.conllu (2 MB)
      - LICENSE.txt (19 kB)
      - fr_litbankfr-corefud-train.conllu (12 MB)
    - CorefUD_French-ANCOR
      - fr_ancor-corefud-dev.conllu (2 MB)
      - README.md (2 kB)
      - fr_ancor-corefud-train.conllu (25 MB)
      - LICENSE.txt (20 kB)
    - CorefUD_Turkish-ITCC
      - tr_itcc-corefud-dev.conllu (521 kB)
      - README.md (3 kB)
      - LICENSE.txt (20 kB)
      - tr_itcc-corefud-train.conllu (4 MB)
    - CorefUD_Dutch-OpenBoek
      - README.md (1 kB)
      - nl_openboek-corefud-train.conllu (4 MB)
      - nl_openboek-corefud-dev.conllu (1 MB)
      - LICENSE.txt (18 kB)
    - CorefUD_Korean-ECMT
      - README.md (3 kB)
      - ko_ecmt-corefud-train.conllu (32 MB)
      - LICENSE.txt (18 kB)
      - ko_ecmt-corefud-dev.conllu (3 MB)
    - CorefUD_German-ParCorFull
      - README.md (2 kB)
      - LICENSE.txt (18 kB)
      - de_parcorfull-corefud-dev.conllu (88 kB)
      - de_parcorfull-corefud-train.conllu (686 kB)
    - CorefUD_Norwegian-BokmaalNARC
      - no_bokmaalnarc-corefud-train.conllu (13 MB)
      - README.md (2 kB)
      - no_bokmaalnarc-corefud-dev.conllu (1 MB)
      - LICENSE.txt (19 kB)
  - doc
    - corefud-1.0-format.pdf (160 kB)
    - README.txt (16 kB)
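The file names in the listing follow a visible pattern: a language code, a corpus name, the literal `-corefud-`, and the split (`train` or `dev`). The sketch below shows one way to collect the available splits per dataset after unpacking the archive. The regular expression is inferred from the file names in this listing only, and the function names and the `root` argument (the path of the unpacked `CorefUD-1.4-public` directory) are assumptions made for this example.

```python
import re
from pathlib import Path

# Pattern inferred from the listed names, e.g. "cs_pdt-corefud-train.conllu".
NAME_RE = re.compile(
    r"^(?P<lang>[a-z]+)_(?P<corpus>[a-z0-9]+)-corefud-(?P<split>train|dev)\.conllu$"
)


def parse_name(filename):
    """Return language code, corpus and split for a data file name, else None."""
    match = NAME_RE.match(filename)
    return match.groupdict() if match else None


def list_splits(root):
    """Map each dataset directory name to its {split: path} files."""
    found = {}
    for path in sorted(Path(root).glob("data/CorefUD_*/*.conllu")):
        info = parse_name(path.name)
        if info is None:
            continue  # skip files that do not follow the inferred pattern
        found.setdefault(path.parent.name, {})[info["split"]] = path
    return found
```

For example, `list_splits("CorefUD-1.4-public")["CorefUD_Czech-PDT"]["dev"]` would give the path of `cs_pdt-corefud-dev.conllu`, assuming the archive was unpacked into the current directory.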

