This is not the latest version of this item. The latest version can be found here.
Coreference in Universal Dependencies 1.3 (CorefUD 1.3)
Please use the following text to cite this item:
Novák, Michal; et al., 2025,
Coreference in Universal Dependencies 1.3 (CorefUD 1.3), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-5896.
Authors
Novák, Michal ; et al.
Date issued
2025-04-17
Size
303350 sentences,
5635895 words,
5698427 tokens
Description
CorefUD is a collection of existing coreference-annotated datasets that we converted into a common annotation scheme. The current version, 1.3, consists of 28 datasets for 18 languages. The datasets are enriched with automatic morphological and syntactic annotations that fully comply with the standards of the Universal Dependencies project. All datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs in the MISC column.
The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The public edition is distributed via LINDAT/CLARIAH-CZ and contains 24 datasets for 17 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 2 for Czech, 3 for English, 2 for French, 2 for German, 1 for Hindi, 2 for Hungarian, 1 for Korean, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute because of their original license restrictions. It also contains the test data portions of all datasets.
When using any of the harmonized datasets, please read its license (placed in the same directory as the data) and cite the original data resource as well.
Compared to the previous version 1.2, version 1.3 adds new languages and corpora, namely French-ANCOR, Hindi-HDTB, and Korean-ECMT. In addition, English-GUM and Czech-PDT have been updated to newer versions, and the conversion of zeros in Hungarian-KorKor has been improved (a list of all changes in each dataset can be found in the corresponding README file).
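As a rough illustration of the storage scheme described above, the following sketch reads CoNLL-U token lines and splits the MISC column (the tenth tab-separated field) into attribute-value pairs. The helper names `parse_misc` and `iter_misc` are hypothetical, and the sample lines, including the `Entity=(e1)` value, are made up for illustration rather than taken from the data; the actual attribute names and value syntax should be checked against the format documentation shipped with the collection.

```python
def parse_misc(misc_field):
    """Split a CoNLL-U MISC field into a dict of attribute-value pairs."""
    if misc_field == "_":  # "_" marks an empty field in CoNLL-U
        return {}
    pairs = {}
    for item in misc_field.split("|"):  # attributes are separated by "|"
        key, sep, value = item.partition("=")  # split on the first "=" only
        pairs[key] = value if sep else None
    return pairs


def iter_misc(lines):
    """Yield (token_id, form, misc_dict) for every token line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # skip blank lines and comments
            continue
        cols = line.split("\t")
        if len(cols) != 10:  # a CoNLL-U token line has exactly 10 columns
            continue
        yield cols[0], cols[1], parse_misc(cols[9])


# Made-up sample sentence (illustrative only, not from CorefUD).
sample = [
    "# sent_id = illustrative-1",
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\tEntity=(e1)",
    "2\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
]

for token_id, form, misc in iter_misc(sample):
    print(token_id, form, misc)
```

Keeping the parser generic (any `Key=Value` pair) avoids hard-coding assumptions about which coreference or bridging attributes a particular dataset uses.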
Acknowledgement
Ministry of Education, Youth and Sports of the Czech Republic (Ministerstvo školství, mládeže a tělovýchovy České republiky)
Project code: LM2023062
Project name: LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities
Czech Science Foundation (Grantová agentura České republiky)
Project code: GX20-16819X
Project name: LUSyD – Language Understanding: from Syntax to Discourse
Charles University (UK)
Project code: UNCE/24/SSH/009
Project name: Multilingual Lens: Investigating Large Text Corpora from Different Methodological Perspectives
Ministry of Education, Youth and Sports of the Czech Republic (Ministerstvo školství, mládeže a tělovýchovy České republiky)
Project code: CZ.02.01.01/00/23_020/0008518
Project name: Linguistics, Artificial Intelligence and Language and Speech Technologies: from Research to Applications
Files in this item
- Name: CorefUD-1.3-public.zip
- Size: 97.8 MB
- Format: application/zip
- Description: Zip
- MD5: 95acda0b065942f8e3dae25c7ae45ddc
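
The MD5 value listed above can be used to check a downloaded copy of the archive. The sketch below is one way to do that with Python's standard `hashlib` module; the function names `md5_of_file` and `matches_listed_md5` are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so a large archive is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, expected="95acda0b065942f8e3dae25c7ae45ddc"):
    """Compare a local file against the MD5 listed for CorefUD-1.3-public.zip."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_listed_md5("CorefUD-1.3-public.zip")` should return `True` for an intact download saved under that name in the current directory. MD5 is suitable here only as a transfer-integrity check, not as a security guarantee.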

- CorefUD-1.3-public/
  - data/
    - CorefUD_Czech-PDT/
      - README.md (4 kB)
      - cs_pdt-corefud-train.conllu (83 MB)
      - cs_pdt-corefud-dev.conllu (11 MB)
      - LICENSE.txt (20 kB)
    - CorefUD_Catalan-AnCora/
      - ca_ancora-corefud-train.conllu (26 MB)
      - README.md (3 kB)
      - ca_ancora-corefud-dev.conllu (3 MB)
      - LICENSE.txt (189 B)
    - CorefUD_Hungarian-KorKor/
      - hu_korkor-corefud-dev.conllu (215 kB)
      - README.md (3 kB)
      - LICENSE.txt (12 kB)
      - hu_korkor-corefud-train.conllu (1 MB)
    - CorefUD_Old_Church_Slavonic-PROIEL/
      - cu_proiel-corefud-train.conllu (6 MB)
      - README.md (2 kB)
      - LICENSE.txt (20 kB)
      - cu_proiel-corefud-dev.conllu (1 MB)
    - CorefUD_English-ParCorFull/
      - README.md (2 kB)
      - en_parcorfull-corefud-dev.conllu (71 kB)
      - en_parcorfull-corefud-train.conllu (563 kB)
      - LICENSE.txt (18 kB)
    - CorefUD_Norwegian-NynorskNARC/
      - README.md (2 kB)
      - no_nynorsknarc-corefud-train.conllu (11 MB)
      - no_nynorsknarc-corefud-dev.conllu (1 MB)
      - LICENSE.txt (19 kB)
    - CorefUD_Hungarian-SzegedKoref/
      - hu_szegedkoref-corefud-train.conllu (7 MB)
      - README.md (1 kB)
      - LICENSE.txt (18 kB)
      - hu_szegedkoref-corefud-dev.conllu (951 kB)
    - CorefUD_Czech-PCEDT/
      - cs_pcedt-corefud-train.conllu (114 MB)
      - cs_pcedt-corefud-dev.conllu (20 MB)
      - README.md (2 kB)
      - LICENSE.txt (21 kB)
    - CorefUD_Ancient_Greek-PROIEL/
      - README.md (2 kB)
      - LICENSE.txt (20 kB)
      - grc_proiel-corefud-train.conllu (7 MB)
      - grc_proiel-corefud-dev.conllu (499 kB)
    - CorefUD_Spanish-AnCora/
      - README.md (3 kB)
      - es_ancora-corefud-train.conllu (29 MB)
      - es_ancora-corefud-dev.conllu (3 MB)
      - LICENSE.txt (189 B)
    - CorefUD_Polish-PCC/
      - README.md (2 kB)
      - pl_pcc-corefud-dev.conllu (6 MB)
      - LICENSE.txt (19 kB)
      - pl_pcc-corefud-train.conllu (49 MB)
    - CorefUD_Russian-RuCor/
      - ru_rucor-corefud-train.conllu (10 MB)
      - README.md (1 kB)
      - LICENSE.txt (19 kB)
      - ru_rucor-corefud-dev.conllu (1 MB)
    - CorefUD_Ancient_Hebrew-PTNK/
      - hbo_ptnk-corefud-dev.conllu (1 MB)
      - README.md (2 kB)
      - hbo_ptnk-corefud-train.conllu (884 kB)
      - LICENSE.txt (18 kB)
    - CorefUD_German-PotsdamCC/
      - de_potsdamcc-corefud-train.conllu (2 MB)
      - README.md (2 kB)
      - de_potsdamcc-corefud-dev.conllu (363 kB)
      - LICENSE.txt (20 kB)
    - CorefUD_English-GUM/
      - en_gum-corefud-train.conllu (18 MB)
      - README.md (1 kB)
      - en_gum-corefud-dev.conllu (2 MB)
      - LICENSE.txt (5 kB)
    - CorefUD_English-LitBank/
      - en_litbank-corefud-dev.conllu (1 MB)
      - en_litbank-corefud-train.conllu (9 MB)
      - README.md (2 kB)
      - LICENSE.txt (18 kB)
    - CorefUD_Hindi-HDTB/
      - README.md (3 kB)
      - hi_hdtb-corefud-dev.conllu (996 kB)
      - LICENSE.txt (20 kB)
      - hi_hdtb-corefud-train.conllu (3 MB)
    - CorefUD_Lithuanian-LCC/
      - README.md (2 kB)
      - lt_lcc-corefud-train.conllu (2 MB)
      - LICENSE.txt (1 kB)
      - lt_lcc-corefud-dev.conllu (298 kB)
    - CorefUD_French-Democrat/
      - README.md (2 kB)
      - fr_democrat-corefud-dev.conllu (1 MB)
      - fr_democrat-corefud-train.conllu (14 MB)
      - LICENSE.txt (19 kB)
    - CorefUD_French-ANCOR/
      - fr_ancor-corefud-dev.conllu (2 MB)
      - README.md (2 kB)
      - fr_ancor-corefud-train.conllu (25 MB)
      - LICENSE.txt (20 kB)
    - CorefUD_Turkish-ITCC/
      - tr_itcc-corefud-dev.conllu (521 kB)
      - README.md (3 kB)
      - LICENSE.txt (20 kB)
      - tr_itcc-corefud-train.conllu (4 MB)
    - CorefUD_Korean-ECMT/
      - README.md (3 kB)
      - ko_ecmt-corefud-train.conllu (32 MB)
      - LICENSE.txt (18 kB)
      - ko_ecmt-corefud-dev.conllu (3 MB)
    - CorefUD_German-ParCorFull/
      - README.md (2 kB)
      - LICENSE.txt (18 kB)
      - de_parcorfull-corefud-dev.conllu (87 kB)
      - de_parcorfull-corefud-train.conllu (684 kB)
    - CorefUD_Norwegian-BokmaalNARC/
      - no_bokmaalnarc-corefud-train.conllu (13 MB)
      - README.md (2 kB)
      - no_bokmaalnarc-corefud-dev.conllu (1 MB)
      - LICENSE.txt (19 kB)
  - doc/
    - corefud-1.0-format.pdf (160 kB)
    - README.txt (13 kB)
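
Given the file naming visible in the listing above (`<code>-corefud-train.conllu` and `<code>-corefud-dev.conllu` inside each `CorefUD_*` directory), the train and dev files of every dataset can be collected with a short script. This is a minimal sketch that assumes the archive has been extracted locally and that the dataset directories sit directly under the data directory passed in; `list_splits` is a hypothetical helper name.

```python
from pathlib import Path


def list_splits(data_dir):
    """Map each dataset directory name to its CoNLL-U files, keyed by split."""
    splits = {}
    for conllu in sorted(Path(data_dir).glob("CorefUD_*/*-corefud-*.conllu")):
        # The split name is the last hyphen-separated part of the file stem,
        # e.g. "cs_pdt-corefud-dev" -> "dev".
        split = conllu.stem.rsplit("-", 1)[-1]
        splits.setdefault(conllu.parent.name, {})[split] = conllu
    return splits
```

Calling `list_splits("CorefUD-1.3-public/data")` would then return a dictionary such as `{"CorefUD_Czech-PDT": {"dev": ..., "train": ...}, ...}`, with `Path` objects as values. The public edition contains no test files, so only `train` and `dev` keys are expected there.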

