This is not the latest version of this item. The latest version can be found here.
Coreference in Universal Dependencies 1.0 (CorefUD 1.0)
Please use the following text to cite this item or export to a predefined format:
Nedoluzhko, Anna; et al., 2022,
Coreference in Universal Dependencies 1.0 (CorefUD 1.0), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-4698.
Authors
Nedoluzhko, Anna; et al.
Item identifier
http://hdl.handle.net/11234/1-4698
Date issued
2022-04-06
Size
194344 sentences,
4061606 words,
4112513 tokens
Description
CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.0 consists of 17 datasets for 11 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column.

The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT/CLARIAH-CZ and contains 13 datasets for 10 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 1 for Hungarian, 1 for Lithuanian, 1 for Polish, 1 for Russian, and 1 for Spanish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets.

When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too.

Version 1.0 consists of the same corpora and languages as the previous version 0.2; however, the English GUM dataset has been updated to a newer and larger version, and in the Czech/English PCEDT dataset, the train-dev-test split has been changed to be compatible with OntoNotes. Nevertheless, the main change is in the file format: the MISC attributes have a new form and interpretation.
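As a rough illustration of where the coreference information lives, the following Python sketch reads CoNLL-U token lines and splits the MISC column (the tenth column) into attribute-value pairs. It relies only on the general CoNLL-U conventions (tab-separated columns, `|` between MISC items, `_` for an empty value). The function names and the sample sentence are hypothetical, and the `Entity=(e1)` value is a simplified placeholder; the actual attribute names and value syntax are defined in the format description shipped with the data (`corefud-1.0-format.pdf`).

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_misc(misc: str) -> Dict[str, str]:
    """Split a CoNLL-U MISC value into attribute-value pairs."""
    if misc == "_":  # "_" marks an empty column in CoNLL-U
        return {}
    pairs: Dict[str, str] = {}
    for item in misc.split("|"):
        # Split on the first "=" only, since values may contain "=" themselves.
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""
    return pairs


def iter_token_misc(lines: Iterable[str]) -> Iterator[Tuple[str, str, Dict[str, str]]]:
    """Yield (ID, FORM, parsed MISC) for every token line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence boundaries and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        yield cols[0], cols[1], parse_misc(cols[9])


# Hypothetical three-token sentence; the MISC values are illustrative only.
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "\t".join(["1", "Anna", "Anna", "PROPN", "_", "_", "2", "nsubj", "_", "Entity=(e1)"]),
    "\t".join(["2", "sleeps", "sleep", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])

if __name__ == "__main__":
    for token_id, form, misc in iter_token_misc(SAMPLE.splitlines()):
        print(token_id, form, misc)
```

To process a real file, the same generator can be fed an open file handle, for example `iter_token_misc(open("pl_pcc-corefud-dev.conllu", encoding="utf-8"))`.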
Acknowledgement
Ministerstvo školství, mládeže a tělovýchovy České republiky
Project code: LM2018101
Project name: LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy
Grantová agentura České republiky
Project code: GX20-16819X
Project name: LUSyD – Language Understanding: from Syntax to Discourse
Grantová agentura České republiky
Project code: 19-14534S
Project name: Popis slovotvorné struktury českých slov na základě jazykových dat
European Union
Project code: EC/H2020/825303
Project name: Bergamot - Browser-based Multilingual Translation
Files in this item
- Name: CorefUD-1.0-public.zip
- Size: 69.92 MB
- Format: application/zip
- Description: data
- MD5: f1a4e2301bdc5e3896546c22c6a94852

- CorefUD-1.0-public
  - data
    - CorefUD_Polish-PCC
      - README.md (1 kB)
      - pl_pcc-corefud-dev.conllu (6 MB)
      - LICENSE.txt (19 kB)
      - pl_pcc-corefud-train.conllu (48 MB)
    - CorefUD_Czech-PCEDT
      - cs_pcedt-corefud-train.conllu (105 MB)
      - cs_pcedt-corefud-dev.conllu (18 MB)
      - README.md (1 kB)
      - LICENSE.txt (21 kB)
    - CorefUD_French-Democrat
      - README.md (2 kB)
      - fr_democrat-corefud-dev.conllu (1 MB)
      - fr_democrat-corefud-train.conllu (14 MB)
      - LICENSE.txt (19 kB)
    - CorefUD_Lithuanian-LCC
      - README.md (1 kB)
      - lt_lcc-corefud-train.conllu (2 MB)
      - LICENSE.txt (1 kB)
      - lt_lcc-corefud-dev.conllu (298 kB)
    - CorefUD_German-PotsdamCC
      - de_potsdamcc-corefud-train.conllu (2 MB)
      - README.md (1 kB)
      - de_potsdamcc-corefud-dev.conllu (365 kB)
      - LICENSE.txt (20 kB)
    - CorefUD_English-ParCorFull
      - README.md (2 kB)
      - en_parcorfull-corefud-dev.conllu (69 kB)
      - en_parcorfull-corefud-train.conllu (536 kB)
      - LICENSE.txt (18 kB)
    - CorefUD_German-ParCorFull
      - README.md (2 kB)
      - LICENSE.txt (18 kB)
      - de_parcorfull-corefud-dev.conllu (88 kB)
      - de_parcorfull-corefud-train.conllu (692 kB)
    - CorefUD_Czech-PDT
      - README.md (1 kB)
      - cs_pdt-corefud-train.conllu (78 MB)
      - cs_pdt-corefud-dev.conllu (10 MB)
      - LICENSE.txt (20 kB)
    - CorefUD_Russian-RuCor
      - ru_rucor-corefud-train.conllu (10 MB)
      - README.md (1 kB)
      - LICENSE.txt (19 kB)
      - ru_rucor-corefud-dev.conllu (1 MB)
    - CorefUD_English-GUM
      - en_gum-corefud-train.conllu (10 MB)
      - README.md (1 kB)
      - en_gum-corefud-dev.conllu (1 MB)
      - LICENSE.txt (3 kB)
    - CorefUD_Spanish-AnCora
      - README.md (2 kB)
      - es_ancora-corefud-train.conllu (36 MB)
      - es_ancora-corefud-dev.conllu (4 MB)
      - LICENSE.txt (189 B)
    - CorefUD_Hungarian-SzegedKoref
      - hu_szegedkoref-corefud-train.conllu (9 MB)
      - README.md (1 kB)
      - LICENSE.txt (18 kB)
      - hu_szegedkoref-corefud-dev.conllu (1 MB)
    - CorefUD_Catalan-AnCora
      - ca_ancora-corefud-train.conllu (33 MB)
      - README.md (2 kB)
      - ca_ancora-corefud-dev.conllu (4 MB)
      - LICENSE.txt (189 B)
  - doc
    - corefud-1.0-format.pdf (160 kB)
  - README.txt (8 kB)
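The data files in the listing follow a visible naming pattern: a language code, an underscore, a corpus name, then `-corefud-` and the split (`train` or `dev`). The sketch below shows one way to collect the splits per dataset directory after unpacking the archive. The regular expression is inferred from the file names above rather than from an official specification, and the function names are hypothetical.

```python
import re
from pathlib import Path
from typing import Dict, Optional, Tuple

# Pattern inferred from the listed file names, e.g. "pl_pcc-corefud-dev.conllu".
FILENAME_RE = re.compile(
    r"^(?P<lang>[a-z]+)_(?P<corpus>[a-z0-9]+)-corefud-(?P<split>train|dev)\.conllu$"
)


def describe(filename: str) -> Optional[Tuple[str, str, str]]:
    """Return (language code, corpus, split), or None if the name does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return match.group("lang"), match.group("corpus"), match.group("split")


def find_splits(data_dir: str) -> Dict[str, Dict[str, Path]]:
    """Map each dataset directory name to its train/dev file paths."""
    found: Dict[str, Dict[str, Path]] = {}
    # Each dataset sits in its own subdirectory directly under the data directory.
    for path in sorted(Path(data_dir).glob("*/*.conllu")):
        parsed = describe(path.name)
        if parsed is None:
            continue  # ignore files that do not follow the naming pattern
        _, _, split = parsed
        found.setdefault(path.parent.name, {})[split] = path
    return found
```

Assuming the archive was unpacked in the current directory, `find_splits("CorefUD-1.0-public/data")` would return one entry per dataset directory, each with a `train` and a `dev` path. The public edition contains no test files, so the pattern deliberately covers only those two splits.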

