Nedoluzhko, Anna; Novák, Michal; Popel, Martin; Žabokrtský, Zdeněk; Zeman, Daniel
2021-12-14T14:19:46Z 2021-12-14T14:19:46Z 2021-12-10
dc.description CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 0.2 consists of 17 datasets for 11 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 13 datasets for 10 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 1 for Hungarian, 1 for Lithuanian, 1 for Polish, 1 for Russian, and 1 for Spanish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please review its license (placed in the same directory as the data) and cite the original data resource too. Version 0.2 consists of exactly the same datasets as version 0.1. All automatically parsed datasets were re-parsed for v0.2 using UDPipe 2 with models trained on UD 2.6. Catalan-AnCora, Spanish-AnCora and English-GUM have been updated to match their UD 2.9 versions.
dc.language.iso cat
dc.language.iso ces
dc.language.iso nld
dc.language.iso eng
dc.language.iso fra
dc.language.iso deu
dc.language.iso hun
dc.language.iso lit
dc.language.iso pol
dc.language.iso rus
dc.language.iso spa
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation info:eu-repo/grantAgreement/EC/H2020/825303
dc.rights Licence CorefUD v0.2
dc.subject dependency
dc.subject treebank
dc.subject coreference
dc.subject bridging relations
dc.subject harmonized annotation
dc.title Coreference in Universal Dependencies 0.2 (CorefUD 0.2)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
contact.person Michal Novák Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor Ministerstvo školství, mládeže a tělovýchovy České republiky LM2018101 LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy nationalFunds
sponsor Grantová agentura České republiky GX20-16819X LUSyD – Language Understanding: from Syntax to Discourse nationalFunds
sponsor Grantová agentura České Republiky 19-14534S Popis slovotvorné struktury českých slov na základě jazykových dat nationalFunds
sponsor European Union EC/H2020/825303 Bergamot - Browser-based Multilingual Translation euFunds info:eu-repo/grantAgreement/EC/H2020/825303
189767 sentences, 3965446 words, 4016574 tokens
files.size 74160228
files.count 1

 Files in this item

This item is Publicly Available and licensed under Licence CorefUD v0.2, distributed under Creative Commons.
70.72 MB
 File Preview  
  • CorefUD-0.2-public
    • data
      • CorefUD_Czech-PCEDT
        • cs_pcedt-corefud-train.conllu (104 MB)
        • cs_pcedt-corefud-dev.conllu (15 MB)
        • README.md (1 kB)
        • LICENSE.txt (21 kB)
      • CorefUD_Polish-PCC
        • README.md (1 kB)
        • pl_pcc-corefud-dev.conllu (14 MB)
        • LICENSE.txt (19 kB)
        • pl_pcc-corefud-train.conllu (113 MB)
      • CorefUD_French-Democrat
        • README.md (1 kB)
        • fr_democrat-corefud-dev.conllu (1 MB)
        • fr_democrat-corefud-train.conllu (15 MB)
        • LICENSE.txt (19 kB)
      • CorefUD_Lithuanian-LCC
        • README.md (1 kB)
        • lt_lcc-corefud-train.conllu (2 MB)
        • LICENSE.txt (1 kB)
        • lt_lcc-corefud-dev.conllu (304 kB)
      • CorefUD_German-PotsdamCC
        • de_potsdamcc-corefud-train.conllu (2 MB)
        • README.md (1 kB)
        • de_potsdamcc-corefud-dev.conllu (366 kB)
        • LICENSE.txt (20 kB)
      • CorefUD_English-ParCorFull
        • README.md (1 kB)
        • en_parcorfull-corefud-dev.conllu (70 kB)
        • en_parcorfull-corefud-train.conllu (548 kB)
        • LICENSE.txt (18 kB)
      • CorefUD_German-ParCorFull
        • README.md (1 kB)
        • LICENSE.txt (18 kB)
        • de_parcorfull-corefud-dev.conllu (90 kB)
        • de_parcorfull-corefud-train.conllu (707 kB)
      • CorefUD_Czech-PDT
        • README.md (1 kB)
        • cs_pdt-corefud-train.conllu (79 MB)
        • cs_pdt-corefud-dev.conllu (10 MB)
        • LICENSE.txt (20 kB)
      • CorefUD_Russian-RuCor
        • ru_rucor-corefud-train.conllu (10 MB)
        • README.md (1 kB)
        • LICENSE.txt (19 kB)
        • ru_rucor-corefud-dev.conllu (1 MB)
      • CorefUD_English-GUM
        • en_gum-corefud-train.conllu (9 MB)
        • README.md (1 kB)
        • en_gum-corefud-dev.conllu (1 MB)
        • LICENSE.txt (3 kB)
      • CorefUD_Spanish-AnCora
        • README.md (1 kB)
        • es_ancora-corefud-train.conllu (38 MB)
        • es_ancora-corefud-dev.conllu (4 MB)
        • LICENSE.txt (189 B)
      • CorefUD_Catalan-AnCora
        • ca_ancora-corefud-train.conllu (35 MB)
        • README.md (1 kB)
        • ca_ancora-corefud-dev.conllu (4 MB)
        • LICENSE.txt (189 B)
      • CorefUD_Hungarian-SzegedKoref
        • hu_szegedkoref-corefud-train.conllu (9 MB)
        • README.md (1 kB)
        • LICENSE.txt (18 kB)
        • hu_szegedkoref-corefud-dev.conllu (1 MB)
    • doc
      • File-format-description.pdf (167 kB)
      • README.txt (6 kB)
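
The description states that coreference and bridging information is stored as attribute-value pairs in the MISC column of the CoNLL-U files. The following is a minimal Python sketch of how such pairs could be read, assuming only the general CoNLL-U layout (10 tab-separated columns, MISC last, pairs separated by `|`). The attribute name `ExampleAttr` and the sample sentence are hypothetical placeholders, not names or data from CorefUD; the actual attributes are documented in doc/File-format-description.pdf.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_misc(misc: str) -> Dict[str, str]:
    """Split a CoNLL-U MISC value into attribute-value pairs."""
    if not misc or misc == "_":  # "_" marks an empty column in CoNLL-U
        return {}
    pairs: Dict[str, str] = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""  # tolerate items without "="
    return pairs


def iter_token_misc(lines: Iterable[str]) -> Iterator[Tuple[str, str, Dict[str, str]]]:
    """Yield (ID, FORM, MISC pairs) for every token line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # sentence breaks and comments
            continue
        cols = line.split("\t")
        if len(cols) != 10:  # skip anything that is not a full token line
            continue
        yield cols[0], cols[1], parse_misc(cols[9])


# Hypothetical sample; "ExampleAttr" is a placeholder, not a CorefUD attribute.
sample = [
    "# sent_id = example-1",
    "1\tMary\tMary\tPROPN\t_\t_\t2\tnsubj\t_\tExampleAttr=x1",
    "2\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
]
rows = list(iter_token_misc(sample))
```

To apply the same sketch to one of the files listed above, the open file object can be passed directly, for example `iter_token_misc(open("en_gum-corefud-dev.conllu", encoding="utf-8"))`.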
