Coreference in Universal Dependencies 0.2 (CorefUD 0.2)

Name: Coreference in Universal Dependencies 0.2 (CorefUD 0.2)
License: https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-0.2

Nedoluzhko, Anna; Novák, Michal; Popel, Martin; Žabokrtský, Zdeněk; Zeman, Daniel

dc.contributor.author	Nedoluzhko, Anna
dc.contributor.author	Novák, Michal
dc.contributor.author	Popel, Martin
dc.contributor.author	Žabokrtský, Zdeněk
dc.contributor.author	Zeman, Daniel
dc.date.accessioned	2021-12-14T14:19:46Z
dc.date.available	2021-12-14T14:19:46Z
dc.date.issued	2021-12-10
dc.identifier.uri	http://hdl.handle.net/11234/1-4598
dc.description	CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 0.2 consists of 17 datasets for 11 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 13 datasets for 10 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 1 for Hungarian, 1 for Lithuanian, 1 for Polish, 1 for Russian, and 1 for Spanish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute because of their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too. Version 0.2 consists of exactly the same datasets as version 0.1. All automatically parsed datasets were re-parsed for v0.2 using UDPipe 2 with models trained on UD 2.6. Catalan-AnCora, Spanish-AnCora and English-GUM have been updated to match their UD 2.9 versions.
dc.language.iso	cat
dc.language.iso	ces
dc.language.iso	nld
dc.language.iso	eng
dc.language.iso	fra
dc.language.iso	deu
dc.language.iso	hun
dc.language.iso	lit
dc.language.iso	pol
dc.language.iso	rus
dc.language.iso	spa
dc.publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation	info:eu-repo/grantAgreement/EC/H2020/825303
dc.relation.replaces	http://hdl.handle.net/11234/1-3510
dc.relation.isreplacedby	http://hdl.handle.net/11234/1-4698
dc.rights	Licence CorefUD v0.2
dc.rights.uri	https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-0.2
dc.source.uri	https://ufal.mff.cuni.cz/corefud
dc.subject	dependency
dc.subject	treebank
dc.subject	coreference
dc.subject	bridging relations
dc.subject	harmonized annotation
dc.title	Coreference in Universal Dependencies 0.2 (CorefUD 0.2)
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
contact.person	Michal Novák mnovak@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor	Ministerstvo školství, mládeže a tělovýchovy České republiky LM2018101 LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy nationalFunds
sponsor	Grantová agentura České republiky GX20-16819X LUSyD – Language Understanding: from Syntax to Discourse nationalFunds
sponsor	Grantová agentura České Republiky 19-14534S Popis slovotvorné struktury českých slov na základě jazykových dat nationalFunds
sponsor	European Union EC/H2020/825303 Bergamot - Browser-based Multilingual Translation euFunds info:eu-repo/grantAgreement/EC/H2020/825303
size.info	189767 sentences
size.info	3965446 words
size.info	4016574 tokens
files.size	74160228
files.count	1
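
The description above states that coreference and bridging information is stored as attribute-value pairs in the MISC column of the CoNLL-U files. The following Python sketch shows one way such a column could be read, assuming only the general CoNLL-U conventions (10 tab-separated columns, MISC as the last one, pairs joined by "|", and "_" for an empty value). The attribute name ExampleCorefAttr and the sample token line are hypothetical placeholders; the actual attribute names used by CorefUD 0.2 are documented in doc/File-format-description.pdf inside the archive.

```python
def parse_misc(misc):
    """Split a CoNLL-U MISC value into a dict of attribute-value pairs."""
    if misc == "_":  # "_" marks an empty column in CoNLL-U
        return {}
    pairs = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        # Items without "=" are kept with a value of None.
        pairs[key] = value if sep else None
    return pairs


def misc_of_token_line(line):
    """Return the parsed MISC column of one CoNLL-U token line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated columns, got %d" % len(cols))
    return parse_misc(cols[9])


# Hypothetical token line; ExampleCorefAttr is NOT a real CorefUD attribute name.
sample = "1\tShe\tshe\tPRON\tPRP\t_\t2\tnsubj\t_\tExampleCorefAttr=c1|SpaceAfter=No"
print(misc_of_token_line(sample))
```

Comment lines (starting with "#") and blank lines separating sentences would need to be skipped before calling misc_of_token_line on a real file.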

Files in this item

This item is Publicly Available and licensed under: Licence CorefUD v0.2

Name: CorefUD-0.2-public.zip
Size: 70.72 MB
Format: application/zip
Description: data
MD5: aeb2eb61761e9c27a52f8d12c816e72a
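
The MD5 value listed above can be used to check that the archive was downloaded intact. Below is a minimal Python sketch using the standard hashlib module; the local path is an assumption (the file is taken to be saved under its listed name in the current directory), and the expected digest is copied from this record.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Values copied from the record; the local path is an assumption.
ARCHIVE_PATH = "CorefUD-0.2-public.zip"
EXPECTED_MD5 = "aeb2eb61761e9c27a52f8d12c816e72a"

if __name__ == "__main__" and os.path.exists(ARCHIVE_PATH):
    actual = md5_of_file(ARCHIVE_PATH)
    print("OK" if actual == EXPECTED_MD5 else "MISMATCH: " + actual)
```

Reading in chunks keeps memory use low regardless of archive size. MD5 is adequate here for detecting a corrupted download, not for guarding against deliberate tampering.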

File Preview

CorefUD-0.2-public
- data
  - CorefUD_Czech-PCEDT
    - cs_pcedt-corefud-train.conllu (104 MB)
    - cs_pcedt-corefud-dev.conllu (15 MB)
    - README.md (1 kB)
    - LICENSE.txt (21 kB)
  - CorefUD_Polish-PCC
    - README.md (1 kB)
    - pl_pcc-corefud-dev.conllu (14 MB)
    - LICENSE.txt (19 kB)
    - pl_pcc-corefud-train.conllu (113 MB)
  - CorefUD_French-Democrat
    - README.md (1 kB)
    - fr_democrat-corefud-dev.conllu (1 MB)
    - fr_democrat-corefud-train.conllu (15 MB)
    - LICENSE.txt (19 kB)
  - CorefUD_Lithuanian-LCC
    - README.md (1 kB)
    - lt_lcc-corefud-train.conllu (2 MB)
    - LICENSE.txt (1 kB)
    - lt_lcc-corefud-dev.conllu (304 kB)
  - CorefUD_German-PotsdamCC
    - de_potsdamcc-corefud-train.conllu (2 MB)
    - README.md (1 kB)
    - de_potsdamcc-corefud-dev.conllu (366 kB)
    - LICENSE.txt (20 kB)
  - CorefUD_English-ParCorFull
    - README.md (1 kB)
    - en_parcorfull-corefud-dev.conllu (70 kB)
    - en_parcorfull-corefud-train.conllu (548 kB)
    - LICENSE.txt (18 kB)
  - CorefUD_German-ParCorFull
    - README.md (1 kB)
    - LICENSE.txt (18 kB)
    - de_parcorfull-corefud-dev.conllu (90 kB)
    - de_parcorfull-corefud-train.conllu (707 kB)
  - CorefUD_Czech-PDT
    - README.md (1 kB)
    - cs_pdt-corefud-train.conllu (79 MB)
    - cs_pdt-corefud-dev.conllu (10 MB)
    - LICENSE.txt (20 kB)
  - CorefUD_Russian-RuCor
    - ru_rucor-corefud-train.conllu (10 MB)
    - README.md (1 kB)
    - LICENSE.txt (19 kB)
    - ru_rucor-corefud-dev.conllu (1 MB)
  - CorefUD_English-GUM
    - en_gum-corefud-train.conllu (9 MB)
    - README.md (1 kB)
    - en_gum-corefud-dev.conllu (1 MB)
    - LICENSE.txt (3 kB)
  - CorefUD_Spanish-AnCora
    - README.md (1 kB)
    - es_ancora-corefud-train.conllu (38 MB)
    - es_ancora-corefud-dev.conllu (4 MB)
    - LICENSE.txt (189 B)
  - CorefUD_Catalan-AnCora
    - ca_ancora-corefud-train.conllu (35 MB)
    - README.md (1 kB)
    - ca_ancora-corefud-dev.conllu (4 MB)
    - LICENSE.txt (189 B)
  - CorefUD_Hungarian-SzegedKoref
    - hu_szegedkoref-corefud-train.conllu (9 MB)
    - README.md (1 kB)
    - LICENSE.txt (18 kB)
    - hu_szegedkoref-corefud-dev.conllu (1 MB)
- doc
  - File-format-description.pdf (167 kB)
  - README.txt (6 kB)
