Coreference in Universal Dependencies 1.2 (CorefUD 1.2)

Name: Coreference in Universal Dependencies 1.2 (CorefUD 1.2)
License: https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-1.2

Popel, Martin; Novák, Michal; Žabokrtský, Zdeněk; Zeman, Daniel; Nedoluzhko, Anna; Acar, Kutay; Bamman, David; Bourgonje, Peter; Cinková, Silvie; Eckhoff, Hanne; Cebiroğlu Eryiğit, Gülşen; Hajič, Jan; Hardmeier, Christian; Haug, Dag; Jørgensen, Tollef; Kåsen, Andre; Krielke, Pauline; Landragin, Frédéric; Lapshinova-Koltunski, Ekaterina; Mæhlum, Petter; Martí, M. Antònia; Mikulová, Marie; Nøklestad, Anders; Ogrodniczuk, Maciej; Øvrelid, Lilja; Pamay Arslan, Tuğba; Recasens, Marta; Solberg, Per Erik; Stede, Manfred; Straka, Milan; Swanson, Daniel; Toldova, Svetlana; Vadász, Noémi; Velldal, Erik; Vincze, Veronika; Zeldes, Amir; Žitkus, Voldemaras

dc.contributor.author	Popel, Martin
dc.contributor.author	Novák, Michal
dc.contributor.author	Žabokrtský, Zdeněk
dc.contributor.author	Zeman, Daniel
dc.contributor.author	Nedoluzhko, Anna
dc.contributor.author	Acar, Kutay
dc.contributor.author	Bamman, David
dc.contributor.author	Bourgonje, Peter
dc.contributor.author	Cinková, Silvie
dc.contributor.author	Eckhoff, Hanne
dc.contributor.author	Cebiroğlu Eryiğit, Gülşen
dc.contributor.author	Hajič, Jan
dc.contributor.author	Hardmeier, Christian
dc.contributor.author	Haug, Dag
dc.contributor.author	Jørgensen, Tollef
dc.contributor.author	Kåsen, Andre
dc.contributor.author	Krielke, Pauline
dc.contributor.author	Landragin, Frédéric
dc.contributor.author	Lapshinova-Koltunski, Ekaterina
dc.contributor.author	Mæhlum, Petter
dc.contributor.author	Martí, M. Antònia
dc.contributor.author	Mikulová, Marie
dc.contributor.author	Nøklestad, Anders
dc.contributor.author	Ogrodniczuk, Maciej
dc.contributor.author	Øvrelid, Lilja
dc.contributor.author	Pamay Arslan, Tuğba
dc.contributor.author	Recasens, Marta
dc.contributor.author	Solberg, Per Erik
dc.contributor.author	Stede, Manfred
dc.contributor.author	Straka, Milan
dc.contributor.author	Swanson, Daniel
dc.contributor.author	Toldova, Svetlana
dc.contributor.author	Vadász, Noémi
dc.contributor.author	Velldal, Erik
dc.contributor.author	Vincze, Veronika
dc.contributor.author	Zeldes, Amir
dc.contributor.author	Žitkus, Voldemaras
dc.date.accessioned	2024-04-02T12:48:43Z
dc.date.available	2024-04-02T12:48:43Z
dc.date.issued	2024-03-28
dc.identifier.uri	http://hdl.handle.net/11234/1-5478
dc.description	CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, the current version 1.2 of CorefUD consists of 25 datasets for 16 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 21 datasets for 15 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 2 for Czech, 3 for English, 1 for French, 2 for German, 2 for Hungarian, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please familiarize yourself with its license (placed in the same directory as the data) and cite the original data resource as well. Compared to the previous version 1.1, version 1.2 adds new languages and corpora, namely Ancient_Greek-PROIEL, Ancient_Hebrew-PTNK, English-LitBank, and Old_Church_Slavonic-PROIEL. In addition, English-GUM and Turkish-ITCC have been updated to newer versions, the conversion of zeros in Polish-PCC has been improved, and the conversion pipelines for multiple other datasets have been refined (a list of all changes in each dataset can be found in the corresponding README file).
dc.language.iso	grc
dc.language.iso	hbo
dc.language.iso	cat
dc.language.iso	ces
dc.language.iso	eng
dc.language.iso	fra
dc.language.iso	deu
dc.language.iso	hun
dc.language.iso	lit
dc.language.iso	nor
dc.language.iso	chu
dc.language.iso	pol
dc.language.iso	rus
dc.language.iso	spa
dc.language.iso	tur
dc.publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation.replaces	http://hdl.handle.net/11234/1-5053
dc.rights	Licence CorefUD v1.2
dc.rights.uri	https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-1.2
dc.source.uri	https://ufal.mff.cuni.cz/corefud
dc.subject	coreference
dc.subject	bridging relations
dc.subject	harmonized annotation
dc.subject	dependency
dc.subject	treebank
dc.title	Coreference in Universal Dependencies 1.2 (CorefUD 1.2)
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
contact.person	Michal Novák mnovak@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor	Ministerstvo školství, mládeže a tělovýchovy České republiky LM2023062 LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy nationalFunds
sponsor	Grantová agentura České republiky GX20-16819X LUSyD – Language Understanding: from Syntax to Discourse nationalFunds
sponsor	TA ČR FW03010656 Multilingvální asistent pro hledání, analýzu a zpracování informací a podporu rozhodování nationalFunds
sponsor	UK UNCE/24/SSH/009 Multilingual Lens: Investigating Large Text Corpora from Different Methodological Perspectives nationalFunds
size.info	243161 sentences
size.info	4717104 words
size.info	4785411 tokens
files.size	87722711
files.count	1
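
The description above states that coreference- and bridging-specific information is stored as attribute-value pairs in the MISC column of the CoNLL-U files. The following is a minimal sketch of reading those pairs from a single token line, assuming the standard CoNLL-U layout (10 tab-separated columns, MISC last, pairs separated by `|`, `_` for an empty column). The function name `parse_misc` and the sample line are made up for illustration; they are not taken from the CorefUD data or tooling.

```python
def parse_misc(conllu_line: str) -> dict[str, str]:
    """Return the MISC column of one CoNLL-U token line as a dict."""
    columns = conllu_line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(columns)}")
    misc = columns[9]
    if misc == "_":  # "_" marks an empty column in CoNLL-U
        return {}
    pairs = {}
    for item in misc.split("|"):
        # Split on the first "=" only, since a value may itself contain "=".
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


# Hypothetical token line, invented for this example.
sample = "1\tMary\tMary\tPROPN\t_\t_\t2\tnsubj\t_\tEntity=(e1)|SpaceAfter=No"
print(parse_misc(sample))  # {'Entity': '(e1)', 'SpaceAfter': 'No'}
```

The exact attribute names and value syntax used for coreference and bridging are documented in the format description shipped in the `doc` directory of the archive; this sketch only separates keys from values and does not interpret them.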

Files in this item

License category:

Publicly Available

License: Licence CorefUD v1.2

Name: CorefUD-1.2-public.zip
Size: 83.66 MB
Format: application/zip
Description: data
MD5: 2cdb7e4724ee14a9d1b16d70eb692603
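
The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. A minimal sketch using Python's standard `hashlib` module follows; the helper names `md5_of_file` and `matches_record` are illustrative, and the local path passed to them is whatever location the archive was saved to.

```python
import hashlib

# Checksum as listed in this record for CorefUD-1.2-public.zip.
EXPECTED_MD5 = "2cdb7e4724ee14a9d1b16d70eb692603"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """Return True if the file's MD5 equals the checksum in this record."""
    return md5_of_file(path) == EXPECTED_MD5
```

For example, `matches_record("CorefUD-1.2-public.zip")` should return `True` for an uncorrupted download placed in the current directory. Reading in chunks keeps memory use small for an archive of this size.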

File preview

CorefUD-1.2-public
- data
  - CorefUD_Czech-PDT
    - README.md (4 kB)
    - cs_pdt-corefud-train.conllu (86 MB)
    - cs_pdt-corefud-dev.conllu (11 MB)
    - LICENSE.txt (20 kB)
  - CorefUD_Catalan-AnCora
    - ca_ancora-corefud-train.conllu (26 MB)
    - README.md (3 kB)
    - ca_ancora-corefud-dev.conllu (3 MB)
    - LICENSE.txt (189 B)
  - CorefUD_Hungarian-KorKor
    - hu_korkor-corefud-dev.conllu (210 kB)
    - README.md (2 kB)
    - LICENSE.txt (12 kB)
    - hu_korkor-corefud-train.conllu (1 MB)
  - CorefUD_Old_Church_Slavonic-PROIEL
    - cu_proiel-corefud-train.conllu (6 MB)
    - README.md (2 kB)
    - LICENSE.txt (20 kB)
    - cu_proiel-corefud-dev.conllu (1 MB)
  - CorefUD_English-ParCorFull
    - README.md (2 kB)
    - en_parcorfull-corefud-dev.conllu (71 kB)
    - en_parcorfull-corefud-train.conllu (561 kB)
    - LICENSE.txt (18 kB)
  - CorefUD_Norwegian-NynorskNARC
    - README.md (2 kB)
    - no_nynorsknarc-corefud-train.conllu (11 MB)
    - no_nynorsknarc-corefud-dev.conllu (1 MB)
    - LICENSE.txt (19 kB)
  - CorefUD_Hungarian-SzegedKoref
    - hu_szegedkoref-corefud-train.conllu (7 MB)
    - README.md (1 kB)
    - LICENSE.txt (18 kB)
    - hu_szegedkoref-corefud-dev.conllu (952 kB)
  - CorefUD_Czech-PCEDT
    - cs_pcedt-corefud-train.conllu (118 MB)
    - cs_pcedt-corefud-dev.conllu (20 MB)
    - README.md (2 kB)
    - LICENSE.txt (21 kB)
  - CorefUD_Ancient_Greek-PROIEL
    - README.md (2 kB)
    - LICENSE.txt (20 kB)
    - grc_proiel-corefud-train.conllu (7 MB)
    - grc_proiel-corefud-dev.conllu (499 kB)
  - CorefUD_Spanish-AnCora
    - README.md (3 kB)
    - es_ancora-corefud-train.conllu (29 MB)
    - es_ancora-corefud-dev.conllu (3 MB)
    - LICENSE.txt (189 B)
  - CorefUD_Polish-PCC
    - README.md (2 kB)
    - pl_pcc-corefud-dev.conllu (6 MB)
    - LICENSE.txt (19 kB)
    - pl_pcc-corefud-train.conllu (49 MB)
  - CorefUD_German-PotsdamCC
    - de_potsdamcc-corefud-train.conllu (2 MB)
    - README.md (2 kB)
    - de_potsdamcc-corefud-dev.conllu (364 kB)
    - LICENSE.txt (20 kB)
  - CorefUD_Russian-RuCor
    - ru_rucor-corefud-train.conllu (10 MB)
    - README.md (1 kB)
    - LICENSE.txt (19 kB)
    - ru_rucor-corefud-dev.conllu (1 MB)
  - CorefUD_Ancient_Hebrew-PTNK
    - hbo_ptnk-corefud-dev.conllu (1 MB)
    - README.md (2 kB)
    - hbo_ptnk-corefud-train.conllu (884 kB)
    - LICENSE.txt (18 kB)
  - CorefUD_English-GUM
    - en_gum-corefud-train.conllu (15 MB)
    - README.md (1 kB)
    - en_gum-corefud-dev.conllu (2 MB)
    - LICENSE.txt (5 kB)
  - CorefUD_English-LitBank
    - en_litbank-corefud-dev.conllu (1 MB)
    - en_litbank-corefud-train.conllu (9 MB)
    - README.md (2 kB)
    - LICENSE.txt (18 kB)
  - CorefUD_Lithuanian-LCC
    - README.md (2 kB)
    - lt_lcc-corefud-train.conllu (2 MB)
    - LICENSE.txt (1 kB)
    - lt_lcc-corefud-dev.conllu (298 kB)
  - CorefUD_French-Democrat
    - README.md (2 kB)
    - fr_democrat-corefud-dev.conllu (1 MB)
    - fr_democrat-corefud-train.conllu (14 MB)
    - LICENSE.txt (19 kB)
  - CorefUD_Turkish-ITCC
    - tr_itcc-corefud-dev.conllu (521 kB)
    - README.md (4 kB)
    - LICENSE.txt (20 kB)
    - tr_itcc-corefud-train.conllu (4 MB)
    - README_old.md (3 kB)
  - CorefUD_German-ParCorFull
    - README.md (2 kB)
    - LICENSE.txt (18 kB)
    - de_parcorfull-corefud-dev.conllu (88 kB)
    - de_parcorfull-corefud-train.conllu (690 kB)
  - CorefUD_Norwegian-BokmaalNARC
    - no_bokmaalnarc-corefud-train.conllu (13 MB)
    - README.md (2 kB)
    - no_bokmaalnarc-corefud-dev.conllu (1 MB)
    - LICENSE.txt (19 kB)
- doc
  - corefud-1.0-format.pdf (160 kB)
  - README.txt (12 kB)
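
The listing above follows a regular layout: each dataset has its own `CorefUD_<Language>-<Corpus>` directory under `data`, containing one `*-corefud-train.conllu` file and one `*-corefud-dev.conllu` file. A minimal sketch of collecting those files after unpacking the archive is shown below; the function name `list_splits` is hypothetical, and the pattern is inferred from the file names in this preview rather than from any official tooling.

```python
from pathlib import Path


def list_splits(root):
    """Map each dataset directory name to a {split: path} dict.

    `root` is the unpacked archive directory (the one containing `data`).
    """
    splits = {}
    pattern = "data/CorefUD_*/*-corefud-*.conllu"
    for path in sorted(Path(root).glob(pattern)):
        # File stems look like "cs_pdt-corefud-train"; the last
        # hyphen-separated part is the split name.
        split = path.stem.rsplit("-", 1)[1]
        splits.setdefault(path.parent.name, {})[split] = path
    return splits
```

Calling `list_splits("CorefUD-1.2-public")` on the unpacked public edition would be expected to return one entry per dataset directory, each with a `train` and a `dev` path, since the public edition excludes the test data.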