dc.contributor.author |
Popel, Martin |
dc.contributor.author |
Novák, Michal |
dc.contributor.author |
Žabokrtský, Zdeněk |
dc.contributor.author |
Zeman, Daniel |
dc.contributor.author |
Nedoluzhko, Anna |
dc.contributor.author |
Acar, Kutay |
dc.contributor.author |
Bamman, David |
dc.contributor.author |
Bourgonje, Peter |
dc.contributor.author |
Cinková, Silvie |
dc.contributor.author |
Eckhoff, Hanne |
dc.contributor.author |
Cebiroğlu Eryiğit, Gülşen |
dc.contributor.author |
Hajič, Jan |
dc.contributor.author |
Hardmeier, Christian |
dc.contributor.author |
Haug, Dag |
dc.contributor.author |
Jørgensen, Tollef |
dc.contributor.author |
Kåsen, Andre |
dc.contributor.author |
Krielke, Pauline |
dc.contributor.author |
Landragin, Frédéric |
dc.contributor.author |
Lapshinova-Koltunski, Ekaterina |
dc.contributor.author |
Mæhlum, Petter |
dc.contributor.author |
Martí, M. Antònia |
dc.contributor.author |
Mikulová, Marie |
dc.contributor.author |
Nøklestad, Anders |
dc.contributor.author |
Ogrodniczuk, Maciej |
dc.contributor.author |
Øvrelid, Lilja |
dc.contributor.author |
Pamay Arslan, Tuğba |
dc.contributor.author |
Recasens, Marta |
dc.contributor.author |
Solberg, Per Erik |
dc.contributor.author |
Stede, Manfred |
dc.contributor.author |
Straka, Milan |
dc.contributor.author |
Swanson, Daniel |
dc.contributor.author |
Toldova, Svetlana |
dc.contributor.author |
Vadász, Noémi |
dc.contributor.author |
Velldal, Erik |
dc.contributor.author |
Vincze, Veronika |
dc.contributor.author |
Zeldes, Amir |
dc.contributor.author |
Žitkus, Voldemaras |
dc.date.accessioned |
2024-04-02T12:48:43Z |
dc.date.available |
2024-04-02T12:48:43Z |
dc.date.issued |
2024-03-28 |
dc.identifier.uri |
http://hdl.handle.net/11234/1-5478 |
dc.description |
CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In its current version 1.2, CorefUD consists of 25 datasets for 16 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The public edition is distributed via LINDAT-CLARIAH-CZ and contains 21 datasets for 15 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 2 for Czech, 3 for English, 1 for French, 2 for German, 2 for Hungarian, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish); it does not include the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource, too. Compared to the previous version 1.1, version 1.2 adds new languages and corpora, namely Ancient_Greek-PROIEL, Ancient_Hebrew-PTNK, English-LitBank, and Old_Church_Slavonic-PROIEL. In addition, English-GUM and Turkish-ITCC have been updated to newer versions, the conversion of zeros in Polish-PCC has been improved, and the conversion pipelines for multiple other datasets have been refined (a list of all changes in each dataset can be found in the corresponding README file). |
dc.language.iso |
grc |
dc.language.iso |
hbo |
dc.language.iso |
cat |
dc.language.iso |
ces |
dc.language.iso |
eng |
dc.language.iso |
fra |
dc.language.iso |
deu |
dc.language.iso |
hun |
dc.language.iso |
lit |
dc.language.iso |
nor |
dc.language.iso |
chu |
dc.language.iso |
pol |
dc.language.iso |
rus |
dc.language.iso |
spa |
dc.language.iso |
tur |
dc.publisher |
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
dc.relation.replaces |
http://hdl.handle.net/11234/1-5053 |
dc.rights |
Licence CorefUD v1.2 |
dc.rights.uri |
https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-1.2 |
dc.source.uri |
https://ufal.mff.cuni.cz/corefud |
dc.subject |
coreference |
dc.subject |
bridging relations |
dc.subject |
harmonized annotation |
dc.subject |
dependency |
dc.subject |
treebank |
dc.title |
Coreference in Universal Dependencies 1.2 (CorefUD 1.2) |
dc.type |
corpus |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
dc.rights.label |
PUB |
has.files |
yes |
branding |
LINDAT / CLARIAH-CZ |
contact.person |
Michal Novák mnovak@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
sponsor |
Ministerstvo školství, mládeže a tělovýchovy České republiky LM2023062 LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy nationalFunds |
sponsor |
Grantová agentura České republiky GX20-16819X LUSyD – Language Understanding: from Syntax to Discourse nationalFunds |
sponsor |
TA ČR FW03010656 Multilingvální asistent pro hledání, analýzu a zpracování informací a podporu rozhodování nationalFunds |
sponsor |
UK UNCE/24/SSH/009 Multilingual Lens: Investigating Large Text Corpora from Different Methodological Perspectives nationalFunds |
size.info |
243161 sentences |
size.info |
4717104 words |
size.info |
4785411 tokens |
files.size |
87722711 |
files.count |
1 |