dc.contributor.author Nedoluzhko, Anna
dc.contributor.author Novák, Michal
dc.contributor.author Popel, Martin
dc.contributor.author Žabokrtský, Zdeněk
dc.contributor.author Zeman, Daniel
dc.date.accessioned 2021-03-11T22:47:55Z
dc.date.available 2021-03-11T22:47:55Z
dc.date.issued 2021-03-11
dc.identifier.uri http://hdl.handle.net/11234/1-3510
dc.description CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 0.1 consists of 17 datasets for 11 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column.

The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 13 datasets for 10 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 1 for Hungarian, 1 for Lithuanian, 1 for Polish, 1 for Russian, and 1 for Spanish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets.

When using any of the harmonized datasets, please read its license (placed in the same directory as the data) and cite the original data resource too.

References to original resources whose harmonized versions are contained in the public edition of CorefUD 0.1:
- Catalan-AnCora: Recasens, M. and Martí, M. A. (2010). AnCora-CO: Coreferentially Annotated Corpora for Spanish and Catalan. Language Resources and Evaluation, 44(4):315–345.
- Czech-PCEDT: Nedoluzhko, A., Novák, M., Cinková, S., Mikulová, M., and Mírovský, J. (2016). Coreference in Prague Czech-English Dependency Treebank. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 169–176, Portorož, Slovenia. European Language Resources Association.
- Czech-PDT: Hajič, J., Bejček, E., Hlaváčová, J., Mikulová, M., Straka, M., Štěpánek, J., and Štěpánková, B. (2020). Prague Dependency Treebank - Consolidated 1.0. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pages 5208–5218, Marseille, France. European Language Resources Association.
- English-GUM: Zeldes, A. (2017). The GUM Corpus: Creating Multilayer Resources in the Classroom. Language Resources and Evaluation, 51(3):581–612.
- English-ParCorFull: Lapshinova-Koltunski, E., Hardmeier, C., and Krielke, P. (2018). ParCorFull: a Parallel Corpus Annotated with Full Coreference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association.
- French-Democrat: Landragin, F. (2016). Description, modélisation et détection automatique des chaînes de référence (DEMOCRAT). Bulletin de l’Association Française pour l’Intelligence Artificielle, (92):11–15.
- German-ParCorFull: Lapshinova-Koltunski, E., Hardmeier, C., and Krielke, P. (2018). ParCorFull: a Parallel Corpus Annotated with Full Coreference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association.
- German-PotsdamCC: Bourgonje, P. and Stede, M. (2020). The Potsdam Commentary Corpus 2.2: Extending annotations for shallow discourse parsing. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1061–1066, Marseille, France. European Language Resources Association.
- Hungarian-SzegedKoref: Vincze, V., Hegedűs, K., Sliz-Nagy, A., and Farkas, R. (2018). SzegedKoref: A Hungarian Coreference Corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association.
- Lithuanian-LCC: Žitkus, V. and Butkienė, R. (2018). Coreference Annotation Scheme and Corpus for Lithuanian Language. In Fifth International Conference on Social Networks Analysis, Management and Security, SNAMS 2018, Valencia, Spain, October 15-18, 2018, pages 243–250. IEEE.
- Polish-PCC: Ogrodniczuk, M., Glowińska, K., Kopeć, M., Savary, A., and Zawisławska, M. (2013). Polish coreference corpus. In Human Language Technology. Challenges for Computer Science and Linguistics - 6th Language and Technology Conference, LTC 2013, Poznań, Poland, December 7-9, 2013. Revised Selected Papers, volume 9561 of Lecture Notes in Computer Science, pages 215–226. Springer.
- Russian-RuCor: Toldova, S., Roytberg, A., Ladygina, A. A., Vasilyeva, M. D., Azerkovich, I. L., Kurzukov, M., Sim, G., Gorshkov, D. V., Ivanova, A., Nedoluzhko, A., and Grishina, Y. (2014). Evaluating Anaphora and Coreference Resolution for Russian. In Komp’juternaja lingvistika i intellektual’nye tehnologii. Po materialam ezhegodnoj Mezhdunarodnoj konferencii Dialog, pages 681–695.
- Spanish-AnCora: Recasens, M. and Martí, M. A. (2010). AnCora-CO: Coreferentially Annotated Corpora for Spanish and Catalan. Language Resources and Evaluation, 44(4):315–345.

References to original resources whose harmonized versions are contained in the ÚFAL-internal edition of CorefUD 0.1:
- Dutch-COREA: Hendrickx, I., Bouma, G., Coppens, F., Daelemans, W., Hoste, V., Kloosterman, G., Mineur, A.-M., Van Der Vloet, J., and Verschelde, J.-L. (2008). A coreference corpus and resolution system for Dutch. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco. European Language Resources Association.
- English-ARRAU: Uryupina, O., Artstein, R., Bristot, A., Cavicchio, F., Delogu, F., Rodriguez, K. J., and Poesio, M. (2020). Annotating a broad range of anaphoric phenomena, in a variety of genres: the ARRAU Corpus. Natural Language Engineering, 26(1):95–128.
- English-OntoNotes: Weischedel, R., Hovy, E., Marcus, M., Palmer, M., Belvin, R., Pradhan, S., Ramshaw, L., and Xue, N. (2011). Ontonotes: A large training corpus for enhanced processing. In Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation, pages 54–63, New York. Springer-Verlag.
- English-PCEDT: Nedoluzhko, A., Novák, M., Cinková, S., Mikulová, M., and Mírovský, J. (2016). Coreference in Prague Czech-English Dependency Treebank. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 169–176, Portorož, Slovenia. European Language Resources Association.
dc.language.iso cat
dc.language.iso ces
dc.language.iso nld
dc.language.iso eng
dc.language.iso fra
dc.language.iso deu
dc.language.iso hun
dc.language.iso lit
dc.language.iso pol
dc.language.iso rus
dc.language.iso spa
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation info:eu-repo/grantAgreement/EC/H2020/825303
dc.relation.isreplacedby http://hdl.handle.net/11234/1-4598
dc.rights Licence CorefUD v0.1
dc.rights.uri https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-0.1
dc.source.uri https://ufal.mff.cuni.cz/corefud
dc.subject dependency
dc.subject treebank
dc.subject coreference
dc.subject bridging relations
dc.subject harmonized annotation
dc.title Coreference in Universal Dependencies 0.1 (CorefUD 0.1)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Michal Novák mnovak@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
contact.person Zdeněk Žabokrtský zabokrtsky@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor Ministerstvo školství, mládeže a tělovýchovy České republiky LM2018101 LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy nationalFunds
sponsor Grantová agentura České republiky GX20-16819X LUSyD – Language Understanding: from Syntax to Discourse nationalFunds
sponsor Grantová agentura České Republiky 19-14534S Popis slovotvorné struktury českých slov na základě jazykových dat nationalFunds
sponsor European Union EC/H2020/825303 Bergamot - Browser-based Multilingual Translation euFunds info:eu-repo/grantAgreement/EC/H2020/825303
size.info 189777 sentences
size.info 3865136 tokens
size.info 3874322 words
files.size 69974488
files.count 1


 Files in this item

This item is Publicly Available and licensed under: Licence CorefUD v0.1
GNU General Public License, version 3.0 Distributed under Creative Commons
Name CorefUD-0.1-public.zip
Size 66.73 MB
Format application/zip
Description data
MD5 8724a9b0e11c56bdb24626debe2a1524
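
The MD5 value listed above can be used to check that the archive was downloaded intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `archive_is_intact` are illustrative and not part of the repository, and the archive path you pass in is whatever location you saved the download to.

```python
import hashlib

# Checksum published in this record for CorefUD-0.1-public.zip.
EXPECTED_MD5 = "8724a9b0e11c56bdb24626debe2a1524"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Compare the file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

For example, `archive_is_intact("CorefUD-0.1-public.zip")` should return `True` for an uncorrupted download saved under that name in the current directory. MD5 only guards against accidental corruption here; it is not a security guarantee.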
File Preview
  • CorefUD-0.1-public
    • data
      • CorefUD_Czech-PCEDT
        • cs_pcedt-corefud-train.conllu (104 MB)
        • cs_pcedt-corefud-dev.conllu (15 MB)
        • README.md (1 kB)
        • LICENSE.txt (21 kB)
      • CorefUD_Polish-PCC
        • README.md (1 kB)
        • pl_pcc-corefud-dev.conllu (14 MB)
        • LICENSE.txt (19 kB)
        • pl_pcc-corefud-train.conllu (113 MB)
      • CorefUD_French-Democrat
        • README.md (1 kB)
        • fr_democrat-corefud-dev.conllu (1 MB)
        • fr_democrat-corefud-train.conllu (15 MB)
        • LICENSE.txt (19 kB)
      • CorefUD_Lithuanian-LCC
        • README.md (1 kB)
        • lt_lcc-corefud-train.conllu (2 MB)
        • LICENSE.txt (1 kB)
        • lt_lcc-corefud-dev.conllu (302 kB)
      • CorefUD_German-PotsdamCC
        • de_potsdamcc-corefud-train.conllu (2 MB)
        • README.md (1 kB)
        • de_potsdamcc-corefud-dev.conllu (366 kB)
        • LICENSE.txt (20 kB)
      • CorefUD_German-ParCorFull
        • README.md (1 kB)
        • LICENSE.txt (18 kB)
        • de_parcorfull-corefud-dev.conllu (90 kB)
        • de_parcorfull-corefud-train.conllu (708 kB)
      • CorefUD_English-ParCorFull
        • README.md (1 kB)
        • en_parcorfull-corefud-dev.conllu (70 kB)
        • en_parcorfull-corefud-train.conllu (547 kB)
        • LICENSE.txt (18 kB)
      • CorefUD_Czech-PDT
        • README.md (1 kB)
        • cs_pdt-corefud-train.conllu (79 MB)
        • cs_pdt-corefud-dev.conllu (10 MB)
        • LICENSE.txt (20 kB)
      • CorefUD_Russian-RuCor
        • ru_rucor-corefud-train.conllu (10 MB)
        • README.md (1 kB)
        • LICENSE.txt (19 kB)
        • ru_rucor-corefud-dev.conllu (1 MB)
      • CorefUD_English-GUM
        • en_gum-corefud-train.conllu (7 MB)
        • README.md (1 kB)
        • en_gum-corefud-dev.conllu (1 MB)
        • LICENSE.txt (3 kB)
      • CorefUD_Spanish-AnCora
        • README.md (1 kB)
        • es_ancora-corefud-train.conllu (30 MB)
        • es_ancora-corefud-dev.conllu (3 MB)
        • LICENSE.txt (34 kB)
      • CorefUD_Catalan-AnCora
        • ca_ancora-corefud-train.conllu (27 MB)
        • README.md (1 kB)
        • ca_ancora-corefud-dev.conllu (3 MB)
        • LICENSE.txt (34 kB)
      • CorefUD_Hungarian-SzegedKoref
        • hu_szegedkoref-corefud-train.conllu (9 MB)
        • README.md (1 kB)
        • LICENSE.txt (18 kB)
        • hu_szegedkoref-corefud-dev.conllu (1 MB)
    • doc
      • File-format-description.pdf (167 kB)
      • README.txt (6 kB)
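
The `.conllu` files listed above follow the CoNLL-U format, and the record's description says the coreference and bridging information is stored as attribute-value pairs in the MISC column. The sketch below shows one way to read those pairs in Python, assuming the standard CoNLL-U layout (10 tab-separated columns, MISC last, items separated by `|`, `_` for empty). The attribute name `HypotheticalAttr` in the sample is a placeholder invented for illustration; the actual CorefUD attribute names are documented in `doc/File-format-description.pdf` and are not assumed here.

```python
def parse_misc(misc):
    """Split a CoNLL-U MISC field into a dict of attribute-value pairs."""
    if not misc or misc == "_":
        return {}
    attrs = {}
    for item in misc.split("|"):
        # Items are usually Key=Value; a bare key is stored with value None.
        key, sep, value = item.partition("=")
        attrs[key] = value if sep else None
    return attrs


def iter_tokens(lines):
    """Yield (id, form, misc_dict) for each token line of a CoNLL-U stream."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines such as "# sent_id = ...".
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        yield cols[0], cols[1], parse_misc(cols[9])


# Made-up sample; HypotheticalAttr is NOT a real CorefUD attribute name.
sample = [
    "# sent_id = demo-1",
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\tHypotheticalAttr=x1",
    "2\tsmiled\tsmile\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No|HypotheticalAttr=x2",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
]
tokens = list(iter_tokens(sample))
```

To process a real file, pass an open file handle instead of `sample`, for example `iter_tokens(open("en_gum-corefud-dev.conllu", encoding="utf-8"))`. Reading line by line keeps memory use low for the larger training files.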
