
 
dc.contributor.author Rosa, Rudolf
dc.contributor.author Zouhar, Vilém
dc.date.accessioned 2022-11-11T16:09:44Z
dc.date.available 2022-11-11T16:09:44Z
dc.date.issued 2022-11-11
dc.identifier.uri http://hdl.handle.net/11234/1-4922
dc.description This is a parallel corpus of Czech and (mostly) English abstracts of scientific papers and presentations published by authors from the Institute of Formal and Applied Linguistics, Charles University in Prague. For each publication record, the authors are required to provide both the original abstract (in Czech or English) and its translation (into English or Czech) in the internal Biblio system. The data was filtered to remove duplicates and records with missing entries, so that every record is bilingual. Additionally, records of published papers that are indexed by SemanticScholar contain the respective link. The dataset was created from a September 2022 image of the Biblio database and is stored in JSONL format, with each line corresponding to one record.
dc.language.iso ces
dc.language.iso eng
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation.replaces http://hdl.handle.net/11234/1-1731
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri http://creativecommons.org/licenses/by/4.0/
dc.source.uri https://github.com/ufal/bilingual-abstracts-corpus
dc.subject parallel corpus
dc.subject scientific texts
dc.subject abstracts
dc.title Czech and English abstracts of ÚFAL papers (2022-11-11)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Rudolf Rosa rosa@ufal.mff.cuni.cz Charles University in Prague, UFAL
contact.person Vilém Zouhar vilem.zouhar@gmail.com ETH Zürich, Department of Computer Science
sponsor Grant Agency of Charles University in Prague GAUK 15723/2014 Modeling dependency syntax across languages nationalFunds
sponsor Ministry of Education, Youth and Sports of the Czech Republic LM2018101 LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities nationalFunds
size.info 2659 entries
size.info 11000 sentences
size.info 255000 words
files.size 3818008
files.count 1
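
The description above states that the corpus is stored in JSONL format, one record per line. A minimal sketch of reading such a file follows. The function names `read_jsonl` and `count_records` are illustrative, not part of the dataset, and the record's field names are not listed in this metadata, so the sketch only parses each line without assuming any keys.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of the file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


def count_records(path):
    """Count the records in a JSONL file."""
    return sum(1 for _ in read_jsonl(path))
```

If the downloaded file corresponds to this record, `count_records("corpus.jsonl")` would be expected to agree with the 2659 entries reported in size.info.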


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name corpus.jsonl
Size 3.64 MB
Format Unknown
Description The corpus
MD5 666b8f01db3671c4db8a298ff3b8eee7
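
The MD5 value listed for corpus.jsonl can be used to check a downloaded copy. The following is a minimal sketch, assuming the file has been saved locally; the helper names `md5_of_file` and `matches_md5` are illustrative.

```python
import hashlib

# MD5 checksum as listed in this record for corpus.jsonl
EXPECTED_MD5 = "666b8f01db3671c4db8a298ff3b8eee7"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_md5("corpus.jsonl")` returns True only when the local file has the checksum listed above; the local path is an assumption.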
