This dataset contains automatic paraphrases of Czech official reference translations for the Workshop on Statistical Machine Translation shared task. The data covers the years 2011, 2013 and 2014.
For each sentence, at most 10000 paraphrases were included (randomly selected from the full set).
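The cap described above can be illustrated with a short sketch. The description does not say how the random selection was implemented; uniform sampling without replacement is assumed here, and the function name `select_paraphrases` and the constant `MAX_PARAPHRASES` are hypothetical.

```python
import random

# Cap stated in the dataset description.
MAX_PARAPHRASES = 10000


def select_paraphrases(paraphrases, limit=MAX_PARAPHRASES, rng=None):
    """Return at most `limit` paraphrases of one sentence.

    Assumption: selection is uniform without replacement; the actual
    procedure used to build the dataset is not documented here.
    """
    rng = rng or random.Random()
    items = list(paraphrases)
    if len(items) <= limit:
        # Fewer paraphrases than the cap: keep the full set.
        return items
    return rng.sample(items, limit)
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible.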
The goal of using this dataset is to improve automatic evaluation of machine translation outputs.
If you use this work, please cite the following paper:
Tamchyna Aleš, Barančíková Petra: Automatic and Manual Paraphrases for MT Evaluation. In Proceedings of LREC, 2016.
The automatically generated spelling correction corpus for Czech (Czesl-SEC-AG) is a corpus containing text with automatically generated spelling errors. The errors are created with a character error model that holds the probabilities of character substitution, insertion and deletion, and of swapping two adjacent characters. The probabilities of changing character casing are also considered. The original clean text on which the spelling errors were generated is PDT3.0 (http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3). The original train/dev/test sentence split of the PDT3.0 corpus is preserved in this dataset.
Besides the data with artificial spelling errors, we also publish the texts from which the character error model was created. These are the original manual transcript of the audiobook Švejk and its corrected version, prepared by the authors of Korektor (http://ufal.mff.cuni.cz/korektor). As in the CzeSL Grammatical Error Correction Dataset (CzeSL-GEC: http://hdl.handle.net/11234/1-2143), these data are divided into four sets based on the difficulty of the errors present.
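As a rough illustration of how such a character error model can be applied to clean text, the sketch below picks at most one edit operation per character. It is a simplification under stated assumptions, not the procedure used to build the corpus: the probability values, the `ALPHABET` string and the function name `add_spelling_errors` are hypothetical, and replacement characters are drawn uniformly, whereas a model estimated from data would condition them on the original character.

```python
import random

# Hypothetical per-character operation probabilities (not taken from the corpus).
ERROR_PROBS = {
    "substitute": 0.02,
    "insert": 0.01,
    "delete": 0.01,
    "swap": 0.01,
    "case": 0.01,
}
# Assumed set of characters used for substitutions and insertions.
ALPHABET = "aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž"
OPERATIONS = ("substitute", "insert", "delete", "swap", "case")


def add_spelling_errors(text, probs=ERROR_PROBS, rng=None):
    """Return a copy of `text` with randomly generated spelling errors."""
    rng = rng or random.Random()
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        # Choose at most one operation for this character.
        r = rng.random()
        threshold = 0.0
        op = None
        for name in OPERATIONS:
            threshold += probs[name]
            if r < threshold:
                op = name
                break
        if op == "substitute" and ch.isalpha():
            out.append(rng.choice(ALPHABET))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(ALPHABET))
        elif op == "delete":
            pass  # drop the character
        elif op == "swap" and i + 1 < len(text):
            # Emit the next character first, then skip over it.
            out.append(text[i + 1])
            out.append(ch)
            i += 1
        elif op == "case" and ch.isalpha():
            out.append(ch.swapcase())
        else:
            out.append(ch)  # no error, or the operation does not apply here
        i += 1
    return "".join(out)
```

Calling `add_spelling_errors(sentence, rng=random.Random(1))` with a fixed seed yields the same noisy output on every run, which helps when regenerating a train/dev/test split.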
A collection of parallel corpora: English-Lithuanian (2 million words), Lithuanian-English (0.06 million words), Czech-Lithuanian (0.8 million words) and Lithuanian-Czech (0.02 million words). All the corpora are searchable online through a single interface at http://donelaitis.vdu.lt/main_en.php?id=4&nr=1_2. The collection is still being updated with new texts.
The segment from the 1938 Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) Issue No. 39 encourages blood donation in preparation for the anticipated war. The information given on the Czechoslovak Red Cross' activities is illustrated by footage of blood donations, Red Cross nurses taking blood samples, and the canning of blood for military purposes.
Sculptor Bohumil Kafka works on a statue of Josef Mánes in a fragmented segment from the Ufa žurnál (Ufa Journal) 1939, issue no. 200. The unveiling of the monument by the Rudolfinum, including a speech by Professor Vratislav Nechleba, is shown in a fragmented segment from Československé filmové noviny (Czechoslovak Film News) 1951, issue no. 52. Kafka works at Prague Zoo on a study of a lion for the Milan Rastislav Štefánik monument in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1937, issue no. 5. Kafka with politician Milan Hodža in the artist's studio in Prague-Dejvice.
Digital copies of historical botanical papers from the Missouri Botanical Garden Library. The items are image scans of historical botanical writings; German-language texts make up only part of the collection.
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons family of licenses. It was extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.