Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)

231. OdiEnCorp 1.0

Creator:: Parida, Shantipriya and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: Odia English Parallel Corpus, Odia Monolingual Corpus, and English-Odia Machine Translation
Language:: Oriya (macrolanguage) and English
Description:: Data
----
We have collected English-Odia parallel and monolingual data from the available public websites for NLP research in Odia. The parallel corpus consists of the English-Odia parallel Bible, the Odia digital library, and Odisha Government websites. It covers the Bible, literature, and the Government of Odisha and its policies. We have processed the raw data collected from the websites, performed alignments (a mix of manual and automatic alignments) and released the corpus in a form ready for various NLP tasks. The Odia monolingual data consists of Odia-Wikipedia and Odia e-magazine websites. Because the major portion of the data is extracted from Odia-Wikipedia, it covers all kinds of domains. The e-magazine data mostly cover the literature domain. We have preprocessed the monolingual data, including de-duplication, text normalization, and sentence segmentation, to make it ready for various NLP tasks.

Corpus Formats
--------------
Both corpora are in simple tab-delimited plain text files.
The parallel corpus files have three columns:
- the original book/source of the sentence pair
- the English sentence
- the corresponding Odia sentence
The monolingual corpus has a varying number of columns:
- each line corresponds to one *paragraph* (or related unit) of the original source
- each tab-delimited unit corresponds to one *sentence* in the paragraph

Data Statistics
----------------
The statistics of the current release are given below.

Parallel Corpus Statistics
---------------------------
Dataset   Sentences   #English tokens   #Odia tokens
-------   ---------   ---------------   ------------
Train         27136            706567         604147
Dev             948             21912          19513
Test           1262             28488          24365
-------   ---------   ---------------   ------------
Total         29346            756967         648025

Domain Level Statistics
------------------------
Domain                Sentences   #English tokens   #Odia tokens
-------------------   ---------   ---------------   ------------
Bible                     29069            756861         640157
Literature                  424              7977           6611
Government policies         204              1411           1257
-------------------   ---------   ---------------   ------------
Total                     29697            766249         648025

Monolingual Corpus Statistics
-----------------------------
Paragraphs   Sentences   #Odia tokens
----------   ---------   ------------
     71698      221546        2641308

Domain Level Statistics
-----------------------
Domain           Paragraphs       Sentences   #Odia tokens
--------------   --------------   ---------   ------------
General (wiki)   30468 (42.49%)      102085        1320367
Literature       41230 (57.50%)      119461        1320941
--------------   --------------   ---------   ------------
Total            71698               221546        2641308

Citation
--------
If you use this corpus, please cite it directly (see above), and please also cite the following paper:
Title: OdiEnCorp: Odia-English and Odia-Only Corpus for Machine Translation
Author: Shantipriya Parida, Ondrej Bojar, and Satya Ranjan Dash
Proceedings of the Third International Conference on Smart Computing & Informatics (SCI) 2018
Series: Smart Innovation, Systems and Technologies (SIST)
Publisher: Springer Singapore
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
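
The two tab-delimited layouts described for OdiEnCorp 1.0 can be read with a few lines of Python. This is a minimal sketch, not code shipped with the corpus: the names `SentencePair`, `read_parallel` and `read_monolingual` are hypothetical, and it assumes that no field contains an embedded tab and that empty lines can be skipped.

```python
from typing import Iterable, Iterator, List, NamedTuple


class SentencePair(NamedTuple):
    """One line of the parallel corpus (hypothetical container)."""
    source: str   # original book/source of the sentence pair
    english: str  # the English sentence
    odia: str     # the corresponding Odia sentence


def read_parallel(lines: Iterable[str]) -> Iterator[SentencePair]:
    """Yield one SentencePair per three-column, tab-delimited line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines carry no data
        source, english, odia = line.split("\t")
        yield SentencePair(source, english, odia)


def read_monolingual(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one paragraph per line, as the list of its sentences."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            # Each tab-delimited unit is one sentence of the paragraph.
            yield line.split("\t")
```

Both functions accept any iterable of lines, so an open file object (opened with `encoding="utf-8"`) can be passed directly.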

232. OdiEnCorp 2.0

Creator:: Parida, Shantipriya and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parallel corpus, corpus, machine translation, and under-resourced language
Language:: Oriya (macrolanguage) and English
Description:: Data
-----
We have collected English-Odia parallel data for the purposes of NLP research of the Odia language. The data for the parallel corpus was extracted from existing parallel corpora such as OdiEnCorp 1.0 and PMIndia, and from books which contain both English and Odia text, such as grammar and bilingual literature books. We also included parallel text from multiple public websites such as Odia Wikipedia, Odia digital library, and Odisha Government websites. The parallel corpus covers many domains: the Bible, other literature, Wiki data relating to many topics, Government policies, and general conversation. We have processed the raw data collected from the books and websites, performed sentence alignments (a mix of manual and automatic alignments) and released the corpus in a form suitable for various NLP tasks.

Corpus Format
-------------
OdiEnCorp 2.0 is stored in simple tab-delimited plain text files, each with three tab-delimited columns:
- a coarse indication of the domain
- the English sentence
- the corresponding Odia sentence
The corpus is shuffled at the level of sentence pairs.
The coarse domains are:
books ... prose text
dict ... dictionaries and phrasebooks
govt ... partially formal text
odiencorp10 ... OdiEnCorp 1.0 (mix of domains)
pmindia ... PMIndia (the original corpus)
wikipedia ... sentences and phrases from Wikipedia

Data Statistics
---------------
The statistics of the current release are given below. Note that the statistics differ from those reported in the paper due to deduplication at the level of sentence pairs. The deduplication was performed within each of the dev set, test set and training set, taking the coarse domain indication into account. It is still possible that the same sentence pair appears more than once within the same set (dev/test/train) if it came from different domains, and it is also possible that a sentence pair appears in several sets (dev/test/train).

Parallel Corpus Statistics
--------------------------
                  Dev      Dev      Dev     Test     Test     Test    Train    Train    Train
                Sents     # EN     # OD    Sents     # EN     # OD    Sents     # EN     # OD
books            3523    42011    36723     3895    52808    45383     3129    40461    35300
dict             3342    14580    13838     3437    14807    14110     5900    21591    20246
govt                -        -        -        -        -        -      761    15227    13132
odiencorp10       947    21905    19509     1259    28473    24350    26963   704114   602005
pmindia          3836    70282    61099     3836    68695    59876    30687   551657   486636
wikipedia        1896     9388     9385     1917    21381    20951     1930     7087     7122
Total           13544   158166   140554    14344   186164   164670    69370  1340137  1164441

"Sents" are the counts of the sentence pairs in the given set (dev/test/train) and domain (books/dict/...). "# EN" and "# OD" are approximate counts of words (simply space-delimited, without tokenization) in English and Odia.
The total number of sentence pairs (lines) is 13544+14344+69370=97258. Ignoring the set and domain and deduplicating again, this number drops to 94857.

Citation
--------
If you use this corpus, please cite the following paper:
@inproceedings{parida2020odiencorp,
  title={OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation},
  author={Parida, Shantipriya and Dash, Satya Ranjan and Bojar, Ond{\v{r}}ej and Motlicek, Petr and Pattnaik, Priyanka and Mallick, Debasish Kumar},
  booktitle={Proceedings of the WILDRE5--5th Workshop on Indian Language Data: Resources and Evaluation},
  pages={14--19},
  year={2020}
}
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
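
The OdiEnCorp 2.0 description distinguishes deduplication that takes the coarse domain into account from deduplication that ignores set and domain. The sketch below illustrates the difference on the three-column format; it is an illustration under stated assumptions, not the authors' script. The function names are hypothetical, and it assumes exact string matching of sentence pairs with no embedded tabs in any field.

```python
from typing import Iterable, Set, Tuple

# Sentence-pair counts per set, copied from the "Total" row of the statistics.
SENTS_PER_SET = {"dev": 13544, "test": 14344, "train": 69370}
TOTAL_LINES = sum(SENTS_PER_SET.values())


def unique_with_domain(lines: Iterable[str]) -> Set[Tuple[str, str, str]]:
    """Deduplicate on (domain, English, Odia): the same pair survives once per domain."""
    return {tuple(line.rstrip("\n").split("\t")) for line in lines if line.strip()}


def unique_ignoring_domain(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Deduplicate on (English, Odia) only, dropping the domain column."""
    seen = set()
    for line in lines:
        if not line.strip():
            continue
        _domain, english, odia = line.rstrip("\n").split("\t")
        seen.add((english, odia))
    return seen
```

Applying `unique_ignoring_domain` to the concatenation of the dev, test and train lines corresponds to the second figure quoted in the description, where set and domain are both ignored.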

233. OLiMPiC 1.0: OpenScore Lieder Linearized MusicXML Piano Corpus

Creator:: Mayer, Jiří, Straka, Milan, Hajič jr., Jan, and Pecina, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: image and corpus
Subject:: OpenScore Lieder, pianoform scores, MusicXML, and Linearized MusicXML
Language:: No linguistic content
Description:: OLiMPiC: OpenScore Lieder Linearized MusicXML Piano Corpus is a dataset containing synthetic and scanned images of pianoform music scores. The scores and the scanned images originate from the OpenScore Lieder Corpus https://github.com/OpenScore/Lieder . OLiMPiC contains the scores in MusicXML and Linearized MusicXML encoding, suitable for evaluation with the TEDn metric. The official train/dev/test split is also provided.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

234. Optimal reference translation of English-Czech WMT2020

Creator:: Kloudová, Věra, Mraček, David, Bojar, Ondřej, and Popel, Martin
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: translational equivalence, reference translation, optimal reference translation, and WMT
Language:: Czech and English
Description:: We define "optimal reference translation" as a translation thought to be the best possible that can be achieved by a team of human translators. Optimal reference translations can be used in assessments of excellent machine translations. We selected 50 documents (online news articles, with 579 paragraphs in total) from the 130 English documents included in the WMT2020 news test (http://www.statmt.org/wmt20/), aiming to preserve the diversity (style, genre, etc.) of the selection. In addition to the official Czech reference translation provided by the WMT organizers (P1), we hired two additional translators (P2 and P3, native Czech speakers) via a professional translation agency, resulting in three independent translations. The main contribution of this dataset is two additional translations (i.e. optimal reference translations N1 and N2), done jointly by two translators-cum-theoreticians with extreme care for various aspects of translation quality, while taking into account the translations P1-P3. We also publish internal comments (in Czech) for some of the segments. Translation N1 should be closer to the English original (with regard to the meaning and linguistic structure), and female surnames use the Czech feminine suffix (e.g. "Mai" is translated as "Maiová"). Translation N2 is freer, trying to be more creative, idiomatic and entertaining for the readers and following the typical style used in Czech media, while still preserving the rules of functional equivalence. Translation N2 is missing for the segments where it was not deemed necessary to provide two alternative translations. For applications/analyses needing a translation of all segments, this should be interpreted as if N2 is the same as N1 for a given segment. We provide the dataset in two formats: OpenDocument spreadsheet (odt) and plain text (one file for each translation and the English original). 
Some words were highlighted using different colors during the creation of optimal reference translations; this highlighting and comments are present only in the odt format (some comments refer to row numbers in the odt file). Documents are separated by empty lines and each document starts with a special line containing the document name (e.g. "# upi.205735"), which allows alignment with the original WMT2020 news test. For the segments where N2 translations are missing in the odt format, the respective N1 segments are used instead in the plain-text format.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
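
The plain-text layout described above (documents separated by empty lines, each opening with a line such as "# upi.205735") can be split into documents as follows. This is a minimal sketch: `read_documents` is a hypothetical name, and the code assumes one segment per line, which the description does not state explicitly.

```python
from typing import Dict, Iterable, List


def read_documents(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group segment lines by the document name given on each '# name' line."""
    docs: Dict[str, List[str]] = {}
    current = None  # name of the document being read, if any
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            current = None  # an empty line closes the current document
            continue
        if current is None:
            # The first line of every document must be the special name line.
            if not line.startswith("# "):
                raise ValueError(f"expected a document name line, got: {line!r}")
            current = line[2:]
            docs[current] = []
        else:
            docs[current].append(line)  # assumption: one segment per line
    return docs
```

Reading the English file and one translation file with the same function and pairing the lists under each document name gives a segment-level alignment, provided the files have the same number of lines per document.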

235. Optimal Reference Translations from English to Czech

Creator:: Zouhar, Vilém, Kloudová, Věra, Popel, Martin, and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: translation, evaluation, and optimal reference translation
Language:: English and Czech
Description:: This corpus contains annotations of translation quality from English to Czech in seven categories at both the segment and the document level. There are 20 documents in total, each with 4 translations (evaluated by each annotator in parallel) of 8 segments (a segment can be longer than one sentence). Apart from the evaluation, the annotators also proposed their own, improved versions of the translations. There were 11 annotators in total, with expertise levels ranging from non-experts to professional translators.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

236. Package of word embeddings of Czech from a large corpus

Creator:: Kyjánek, Lukáš and Bonami, Olivier
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, computationalLexicon, and lexicalConceptualResource
Subject:: word embeddings, word vectors, large corpus, word2vec, skipgram, and cbow
Language:: Czech
Description:: This package comprises eight models of Czech word embeddings trained by applying word2vec (Mikolov et al. 2013) to the currently most extensive corpus of Czech, namely SYN v9 (Křen et al. 2022). The minimum frequency threshold for including a word in the model was 10 occurrences in the corpus. The original lemmatisation and tagging included in the corpus were used for disambiguation. In the case of word embeddings of word forms, units comprise word forms and their tag from a positional tagset (cf. https://wiki.korpus.cz/doku.php/en:pojmy:tag) separated by '>', e.g., kočka>NNFS1-----A----. The published package provides models trained on both tokens and lemmas. In addition, the models combine training algorithms (CBOW and Skipgram) and dimensions of the resulting vectors (100 or 500), while the training window and negative sampling remained the same during the training. The package also includes files with frequencies of word forms (vocab-frequencies.forms) and lemmas (vocab-frequencies.lemmas).
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
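
For the word-form models, each vocabulary unit joins a word form and its positional tag with '>', as in kočka>NNFS1-----A----. The sketch below splits such a unit and enumerates the eight model configurations implied by the description (two unit types, two algorithms, two dimensions). The function name and the configuration labels are hypothetical and are not the file names used in the package; splitting at the last '>' is an assumption made so that a word form containing '>' stays intact.

```python
from itertools import product
from typing import Tuple


def split_unit(unit: str) -> Tuple[str, str]:
    """Split 'form>TAG' into (form, tag) at the last '>' separator."""
    form, sep, tag = unit.rpartition(">")
    if not sep:
        raise ValueError(f"no '>' separator in unit: {unit!r}")
    return form, tag


# Hypothetical labels for the combinations named in the description.
MODEL_GRID = list(product(("forms", "lemmas"), ("cbow", "skipgram"), (100, 500)))

form, tag = split_unit("kočka>NNFS1-----A----")
print(form, tag)
print(len(MODEL_GRID), "model configurations")
```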

237. ParaDi 2.0

Creator:: Barančíková, Petra and Kettnerová, Václava
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, machineReadableDictionary, and lexicalConceptualResource
Subject:: multiword expressions, light verb construction, paraphrases, and idioms
Language:: Czech
Description:: ParaDi 2.0 is a dictionary of single-verb paraphrases of Czech verbal multiword expressions: light verb constructions and idiomatic verb constructions. Moreover, it provides an elaborate set of morphological, syntactic and semantic features, including information on aspectual counterparts of verbs and on paraphrasability conditions of given verbs. The format of ParaDi has been designed with respect to both human and machine readability: the dictionary is represented as a plain table in TSV format, as it is a flexible and language-independent data format.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

238. ParaDi 2.0 (2018-01-24)

Creator:: Barančíková, Petra and Kettnerová, Václava
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, machineReadableDictionary, and lexicalConceptualResource
Subject:: multiword expressions, light verb construction, paraphrases, and idioms
Language:: Czech
Description:: ParaDi 2.0 is a dictionary of single-verb paraphrases of Czech verbal multiword expressions: light verb constructions and idiomatic verb constructions. Moreover, it provides an elaborate set of morphological, syntactic and semantic features, including information on aspectual counterparts of verbs and on paraphrasability conditions of given verbs. The format of ParaDi has been designed with respect to both human and machine readability: the dictionary is represented as a plain table in TSV format, as it is a flexible and language-independent data format.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

239. ParaDi: Dictionary of Paraphrases of Czech Complex Predicates with Light Verbs

Creator:: Barančíková, Petra and Kettnerová, Václava
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, machineReadableDictionary, and lexicalConceptualResource
Subject:: light verb construction and paraphrases
Language:: Czech
Description:: Dictionary of single verb paraphrases of Czech light verb constructions.
Rights:: Public Domain Mark (PD), http://creativecommons.org/publicdomain/mark/1.0/, and PUB

240. ParCzech 3.0

Creator:: Kopp, Matyáš, Stankov, Vladislav, Bojar, Ondřej, Hladká, Barbora, and Straňák, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: Parliament of the Czech Republic, Chamber of Deputies, stenographic protocols, TEI encoding, and speech corpus
Language:: Czech
Description:: The ParCzech 3.0 corpus is the third version of ParCzech, consisting of stenographic protocols that record the Chamber of Deputies’ meetings held in the 7th term (2013-2017) and the current 8th term (2017-Mar 2021). The protocols are provided in their original HTML format, in Parla-CLARIN TEI format, and in a format suitable for Automatic Speech Recognition. The corpus is automatically enriched with morphological, syntactic, and named-entity annotations using UDPipe 2 and NameTag 2. The audio files are aligned with the texts in the annotated TEI files.
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB
