Harvested from: LINDAT/CLARIAH-CZ repository - LINDAT/CLARIAH-CZ Catalog Search Results

201. CUBBITT Translation Models (en-cs) (v1.0)

Creator:: Popel, Martin, Tomková, Markéta, Tomek, Jakub, Kaiser, Łukasz, Uszkoreit, Jakob, Bojar, Ondřej, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: machine translation, neural machine translation, transformer, and cubbitt
Language:: English and Czech
Description:: CUBBITT En-Cs translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2014 (BLEU): en->cs: 27.6 cs->en: 34.4 (Evaluated using multeval: https://github.com/jhclark/multeval)
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

202. CUBBITT Translation Models (en-fr) (v1.0)

Creator:: Popel, Martin, Tomková, Markéta, Tomek, Jakub, Kaiser, Łukasz, Uszkoreit, Jakob, Bojar, Ondřej, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: machine translation, neural machine translation, transformer, and cubbitt
Language:: English and French
Description:: CUBBITT En-Fr translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2014 (BLEU): en->fr: 38.2 fr->en: 36.7 (Evaluated using multeval: https://github.com/jhclark/multeval)
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

203. CUBBITT Translation Models (en-pl) (v1.0)

Creator:: Popel, Martin, Tomková, Markéta, Tomek, Jakub, Kaiser, Łukasz, Uszkoreit, Jakob, Bojar, Ondřej, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: machine translation, neural machine translation, transformer, and cubbitt
Language:: English and Polish
Description:: CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2020 (BLEU): en->pl: 12.3 pl->en: 20.0 (Evaluated using multeval: https://github.com/jhclark/multeval)
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

204. CWC2011

Creator:: Spoustová, Johanka and Spousta, Miroslav
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: corpus, Czech, and web
Language:: Czech
Description:: Web corpus of Czech, created in 2011. Contains newspapers+magazines, discussions, blogs. See http://www.lrec-conf.org/proceedings/lrec2012/summaries/120.html for details. and GA405/09/0278
Rights:: Creative Commons - Attribution 3.0 Unported (CC BY 3.0), http://creativecommons.org/licenses/by/3.0/, and PUB

205. Cyril Belica : Kookkurrenzdatenbank CCDB

Publisher:: Institut für Deutsche Sprache
Type:: toolService
Language:: German
Description:: A co-occurrence database, developed by the Institut fuer Deutsche Sprache, for research in the field of collocation analysis in modern German. The database holds over 200,000 analysed words that can be browsed or searched and shown in context.
Rights:: Not specified

206. Czech and English abstracts of ÚFAL papers

Creator:: Rosa, Rudolf
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parallel corpus, scientific texts, and abstracts
Language:: Czech and English
Description:: This is a document-aligned parallel corpus of English and Czech abstracts of scientific papers published by authors from the Institute of Formal and Applied Linguistics, Charles University in Prague, as reported in the institute's system Biblio. For each publication, the authors are obliged to provide both the original abstract in Czech or English, and its translation into English or Czech, respectively. No filtering was performed, except for removing entries missing the Czech or English abstract, and replacing newline and tabulator characters by spaces.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

207. Czech and English abstracts of ÚFAL papers (2022-11-11)

Creator:: Rosa, Rudolf and Zouhar, Vilém
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parallel corpus, scientific texts, and abstracts
Language:: Czech and English
Description:: This is a parallel corpus of Czech and mostly English abstracts of scientific papers and presentations published by authors from the Institute of Formal and Applied Linguistics, Charles University in Prague. For each publication record, the authors are obliged to provide both the original abstract (in Czech or English), and its translation (English or Czech) in the internal Biblio system. The data was filtered for duplicates and missing entries, ensuring that every record is bilingual. Additionally, records of published papers which are indexed by SemanticScholar contain the respective link. The dataset was created from September 2022 image of the Biblio database and is stored in JSONL format, with each line corresponding to one record.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

208. Czech and English Reflective Dataset (CEReD)

Creator:: Štefánik, Michal and Nehyba, Jan
Publisher:: Masaryk University, Brno
Type:: text and corpus
Subject:: reflective writing, reflective categories, pre-service teachers, and hand annotation
Language:: English and Czech
Description:: The database contains annotated reflective sentences, which fall into the categories of reflective writing according to Ullmann's (2019) model. The dataset is ready to replicate these categories' prediction using machine learning. Available from: https://anonymous.4open.science/repository/c856595c-dfc2-48d7-aa3d-0ccc2648c4dc/data
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

209. Czech Court Decisions Corpus (CzCDC 1.0)

Creator:: Novotná, Tereza and Harašta, Jakub
Publisher:: Faculty of Law, Masaryk University
Type:: text and corpus
Subject:: legal texts, judicial decisions, and court decisions
Language:: Czech
Description:: This is the Czech Court Decisions Corpus (CzCDC 1.0). This corpus contains whole texts of the decisions from three top-tier courts (Supreme, Supreme Administrative and Constitutional court) in Czech republic. Court decisions are published from 1st January 1993 to 30th September 2018. The language of decisions is Czech. Content of decisions is unedited and obtained directly from the competent court. Decisions are in .txt format in three folders divided by courts. Corpus contains three .csv files containing the list of all decisions with four columns: - name of the file: exact file name of a decision with extension .txt; - decision identifier (docket number): official identification of the decision as issued by the court; - date of decision: in ISO 8601 (YYYY-MM-DD); - court abbreviation: SupCo for Supreme Court, SupAdmCo for Supreme Administrative Court, ConCo for Constitutional Court Statistics: - SupCo: 111 977 decisions, 23 699 639 lines, 224 061 129 words, 1 462 948 200 bits; - SupAdmCo: 52 660 decisions, 18 069 993 lines, 137 839 985 words, 1 067 826 507 bits; - ConCo: 73 086 decisions, 6 178 371 lines, 98 623 753 words, 664 657 755 bits - all courts combined: 237 723 decisions, 47 948 003 lines, 460 524 867 words, 3 195 432 462 bits
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

210. Czech Grammar Agreement Dataset for Evaluation of Language Models

Creator:: Baisa, Vít
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: agreement, past tense verb suffix, language model, and training data
Language:: Czech
Description:: AGREE is a dataset and task for evaluation of language models based on grammar agreement in Czech. The dataset consists of sentences with marked suffixes of past tense verbs. The task is to choose the right verb suffix which depends on gender, number and animacy of subject. It is challenging for language models because 1) Czech is morphologically rich, 2) it has relatively free word order, 3) high out-of-vocabulary (OOV) ratio, 4) predicate and subject can be far from each other, 5) subjects can be unexpressed and 6) various semantic rules may apply. The task provides a straightforward and easily reproducible way of evaluating language models on a morphologically rich language.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

201. CUBBITT Translation Models (en-cs) (v1.0)

202. CUBBITT Translation Models (en-fr) (v1.0)

203. CUBBITT Translation Models (en-pl) (v1.0)

204. CWC2011

205. Cyril Belica : Kookkurrenzdatenbank CCDB

206. Czech and English abstracts of ÚFAL papers

207. Czech and English abstracts of ÚFAL papers (2022-11-11)

208. Czech and English Reflective Dataset (CEReD)

209. Czech Court Decisions Corpus (CzCDC 1.0)

210. Czech Grammar Agreement Dataset for Evaluation of Language Models

Limit your search

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Show values starting with

Search

Search Constraints

Search Results

Limit your search

Contributor

Show values starting with

Coverage

Show values starting with

Creator

Show values starting with

Format

Language

Show values starting with

Publisher

Show values starting with

Rights

Show values starting with

Subject

Show values starting with

Type

Show values starting with

Original context has metadata only

Harvested from