Number of results to display per page
Search Results
72. EVALD 4.0 – Evaluator of Discourse
- Creator:
- Novák, Michal, Mírovský, Jiří, Rysová, Kateřina, Rysová, Magdaléna, and Hajičová, Eva
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- text coherence, discourse, automatic evaluation, and non-native speakers
- Language:
- Czech
- Description:
- EVALD 4.0 serves for automatic evaluation of surface coherence (cohesion) in Czech texts written by native speakers of Czech.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
73. EVALD 4.0 for Beginners – Evaluator of Discourse
- Creator:
- Novák, Michal, Mírovský, Jiří, Rysová, Kateřina, Rysová, Magdaléna, and Hajičová, Eva
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- text coherence, discourse, automatic evaluation, and non-native speakers
- Language:
- Czech
- Description:
- EVALD 4.0 for Beginners is a software that serves for automatic evaluation of Czech texts written by non-native speakers of Czech – language beginners.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
74. EVALD 4.0 for Foreigners – Evaluator of Discourse
- Creator:
- Novák, Michal, Mírovský, Jiří, Rysová, Kateřina, Rysová, Magdaléna, and Hajičová, Eva
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- text coherence, discourse, automatic evaluation, and non-native speakers
- Language:
- Czech
- Description:
- EVALD 4.0 for Foreigners is a software for automatic evaluation of surface coherence (cohesion) in Czech texts written by non-native speakers of Czech.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
75. FAUST cs-en 0.5
- Creator:
- Hajič, Jan, Mareček, David, Fučíková, Eva, Cinková, Silvie, Štěpánek, Jan, Mikulová, Marie, and Popel, Martin
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- noisy texts, parallel corpus, and machine translation
- Language:
- English and Czech
- Description:
- This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308). Each original (noisy) sentence was normalized (clean1 and clean2) and translated to English independently by two translators.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
76. FERNET-C5
- Creator:
- Lehečka, Jan and Švec, Jan
- Publisher:
- University of West Bohemia, Department of Cybernetics
- Type:
- text, mlmodel, and languageDescription
- Subject:
- Czech and BERT
- Language:
- Czech
- Description:
- The FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5) data - a Czech mutation of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text data). The model has the same architecture as the original BERT model, i.e. 12 transformation blocks, 12 attention heads and the hidden size of 768 neurons. In contrast to Google’s BERT models, we used SentencePiece tokenization instead of the Google’s internal WordPiece tokenization. More details can be found in README.txt. Yet more detailed description is available in https://arxiv.org/abs/2107.10042 The same models are also released at https://huggingface.co/fav-kky/FERNET-C5
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
77. FicTree 1.0
- Creator:
- Jelínek, Tomáš, Hnátková, Milena, and Skoumalová, Hana
- Publisher:
- Charles University, Faculty of Arts, Institute of Theoretical and Computational Linguistics
- Type:
- text and corpus
- Subject:
- treebank
- Language:
- Czech
- Description:
- FicTree is a dependency treebank of Czech fiction manually annotated in the format of the analytical layer of the Prague Dependency Trebank. The treebank consists of 12,760 sentences (166,432 tokens). The texts come from eight literary works published in the Czech Republic between 1991 and 2007. The syntactic annotation of the treebank was first performed by two distinct parsers (MSTParser and MaltParser) trained on the PDT training data, then manually corrected. Any differences between the two versions were resolved manually (by another annotator). The corpus is provided in a vertical format, where sentence boundaries are marked with a blank line. Every word form is written on a separate line, followed by five tab-separated attributes: lemma, tag, ID (word index in the sentence), head and deprel (analytical function, afun in the PDT formalism). The texts are shuffled in random chunks of maximum 100 words (respecting sentence boundaries). Each chunk is provided as a separate file, with the suggested division into train, dev and test sets written as file prefix.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
78. ForFun 1.0
- Creator:
- Mikulová, Marie and Bejček, Eduard
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- service and toolService
- Subject:
- form, function, database, and syntax
- Language:
- Czech
- Description:
- ForFun is a database of linguistic forms and their syntactic functions built with the use of the multi-layer annotated corpora of Czech, the Prague Dependency Treebanks. The purpose of the Prague Database of Forms and Functions (ForFun) is to help the linguists to study the form-function relation, which we assume to be one of the principal tasks of both theoretical linguistics and natural language processing. A prototypical question to be asked is "What purposes does a preposition 'po' serve for" or "What are the linguistic means in the sentence that can express the meaning 'a destination of an action'?". There are almost 1500 distinct forms (besides the 'po' preposition) and 65 distinct functions (besides the 'destination').
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
79. HaCzech: Dataset of Handwritten Czech
- Creator:
- Procházka, Štěpán and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- htr, ocr, manuscripts, chronicles, and handwriting
- Language:
- Czech
- Description:
- The dataset of handwritten Czech text lines, sourced from two chronicles (municipal chronicles 1931-1944, school chronicles 1913-1933). The dataset comprises 25k lines machine-extracted from scanned pages, and provides manual annotation of text contents for a subset of size 2k.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
80. Hausa Visual Genome 1.0
- Creator:
- Abdulmumin, Idris, Das, Satya Ranja, Dawud, Musa Abdullahi, Parida, Shantipriya, Muhammad, Shamsuddeen Hassan, Ahmad, Ibrahim Sa'id, Panda, Subhadarshi, Bojar, Ondřej, Galadanci, Bashir Shehu, and Bello, Bello Shehu
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- multi-modal, machine translation, image captioning, image annotation, and neural machine translation
- Language:
- Hausa and English
- Description:
- Data ------- Hausa Visual Genome 1.0, a multimodal dataset consisting of text and images suitable for English-to-Hausa multimodal machine translation tasks and multimodal research. We follow the same selection of short English segments (captions) and the associated images from Visual Genome as the dataset Hindi Visual Genome 1.1 has. We automatically translated the English captions to Hausa and manually post-edited, taking the associated images into account. The training set contains 29K segments. Further 1K and 1.6K segments are provided in development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome. Additionally, a challenge test set of 1400 segments is available for the multi-modal task. This challenge test set was created in Hindi Visual Genome by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity. Dataset Formats ----------------------- The multimodal dataset contains both text and images. The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files. All the text files have seven columns as follows: Column1 - image_id Column2 - X Column3 - Y Column4 - Width Column5 - Height Column6 - English Text Column7 - Hausa Text The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width, and Height columns indicate the rectangular region in the image described by the caption. Data Statistics -------------------- The statistics of the current release are given below. Parallel Corpus Statistics ----------------------------------- Dataset Segments English Words Hausa Words ---------- -------- ------------- ----------- Train 28930 143106 140981 Dev 998 4922 4857 Test 1595 7853 7736 Challenge Test 1400 8186 8752 ---------- -------- ------------- ----------- Total 32923 164067 162326 The word counts are approximate, prior to tokenization. Citation ----------- If you use this corpus, please cite the following paper: @InProceedings{abdulmumin-EtAl:2022:LREC, author = {Abdulmumin, Idris and Dash, Satya Ranjan and Dawud, Musa Abdullahi and Parida, Shantipriya and Muhammad, Shamsuddeen and Ahmad, Ibrahim Sa'id and Panda, Subhadarshi and Bojar, Ond{\v{r}}ej and Galadanci, Bashir Shehu and Bello, Bello Shehu}, title = "{Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation}", booktitle = {Proceedings of the Language Resources and Evaluation Conference}, month = {June}, year = {2022}, address = {Marseille, France}, publisher = {European Language Resources Association}, pages = {6471--6479}, url = {https://aclanthology.org/2022.lrec-1.694} }
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB