Harvested from: LINDAT/CLARIAH-CZ repository - LINDAT/CLARIAH-CZ Catalog Search Results

191. COSTRA 1.0: A Dataset of Complex Sentence Transformations

Creator:: Barančíková, Petra and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: sentences, sentence embeddings, paraphrases, and semantic relations
Language:: Czech
Description:: COSTRA 1.0 is a dataset of Czech complex sentence transformations. The dataset is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. The dataset consists of 4,262 unique sentences with an average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation. The hope is that with this dataset, we should be able to test semantic properties of sentence embeddings and perhaps even to find some topologically interesting “skeleton” in the sentence embedding space.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

192. COSTRA 1.1: A Dataset of Complex Sentence Transformations and Comparisons

Creator:: Barančíková, Petra and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: paraphrases, sentence embeddings, evaluation, and sentence
Language:: Czech
Description:: Costra 1.1 is a new dataset for testing geometric properties of sentence embedding spaces. In particular, it concentrates on examining how well sentence embeddings capture complex phenomena such as paraphrases, tense or generalization. The dataset is a direct expansion of Costra 1.0, which was extended with more sentences and sentence comparisons.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

193. Covid-19 Thesaurus

Creator:: Fener, Patricia
Publisher:: Institute for scientific and technical information (Inist) - CNRS/UAR76
Type:: thesaurus, text, and lexicalConceptualResource
Subject:: COVID-19, SARS coronavirus, Middle-East coronavirus, SARS-CoV, and MERS-CoV
Language:: French and English
Description:: This bilingual thesaurus (French-English), developed at Inist-CNRS, covers the concepts of the emerging COVID-19 outbreak, which recalls the earlier SARS coronavirus and Middle East coronavirus outbreaks. The thesaurus is based on the vocabulary used in scientific publications on SARS-CoV-2 and other coronaviruses, such as SARS-CoV and MERS-CoV. It supports the exploration of coronavirus infectious diseases. The thesaurus can be browsed and queried by humans and machines on the Loterre portal (https://www.loterre.fr), via an API and an RDF triplestore. It is also downloadable in PDF, SKOS, CSV and JSON-LD formats. The thesaurus is made available under a CC-by 4.0 license.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), PUB, and http://creativecommons.org/licenses/by/4.0/

194. Croatian Dependency Treebank

Publisher:: University of Zagreb, Faculty of Humanities and Social Sciences
Format:: application/octet-stream
Type:: corpus
Language:: Croatian
Description:: Manually tagged dependency treebank, analytical layer according to the PDT formalism adapted for Croatian
Rights:: Not specified

195. Croatian Lemmatization Server

Publisher:: University of Zagreb, Faculty of Humanities and Social Sciences
Type:: toolService
Language:: Croatian
Description:: Online service for lemmatization and full POS or MSD tagging of Croatian texts.
Rights:: Not specified

196. Croatian Morphological Lexicon

Publisher:: University of Zagreb, Faculty of Humanities and Social Sciences
Type:: lexicalConceptualResource
Language:: Croatian
Description:: 110,000+ lemmas; 3,900,000+ word-forms, MulText East lexica format
Rights:: Not specified

197. Croatian National Corpus

Publisher:: University of Zagreb, Faculty of Humanities and Social Sciences
Type:: corpus
Language:: Croatian
Description:: This is the reference corpus of standard Croatian. In its 3.0 version, which is accessible via noSketch Engine, it has 216.8 million tokens. In terms of annotation, the corpus is tokenised, lemmatised and tagged for MSDs (morphosyntactic descriptions).
Rights:: Not specified

198. CsEnVi Pairwise Parallel Corpora

Creator:: Hoang, Duc Tam and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: corpus, Vietnamese, parallel corpus, Czech-Vietnamese corpus, and English-Vietnamese corpus
Language:: Czech, English, and Vietnamese
Description:: CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources:
- OPUS, the open parallel corpus, is a growing multilingual corpus of translated open source documents. The majority of Vi-En and Vi-Cs bitexts are subtitles from movies and television series. The two sides of these bitexts tend to paraphrase each other's meaning rather than translate it closely.
- TED talks, a collection of short talks on various topics, given primarily in English, transcribed and with transcripts translated into other languages. In our corpus, we use 1198 talks which had English and Vietnamese transcripts available and 784 talks which had Czech and Vietnamese transcripts available in January 2015.

The size of the original corpora collected from OPUS and TED talks is as follows:

             CS/VI               EN/VI
Sentence     1337199/1337199     2035624/2035624
Word         9128897/12073975    16638364/17565580
Unique word  224416/68237        91905/78333

We improve the quality of the corpora in two steps: normalizing and filtering. In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate explicit continuation of subtitles across multiple time frames. These sequences of dots are distributed differently on the source and the target side. Removing them, along with applying a number of other normalization rules, improves the quality of the alignment significantly. In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs.

The size of the cleaned corpora as published is as follows:

             CS/VI               EN/VI
Sentence     1091058/1091058     1113177/1091058
Word         6718184/7646701     8518711/8140876
Unique word  195446/59737        69513/58286

The corpora are used as training data in [2].

References:
[1] Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC2012. ELRA. Istanbul, Turkey.
[2] Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462. 9/2015
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
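
The dot-removal rule mentioned in the description above can be illustrated with a minimal Python sketch. It is not the authors' normalization tool: the names `DOT_RUN`, `strip_dot_runs` and `normalize_pair` are hypothetical, and treating any run of two or more dots (or the single-character ellipsis) as a continuation marker is an assumption made for illustration only.

```python
import re
from typing import Tuple

# Assumption: a continuation marker is any run of two or more dots,
# or the single-character ellipsis. A lone sentence-final period is kept.
DOT_RUN = re.compile(r"\.{2,}|…")


def strip_dot_runs(line: str) -> str:
    """Remove runs of dots and collapse the whitespace they leave behind."""
    without_dots = DOT_RUN.sub(" ", line)
    return " ".join(without_dots.split())


def normalize_pair(source: str, target: str) -> Tuple[str, str]:
    """Apply the same rule to both sides of a sentence pair."""
    return strip_dot_runs(source), strip_dot_runs(target)
```

Applying one rule to both sides matters because, as the description notes, the dots are distributed differently in the source and the target, so removing them brings the two sides closer in form before alignment.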

199. CST's lemmatiser

Publisher:: Center for Sprogteknologi, University of Copenhagen
Type:: toolService
Language:: Danish, Dutch, English, German, Modern Greek (1453-), Icelandic, Norwegian, Russian, Slovenian, and Swedish
Description:: 1) Fully automatic rule-based lemmatization of inflected languages 2) Fully automatic training of lemmatization rules based on a full form-lemma list
Rights:: Not specified

200. CST's lemmatizer

Creator:: Jongejan, Bart
Publisher:: Københavns Universitet, Center for Sprogteknologi (CST)
Type:: toolService
Description:: 1) Fully automatic rule-based lemmatization of inflected languages 2) Fully automatic training of lemmatization rules based on a full form-lemma list
Rights:: Not specified
