Original context has metadata only: false / Subject: lemmatization

1. CALEM (Comprehensive Arabic LEMmas)

Creator:: Namly, Driss, Bouzoubaa, Karim, and El Jihad, Abdelhamid
Publisher:: ALELM
Type:: text, lexicon, and lexicalConceptualResource
Subject:: lexicon, lemmatization, and stemming
Language:: Arabic
Description:: Comprehensive Arabic LEMmas is a lexicon covering a large list of Arabic lemmas and their corresponding inflected word forms (stems) with details (POS + Root). Each lexical entry represents a lemma followed by all its possible stems, and each stem is enriched with its morphological features, especially the root and the POS. It is composed of 164,845 lemmas representing 7,200,918 stems, broken down as follows: 757 Arabic particles; 2,464,631 verbal stems; 4,735,587 nominal stems. The lexicon is provided as an LMF-conformant XML-based file in UTF-8 encoding, which represents about 1.22 GB of data. Citation: Namly Driss, Karim Bouzoubaa, Abdelhamid El Jihad, and Si Lhoussain Aouragh. “Improving Arabic Lemmatization Through a Lemmas Database and a Machine-Learning Technique.” In Recent Advances in NLP: The Case of Arabic Language, pp. 81-100. Springer, Cham, 2020.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

2. CoNLL 2017 Shared Task - UDPipe Baseline Models and Supplementary Materials

Creator:: Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, mlmodel, and languageDescription
Subject:: CoNLL 2017, tokenizer, POS tagger, lemmatization, tagger, parser, dependency parser, morphology, and treebank
Language:: Multiple languages
Description:: Baseline UDPipe models for the CoNLL 2017 Shared Task in UD Parsing, and supplementary material. The models require UDPipe version 1.1 or later and are evaluated using the official evaluation script. The models are trained on a slightly different split of the official UD 2.0 CoNLL 2017 training data, the so-called baselinemodel split, in order to allow comparison of models even during the shared task. This baselinemodel split of the UD 2.0 CoNLL 2017 training data is available for download. Furthermore, we also provide UD 2.0 CoNLL 2017 training data with automatically predicted morphology. We utilize the baseline models on development data and perform 10-fold jack-knifing (each fold is predicted with a model trained on the rest of the folds) on the training data. Finally, we supply all required data and hyperparameter values needed to replicate the baseline models.
Rights:: Licence Universal Dependencies v2.0, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.0, and PUB
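
The 10-fold jack-knifing mentioned in the description above can be illustrated with a minimal sketch. This is not the UDPipe training code: the function name `jackknife`, the round-robin assignment of sentences to folds, and the toy `toy_train` callable are assumptions made purely for illustration. The point it shows is that every item is predicted by a model that never saw that item during training.

```python
from typing import Callable, List, Sequence


def jackknife(
    data: Sequence,
    train: Callable[[List], Callable],
    folds: int = 10,
) -> List:
    """Predict each item with a model trained on all the other folds.

    `train` is a hypothetical callable: it takes a list of training items
    and returns a model, i.e. a function mapping one item to a prediction.
    """
    predictions: List = [None] * len(data)
    for k in range(folds):
        # Assumed fold assignment: item i belongs to fold i % folds.
        held_out = [i for i in range(len(data)) if i % folds == k]
        training = [x for i, x in enumerate(data) if i % folds != k]
        model = train(training)
        for i in held_out:
            predictions[i] = model(data[i])
    return predictions


def toy_train(items: List) -> Callable:
    """Toy 'model' that only reports whether it saw the item in training."""
    seen = set(items)
    return lambda item: item in seen


if __name__ == "__main__":
    sentences = [f"sentence-{i}" for i in range(25)]
    leaked = jackknife(sentences, toy_train)
    # Every entry should be False: no item is predicted by a model
    # that had that item in its training portion.
    print(any(leaked))
```

With real data, `train` would wrap an actual tagger or lemmatizer trainer, and the collected predictions would form the "automatically predicted morphology" version of the training set.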

3. CoNLL 2018 Shared Task - UDPipe Baseline Models and Supplementary Materials

Creator:: Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, mlmodel, and languageDescription
Subject:: CoNLL 2018, tokenizer, POS tagger, lemmatization, tagger, parser, dependency parser, morphology, and treebank
Language:: Multiple languages
Description:: Baseline UDPipe models for the CoNLL 2018 Shared Task in UD Parsing, and supplementary material. The models require UDPipe version 1.2 or later and are evaluated using the official evaluation script. The models were trained using a custom data split for treebanks where no development data is provided. We also trained an additional "Mixed" model, which uses 200 sentences from each training set. All information needed to replicate the model training (hyperparameters, modified train-dev split, and pre-computed word embeddings for the parser) is included in the archive. Additionally, we provide UD 2.2 CoNLL 2018 training data with automatically predicted morphology. We utilize the baseline models on development data and perform 10-fold jack-knifing (each fold is predicted with a model trained on the rest of the folds) on the training data.
Rights:: Licence Universal Dependencies v2.2, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.2, and PUB

4. Czech PDT-C 1.0 Model for UDPipe 2 (2023-11-16)

Creator:: Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: tokenizer, POS tagger, lemmatization, parser, dependency parser, MorfFlex CZ 2.0, and PDT-C 1.0
Language:: Czech
Description:: Tokenizer, POS Tagger, Lemmatizer, and Parser model based on the PDT-C 1.0 treebank (https://hdl.handle.net/11234/1-3185). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#czech_pdtc1.0_model . To use these models, you need UDPipe version 2.1, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

5. EvaLatin 2020 models for UDPipe 2 (2020-08-31)

Creator:: Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: POS tagger, lemmatization, and tagger
Language:: Latin
Description:: POS Tagger and Lemmatizer models for EvaLatin2020 data (https://github.com/CIRCSE/LT4HALA). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#evalatin20_models . To use these models, you need UDPipe version at least 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

6. Indonesian web corpus (idWac)

Creator:: Medveď, Marek and Suchomel, Vít
Publisher:: Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Type:: text and corpus
Subject:: corpus, lemmatization, and PoS tagging
Language:: Indonesian
Description:: Indonesian text corpus collected from the web. Crawled with SpiderLing in 2017 and filtered with JusText and Onion (see http://corpus.tools/ for details). Tagged and lemmatized with MorphInd (http://septinalarasati.com/morphind/).
Rights:: NLP Centre Web Corpus License, https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC, and ACA

7. Korpus ORAL: sestavení, lemmatizace a morfologické značkování

Creator:: Kopřivová, Marie, Komrsková, Zuzana, Lukeš, David, and Poukarová, Petra
Format:: unmediated and volume
Type:: model:article and TEXT
Subject:: spoken Czech, spoken language corpora, lemmatization, tagging, morphological analysis, mluvená čeština, korpusy mluveného jazyka, lemmatizace, tagování, and morfologická analýza
Language:: Czech
Description:: The goal of this paper is to provide an overview of the structure and contents of the soon-to-be available ORAL corpus, which combines previously published corpora (ORAL2006, ORAL2008 and ORAL2013) with newly transcribed material into a single conveniently accessible and more richly annotated resource, about 6 million running words in length. The recordings and corresponding transcripts span a decade between 2002 and 2011; most of them capture interactions of mutually well-acquainted speakers, in informal situations and natural settings. The corpus is complemented by a marginal portion of more formal data, mostly public talks. It is tagged and lemmatized, and an effort was made to adapt existing tools (targeted at written language) to yield better results on spoken data. We hope the availability of such a resource will spawn further discussions on the morphological and syntactic analysis of spoken language, perhaps resulting in more radical departures in the future from the part-of-speech classification inherited from the linguistic analysis of written language.
Rights:: http://creativecommons.org/publicdomain/mark/1.0/ and policy:public

8. Persian Morphologically Segmented Lexicon 0.5

Creator:: Ansari, Ebrahim, Žabokrtský, Zdeněk, Haghdoost, Hamid, and Nikravesh, Mahshid
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text, lexicon, and lexicalConceptualResource
Subject:: morphological analysis and lemmatization
Language:: Persian
Description:: This dataset includes 45,300 Persian word forms that have been manually segmented into sequences of morphemes.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

9. Prague Dependency Treebank 3.5

Creator:: Hajič, Jan, Bejček, Eduard, Bémová, Alevtina, Buráňová, Eva, Hajičová, Eva, Havelka, Jiří, Homola, Petr, Kárník, Jiří, Kettnerová, Václava, Klyueva, Natalia, Kolářová, Veronika, Kučová, Lucie, Lopatková, Markéta, Mikulová, Marie, Mírovský, Jiří, Nedoluzhko, Anna, Pajas, Petr, Panevová, Jarmila, Poláková, Lucie, Rysová, Magdaléna, Sgall, Petr, Spoustová, Johanka, Straňák, Pavel, Synková, Pavlína, Ševčíková, Magda, Štěpánek, Jan, Urešová, Zdeňka, Vidová Hladká, Barbora, Zeman, Daniel, Zikánová, Šárka, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: treebank, dependency, tectogrammatics, topic-focus articulation, multiword expressions, coreference, bridging relations, discourse, morphology, syntax, tokenization, lemmatization, clauses, semantics, semantic relations, lexical semantics, and lexicon
Language:: Czech
Description:: The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts, i.e., all annotation from PDT 1.0, PDT 2.0, PDT 2.5, PDT 3.0, PDiT 1.0 and PDiT 2.0, plus corrections, a new structure of the basic documentation, and a new list of authors covering all previous editions. The Prague Dependency Treebank 3.5 (PDT 3.5) contains the same texts as the previous versions since 2.0; there are 49,431 annotated sentences (832,823 words) on all layers, from tectogrammatical annotation to syntax to morphology. There are additional annotated sentences for syntax and morphology; the totals for the lower layers of annotation are: 87,913 sentences with 1,502,976 words at the analytical layer (surface dependency syntax) and 115,844 sentences with 1,956,693 words at the morphological layer of annotation (these totals include the sentences that are also annotated at the higher layers). Closely linked to the tectogrammatical layer is the annotation of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

10. The Model latinpipe-evalatin24-240520 for LatinPipe 2024

Creator:: Straka, Milan, Straková, Jana, and Gamba, Federica
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: tool and toolService
Subject:: LatinPipe, EvaLatin 2024, POS tagging, lemmatization, and dependency parsing
Language:: Latin
Description:: The latinpipe-evalatin24-240520 is a PhilBerta-based model for LatinPipe 2024 <https://github.com/ufal/evalatin2024-latinpipe>, performing tagging, lemmatization, and dependency parsing of Latin, based on the winning entry to the EvaLatin 2024 <https://circse.github.io/LT4HALA/2024/EvaLatin> shared task. It is released under the CC BY-NC-SA 4.0 license.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
