Skip to search
Skip to main content
Skip to first result
Search
Search Results
Type:
corpus
Language:
Bulgarian
Description:
Written, synchronic, general, manually annotated; 50 000 tokens, 2600 sentences extracted from the BulTreeBank Text Archive in order to contain the most frequent ambiguity classes in Bulgarian
Rights:
Not specified
Type:
lexicalConceptualResource
Language:
Bulgarian
Description:
805 prepositions, pronouns, etc stop words, UTF-16 list of wordforms
Rights:
Not specified
Type:
corpus
Language:
Bulgarian
Description:
72 000 000 tokens, 15% fiction, 78% newspapers and 7% legal texts, government bulletins and others
Rights:
Not specified
Creator:
Simov, Kiril
Publisher:
Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Type:
toolService
Description:
The tokenizer is covering all languages that use Latin1, Laitn2, Latin3 and Cyrillic tables of Unicode. Can be extended to cover other tables in Unicode if necessary. The implementation is as a cascaded regular grammar in CLaRK. It recognizes over 60 token categories. It is easy to be adapted to new token categories.
Rights:
Not specified
Publisher:
Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:
toolService
Language:
Catalan and Spanish
Description:
Tool for neologism extraction.
Rights:
Not specified
Creator:
Grác, Marek
Publisher:
Masaryk University, NLP Centre
Type:
text and corpus
Subject:
interannotator agreement , corpus , chunks , phrases , and clauses
Language:
Czech
Description:
Czech corpus annotated for NP and clause chunks by 3-11 annotators (with average inter-annotator agreement at 88%). It consists of 10,000 sentences.
Rights:
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0) , http://creativecommons.org/licenses/by-nc-nd/3.0/ , and PUB
Publisher:
Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:
toolService
Language:
Catalan and Spanish
Description:
Terminology management
Rights:
Not specified
Publisher:
Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:
toolService
Language:
Catalan , English , and Spanish
Description:
Tool for querying the Technical Corpus of the Institut Universitari de Lingüística Aplicada.
Rights:
Not specified
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Panjabi , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) , http://creativecommons.org/licenses/by-nc/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Marathi , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Urdu , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) , http://creativecommons.org/licenses/by-nc-nd/4.0/ , and PUB