Harvested from: LINDAT/CLARIAH-CZ repository - LINDAT/CLARIAH-CZ Catalog Search Results

91. BulTreeBank POS Corpus

Type:: corpus
Language:: Bulgarian
Description:: Written, synchronic, general, manually annotated; 50 000 tokens, 2600 sentences extracted from the BulTreeBank Text Archive so as to cover the most frequent ambiguity classes in Bulgarian
Rights:: Not specified

92. BulTreeBank Stopword List

Type:: lexicalConceptualResource
Language:: Bulgarian
Description:: 805 stop words (prepositions, pronouns, etc.) as a UTF-16 list of word forms
Rights:: Not specified

93. BulTreeBank Text Archive

Type:: corpus
Language:: Bulgarian
Description:: 72 000 000 tokens, 15% fiction, 78% newspapers and 7% legal texts, government bulletins and others
Rights:: Not specified

94. BulTreeBank Tokenizer

Creator:: Simov, Kiril
Publisher:: Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Type:: toolService
Description:: The tokenizer covers all languages that use the Latin1, Latin2, Latin3 and Cyrillic tables of Unicode, and can be extended to cover other Unicode tables if necessary. It is implemented as a cascaded regular grammar in CLaRK. It recognizes over 60 token categories and can easily be adapted to new ones.
Rights:: Not specified

95. BUSCANEO

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan and Spanish
Description:: Tool for neologism extraction.
Rights:: Not specified

96. BushBank

Creator:: Grác, Marek
Publisher:: Masaryk University, NLP Centre
Type:: text and corpus
Subject:: interannotator agreement, corpus, chunks, phrases, and clauses
Language:: Czech
Description:: Czech corpus annotated for NP and clause chunks by 3-11 annotators (with average inter-annotator agreement at 88%). It consists of 10,000 sentences.
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

97. Bústia Neològica Escolar

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan and Spanish
Description:: Terminology management
Rights:: Not specified

98. Bwananet

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: toolService
Language:: Catalan, English, and Spanish
Description:: Tool for querying the Technical Corpus of the Institut Universitari de Lingüística Aplicada.
Rights:: Not specified

99. C4Corpus (CC BY-NC part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Panjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB

100. C4Corpus (CC BY-NC-ND part)

Creator:: Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
Publisher:: Technische Universität Darmstadt
Type:: text and corpus
Subject:: CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
Language:: Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
Description:: A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB
