Subject: web texts - LINDAT/CLARIAH-CZ Catalog Search Results

Start Over Subject web texts

Creator:: Bojar, Ondřej, Straňák, Pavel, and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: news and web texts
Language:: Hindi
Description:: A Hindi corpus of texts downloaded mostly from news sites. Contains both the original raw texts and an extensively cleaned-up and tokenized version suitable for language modeling. 18M sentences, 308M tokens and FP7-ICT-2007-3-231720 (EuroMatrix Plus), 7E09003 (Czech part of EM+)
Rights:: Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0), http://creativecommons.org/licenses/by-nc/3.0/, and PUB

Search