
Hindi Web Texts

Please use the following text to cite this item:
Bojar, Ondřej; Straňák, Pavel and Zeman, Daniel, 2011, Hindi Web Texts, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11858/00-097C-0000-0001-CC1E-B.
Date issued
2011-11-23
Size
308,000,000 tokens, 18,000,000 sentences
Language(s)
Hindi
Description
A Hindi corpus of texts downloaded mostly from news sites. It contains both the original raw texts and an extensively cleaned-up, tokenized version suitable for language modeling. In total: 18M sentences, 308M tokens.
Acknowledgement
This item is Publicly Available
and licensed under:
Files in this item
Name
UMC004-Hindi-web-texts.originals-and-plaintexts.tgz
Size
1.34 GB
Format
application/x-gzip
Description
The complete data in both the original and the cleaned-up form
MD5
890d409f7e932dca8a6eea990ac86c12
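The MD5 value above can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch in Python, not part of the original record: the helper names md5_of_file and matches_expected are hypothetical, and the archive is assumed to have been saved locally under the file name listed above.

```python
import hashlib

# Values copied from the record above.
EXPECTED_MD5 = "890d409f7e932dca8a6eea990ac86c12"
ARCHIVE_NAME = "UMC004-Hindi-web-texts.originals-and-plaintexts.tgz"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file (hypothetical helper).

    The file is read in 1 MiB chunks so that an archive larger than
    1 GB does not have to fit in memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 equals the expected hex digest."""
    return md5_of_file(path) == expected.lower()
```

Calling matches_expected(ARCHIVE_NAME) in the download directory should return True for an intact copy. A mismatch suggests the download was truncated or corrupted. MD5 is adequate for catching accidental corruption, but it is not a safeguard against deliberate tampering.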