Hindi Web Texts

Please use the following text to cite this item:
Bojar, Ondřej; Straňák, Pavel and Zeman, Daniel, 2011, Hindi Web Texts, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11858/00-097C-0000-0001-CC1E-B.
Date issued
2011-11-23
Size
308000000 tokens, 18000000 sentences
Language(s)
Description
A Hindi corpus of texts downloaded mostly from news sites. It contains both the original raw texts and an extensively cleaned-up, tokenized version suitable for language modeling. In total: 18M sentences, 308M tokens.
Acknowledgement

Version History

Version    Date                   Summary
           2014-03-21 00:00:00
1*         2011-11-23 00:00:00
* Selected version
This item is Publicly Available
and licensed under:
 Files in this item
Name
UMC004-Hindi-web-texts.originals-and-plaintexts.tgz
Size
1.34 GB
Format
application/x-gzip
Description
gzip Archive
MD5
890d409f7e932dca8a6eea990ac86c12
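Since the archive is large (1.34 GB), it may be worth checking the download against the MD5 value listed above before unpacking it. The following is a minimal sketch in Python, assuming the file was saved under its listed name in the current working directory; the helper names `md5_of_file` and `archive_is_intact` are illustrative and not part of any repository tooling.

```python
import hashlib
from pathlib import Path

# Values copied from the file listing above.
ARCHIVE_NAME = "UMC004-Hindi-web-texts.originals-and-plaintexts.tgz"
EXPECTED_MD5 = "890d409f7e932dca8a6eea990ac86c12"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Compare the computed digest with the published one (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumption: the archive was saved to the current working directory.
    archive = Path(ARCHIVE_NAME)
    if archive.exists():
        print("checksum OK" if archive_is_intact(archive) else "checksum MISMATCH")
    else:
        print(f"{ARCHIVE_NAME} not found in the current directory")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.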