This is not the latest version of this item. The latest version can be found here.
Hindi Web Texts
Please use the following text to cite this item:
Bojar, Ondřej; Straňák, Pavel and Zeman, Daniel, 2011,
Hindi Web Texts, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11858/00-097C-0000-0001-CC1E-B.
Authors
Item identifier
Date issued
2011-11-23
Size
308,000,000 tokens,
18,000,000 sentences
Language(s)
Description
A Hindi corpus of texts downloaded mostly from news sites. It contains both the original raw texts and an extensively cleaned-up, tokenized version suitable for language modeling. The corpus comprises 18M sentences and 308M tokens.
Acknowledgement
European Union
Project code: FP7-ICT-2007-3-231720
Project name: EuroMatrix Plus
Ministerstvo školství, mládeže a tělovýchovy České republiky (Ministry of Education, Youth and Sports of the Czech Republic)
Project code: 7E09003
Project name: EuroMatrixPlus – Bringing Machine Translation for European Languages to the User
Collections
This item is Publicly Available
and licensed under:
Files in this item
- Name: UMC004-Hindi-web-texts.originals-and-plaintexts.tgz
- Size: 1.34 GB
- Format: application/x-gzip
- Description: gzip Archive
- MD5: 890d409f7e932dca8a6eea990ac86c12
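
The MD5 value listed for the archive can be used to check that a download completed intact. The following is a minimal sketch in Python, assuming the archive has been saved under its listed name in the current working directory; the helper name `md5_of_file` and the local path are illustrative assumptions, not part of the repository record.

```python
import hashlib
from pathlib import Path

# MD5 checksum as listed for the archive in this record.
EXPECTED_MD5 = "890d409f7e932dca8a6eea990ac86c12"

# Assumption: the archive was downloaded to the current directory
# under the file name shown in the record.
ARCHIVE = Path("UMC004-Hindi-web-texts.originals-and-plaintexts.tgz")


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat, which matters for an
    archive of more than a gigabyte.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    if ARCHIVE.exists():
        actual = md5_of_file(ARCHIVE)
        if actual == EXPECTED_MD5:
            print("Checksum matches.")
        else:
            print(f"Checksum mismatch: got {actual}")
    else:
        print(f"{ARCHIVE} not found; download it first.")
```

A mismatch usually points to a truncated or corrupted download, in which case fetching the archive again is the simplest remedy.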


