Files in this item
Download all files in item (66.13 MB)
This item is Publicly Available and licensed under: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
- Name
- README.txt
- Size
- 4.37 KB
- Format
- Text file
- Description
- Brief description of corpus formats
- MD5
- 0461668ddf034e11de3958528b64962f
HindEnCorp 0.5 and HindMonoCorp 0.5 File Formats
================================================

This file describes the file formats of the Hindi-English and Hindi-only corpora released in 2014 under the names HindEnCorp 0.5 and HindMonoCorp 0.5.

More details about the preparation of the corpora can be found in the paper:

  Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Aleš Tamchyna and Dan Zeman. HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation. In Proc. of LREC 2014. Reykjavik, Iceland. ISBN 978-2-9517408-8-4. ELRA. 2014.

or on the corpora web page:

  http://ufal.mff.cuni.cz/hindencorp

Please cite this paper if you make any use of the corpora. BibTeX citation format below.

Common Properties
-----------------

All the files are plain text:
- compressed with gzip
- encoded in UTF-8
- with unix line breaks (LF)
- with tab-delimited columns

The monolingual and parallel corpora have different columns. The actual corpus text is stored . . .
- Name
- hindencorp05.export.gz
- Size
- 43.34 MB
- Format
- application/x-gzip
- Description
- HindEnCorp 0.5 in sentence-parallel tokenized format with automatic morphological tags and lemmas
- MD5
- 192ca33c840826a78832280839ba3628
- Name
- hindencorp05.plaintext.gz
- Size
- 22.79 MB
- Format
- application/x-gzip
- Description
- HindEnCorp 0.5 in sentence-parallel plain text format
- MD5
- 512d754320c445bd9eb5c4912fee6844
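The MD5 values listed above and the common properties from the README (gzip, UTF-8, LF line breaks, tab-delimited columns) suggest a simple way to check and open a download. The following is a minimal Python sketch, not part of the corpus distribution. The helper names `md5_of_file`, `iter_rows` and `check_and_preview` are hypothetical, and because the README excerpt is truncated before the column layout, each line is returned as a plain list of fields with no column names assumed.

```python
import gzip
import hashlib
from itertools import islice


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_rows(path):
    """Yield each line of a gzipped, UTF-8, tab-delimited file as a list of fields."""
    # newline="\n" treats only LF as a line terminator, as the README states.
    with gzip.open(path, "rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")


def check_and_preview(path, expected_md5, limit=3):
    """Verify the checksum of `path`, then return its first `limit` rows."""
    actual = md5_of_file(path)
    if actual != expected_md5.lower():
        raise ValueError(f"MD5 mismatch for {path}: got {actual}")
    # Column meanings are documented in README.txt; none are assumed here.
    return list(islice(iter_rows(path), limit))
```

For example, `check_and_preview("hindencorp05.plaintext.gz", "512d754320c445bd9eb5c4912fee6844")` would raise `ValueError` for a corrupted download and otherwise return the first three rows. Reading in chunks and iterating lazily keeps memory use low for the larger files.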