Files in this item
This item is Publicly Available and licensed under: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
- Name: README.txt
- Size: 4.37 KB
- Format: Text file
- Description: Brief description of corpus formats
- MD5: 0461668ddf034e11de3958528b64962f
HindEnCorp 0.5 and HindMonoCorp 0.5 File Formats
================================================

This file describes the file formats of the Hindi-English and Hindi-only corpora released in 2014 under the names HindEnCorp 0.5 and HindMonoCorp 0.5.

More details about the preparation of the corpora can be found in the paper:

  Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Aleš Tamchyna and Dan Zeman. HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation. In Proc. of LREC 2014. Reykjavik, Iceland. ISBN 978-2-9517408-8-4. ELRA. 2014.

or on the corpora web page:

  http://ufal.mff.cuni.cz/hindencorp

Please cite this paper if you make any use of the corpora. BibTeX citation format below.

Common Properties
-----------------

All the files are plain text:

- compressed with gzip
- encoded in UTF-8
- with unix line breaks (LF)
- with tab-delimited columns

The monolingual and parallel corpora have different columns. The actual corpus text is stored . . .
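The common properties listed in the README (gzip compression, UTF-8 encoding, LF line breaks, tab-delimited columns) can be handled with the Python standard library alone. The following is a minimal sketch, not an official loader: the README preview above is truncated before it names the columns, so the function only returns the raw fields of each line, and the helper names `iter_rows` and `make_sample` as well as the sample content are made up for illustration.

```python
import gzip
import io


def iter_rows(path_or_file):
    """Yield each line of a gzip-compressed, UTF-8, LF-terminated file
    as a list of tab-separated fields.

    Accepts a file name or an already opened binary file object.
    """
    # newline="\n" treats only LF as a line terminator, matching the README.
    with gzip.open(path_or_file, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")


def make_sample():
    """Build a tiny in-memory gzip file with one tab-delimited line.

    The content is illustrative only; it is not taken from the corpus.
    """
    buf = io.BytesIO()
    with gzip.open(buf, mode="wt", encoding="utf-8", newline="\n") as out:
        out.write("col1\tयह एक वाक्य है\n")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for fields in iter_rows(make_sample()):
        print(len(fields), fields)
```

Because the function is a generator, it reads the archive as a stream, which matters for the multi-gigabyte corpus files: passing a name such as `hindmonocorp05.plaintext.gz` to `iter_rows` processes one line at a time without decompressing the whole file to disk first.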
- Name: hindmonocorp05.plaintext.gz
- Size: 2.3 GB
- Format: application/x-gzip
- Description: HindMonoCorp 0.5 segmented into sentences in plain text format
- MD5: c9b693573af7fcfbc99b7d4234a30838
- Name: hindmonocorp05.export.gz
- Size: 4.56 GB
- Format: application/x-gzip
- Description: HindMonoCorp 0.5 segmented and tokenized, with automatic morphological tags and lemmas
- MD5: cabcd337b2fe81792ee386e63a3060f5
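
The MD5 values listed for each file can be used to check that a download is complete and uncorrupted. The sketch below assumes the files were saved under the names shown in this listing; the function names `md5_of_file` and `verify` are illustrative, and the file is hashed in chunks so that the multi-gigabyte archives do not have to fit in memory.

```python
import hashlib

# Checksums copied from the file listing on this page.
EXPECTED_MD5 = {
    "README.txt": "0461668ddf034e11de3958528b64962f",
    "hindmonocorp05.plaintext.gz": "c9b693573af7fcfbc99b7d4234a30838",
    "hindmonocorp05.export.gz": "cabcd337b2fe81792ee386e63a3060f5",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, listed_name):
    """Compare a local file against the checksum listed for `listed_name`."""
    return md5_of_file(path) == EXPECTED_MD5[listed_name]


if __name__ == "__main__":
    import os

    for name in EXPECTED_MD5:
        if os.path.exists(name):
            status = "OK" if verify(name, name) else "MISMATCH"
            print(f"{name}: {status}")
        else:
            print(f"{name}: not found in the current directory")
```

A mismatch usually indicates an interrupted or corrupted download, in which case fetching the file again is the simplest remedy.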