Files in this item

This item is publicly available and licensed under:
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
Distributed under Creative Commons Attribution Required Noncommercial Share Alike
Name: README.txt
Size: 4.37 KB
Format: Text file
Description: Brief description of corpus formats
MD5: 0461668ddf034e11de3958528b64962f
 File Preview  
HindEnCorp 0.5 and HindMonoCorp 0.5 File Formats
================================================

This file describes the file formats of the Hindi-English and Hindi-only
corpora released in 2014 under the names HindEnCorp 0.5 and HindMonoCorp 0.5.

More details about the preparation of the corpora can be found in the paper:

  Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Aleš Tamchyna
  and Dan Zeman. HindEnCorp - Hindi-English and Hindi-only Corpus for Machine
  Translation. In Proc. of LREC 2014. Reykjavik, Iceland.
  ISBN 978-2-9517408-8-4. ELRA. 2014.

or on the corpora web page:

  http://ufal.mff.cuni.cz/hindencorp

Please cite this paper if you make any use of the corpora. BibTeX citation
format below.

Common Properties
-----------------

All the files are plain text:
- compressed with gzip
- encoded in UTF-8
- with unix line breaks (LF)
- with tab-delimited columns

The monolingual and parallel corpora have different columns. The actual
corpus text is stored . . .
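
The common properties listed in the README preview (gzip compression, UTF-8,
LF line breaks, tab-delimited columns) are enough to stream a corpus file
without unpacking it to disk first. The following is a minimal Python sketch
under those assumptions only. The function names read_rows and head are
hypothetical, and because the column layout is cut off in the preview above,
the code returns raw column lists and does not name any column.

```python
import gzip
from itertools import islice
from typing import Iterator, List


def read_rows(path: str) -> Iterator[List[str]]:
    """Yield each line of a gzip-compressed, tab-delimited file as a list of columns.

    Assumes UTF-8 text with LF line breaks, as stated in the README preview.
    """
    # newline="\n" treats LF as the only line terminator and disables translation.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            # Drop the trailing line break, then split on tabs.
            yield line.rstrip("\n").split("\t")


def head(path: str, limit: int = 5) -> List[List[str]]:
    """Return the first `limit` rows, useful for inspecting the column layout."""
    return list(islice(read_rows(path), limit))
```

Calling head("hindmonocorp05.plaintext.gz", 3) on a downloaded copy would
return the first three rows as lists of strings. Streaming line by line keeps
memory use flat, which matters for files of several gigabytes.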

Name: hindmonocorp05.plaintext.gz
Size: 2.3 GB
Format: application/x-gzip
Description: HindMonoCorp 0.5 segmented into sentences in plain text format
MD5: c9b693573af7fcfbc99b7d4234a30838

Name: hindmonocorp05.export.gz
Size: 4.56 GB
Format: application/x-gzip
Description: HindMonoCorp 0.5 segmented and tokenized, with automatic morphological tags and lemmas
MD5: cabcd337b2fe81792ee386e63a3060f5
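
Each file above is listed with an MD5 checksum, which can be used to confirm
that a multi-gigabyte download arrived intact. The following is a minimal
Python sketch of such a check. The checksums in EXPECTED_MD5 are copied from
the listing above. The function names md5_of_file and verify are hypothetical,
and the sketch assumes the files were saved locally under their listed names.

```python
import hashlib

# Checksums copied from the file listing on this page.
EXPECTED_MD5 = {
    "README.txt": "0461668ddf034e11de3958528b64962f",
    "hindmonocorp05.plaintext.gz": "c9b693573af7fcfbc99b7d4234a30838",
    "hindmonocorp05.export.gz": "cabcd337b2fe81792ee386e63a3060f5",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading a multi-gigabyte file into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, verify("hindmonocorp05.export.gz",
EXPECTED_MD5["hindmonocorp05.export.gz"]) should return True for an intact
download. MD5 is adequate for detecting transfer corruption but is not a
safeguard against deliberate tampering.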