HindEnCorp 0.5 and HindMonoCorp 0.5 File Formats
================================================

This file describes the file formats of the Hindi-English and Hindi-only
corpora released in 2014 under the names HindEnCorp 0.5 and HindMonoCorp 0.5.

More details about the preparation of the corpora can be found in the paper:

  Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Vít Suchomel,
  Aleš Tamchyna and Dan Zeman. HindEnCorp - Hindi-English and Hindi-only
  Corpus for Machine Translation. In Proc. of LREC 2014. Reykjavik, Iceland.
  ISBN 978-2-9517408-8-4. ELRA. 2014.

or on the corpora web page:

  http://ufal.mff.cuni.cz/hindencorp

Please cite this paper if you make any use of the corpora. The BibTeX entry
is given below.

Common Properties
-----------------

All the files are plain text:

- compressed with gzip
- encoded in UTF-8
- with Unix line breaks (LF)
- with tab-delimited columns

The monolingual and parallel corpora have different columns. The actual corpus
text is stored in one (monolingual corpus) or two (parallel corpus) of the
columns.

Plaintext vs. Export File Format
--------------------------------

Both the monolingual and the parallel corpus come in a simple plain text
format and in a tokenized, tagged and lemmatized format.

The plaintext format preserves the original tokenization (as much as possible
given the diverse sources included in our corpus).

The 'export' format is tokenized and represents each token as a '|'-delimited
triple of: the word form, the lemma, and the part-of-speech tag. If the text
itself contained the character '|' (some sources use it in place of the proper
Devanagari Danda), it is escaped as '&pipe;'.

The plaintext and export files have exactly the same number of lines.

HindEnCorp Columns
------------------

The files hindencorp05.plaintext.gz and hindencorp05.export.gz each contain
the parallel corpus and differ only in the processing of the corpus texts.
The files have these columns:

- source identifier (where the segments come from)
- alignment type (number of English segments - number of Hindi segments)
- alignment quality, which is one of the following:
    "manual"  ... for sources that were sentence-aligned manually
    "implied" ... for sources where one side was constructed by translating
                  segment by segment
    float     ... a value loosely reflecting the quality of the automatic
                  alignment; not really reliable
- English segment or segments
- Hindi segment or segments

Each of the segment fields is in the plaintext or export format as described
above. If there is more than one segment on a line (e.g. for lines with
alignment type 2-1, where there are two English segments), then the segments
are delimited with '' in the text field.

HindMonoCorp Columns
--------------------

The files hindmonocorp05.plaintext.gz and hindmonocorp05.export.gz each
contain the monolingual corpus and differ only in the processing of the corpus
text.

Each input segment (usually one Hindi sentence) is stored on a separate line.

The files have these columns:

- source identifier (where the segment comes from)
- segment type, which is one of the following:
     ... this segment is a sentence from a 'text body'
     ... this segment comes from a 'headline' (e.g. of an article)
     ... anything; the source does not allow distinguishing between body
         and headlines
- Hindi segment

BibTeX Citation for HindEnCorp and HindMonoCorp 0.5
---------------------------------------------------

@InProceedings{hindencorp05:lrec:2014,
  author    = {Ond{\v{r}}ej Bojar and Vojt{\v{e}}ch Diatka and
               Pavel Rychl{\'{y}} and Pavel Stra{\v{n}}{\'{a}}k and
               V{\'{\i}}t Suchomel and Ale{\v{s}} Tamchyna and Daniel Zeman},
  title     = "{HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation}",
  booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
  year      = {2014},
  month     = {may},
  date      = {26-31},
  address   = {Reykjavik, Iceland},
  editor    = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and
               Thierry Declerck and Hrafn Loftsson and Bente Maegaard and
               Joseph Mariani and Asuncion Moreno and Jan Odijk and
               Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {978-2-9517408-8-4},
  language  = {english}
}
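
Example: Reading the Parallel Corpus
------------------------------------

The following Python sketch is not part of the corpus release; it only
illustrates one possible way to read the formats described in this file. It
assumes that every line of the parallel files has exactly five tab-delimited
columns, and that tokens in the export format are separated by single spaces
(this file does not state the token separator, so that part is a guess). The
names Token, ParallelLine, parse_export_token, parse_export_segment,
parse_parallel_line and read_parallel are hypothetical helpers introduced here
for illustration.

```python
import gzip
from typing import Iterator, List, NamedTuple

PIPE_ESCAPE = "&pipe;"  # escape used for a literal '|' in the export format


class Token(NamedTuple):
    form: str
    lemma: str
    tag: str


class ParallelLine(NamedTuple):
    source: str             # source identifier
    alignment_type: str     # e.g. "1-1" or "2-1" (English - Hindi)
    alignment_quality: str  # "manual", "implied", or a float as text
    english: str            # English segment(s), plaintext or export
    hindi: str              # Hindi segment(s), plaintext or export


def parse_export_token(token: str) -> Token:
    """Split one 'form|lemma|tag' triple and undo the '&pipe;' escape."""
    parts = token.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected form|lemma|tag, got {token!r}")
    form, lemma, tag = (part.replace(PIPE_ESCAPE, "|") for part in parts)
    return Token(form, lemma, tag)


def parse_export_segment(segment: str) -> List[Token]:
    """Parse an export-format field; assumes space-separated tokens."""
    return [parse_export_token(tok) for tok in segment.split(" ") if tok]


def parse_parallel_line(line: str) -> ParallelLine:
    """Split one line of a hindencorp05.*.gz file into its five columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    return ParallelLine(*fields)


def read_parallel(path: str) -> Iterator[ParallelLine]:
    """Yield the lines of a gzip-compressed, UTF-8, LF-terminated file."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield parse_parallel_line(line)
```

With a made-up token, parse_export_token("houses|house|NNS") returns
Token(form='houses', lemma='house', tag='NNS'); the tag value is invented for
the example and is not taken from the corpus. Iterating over
read_parallel("hindencorp05.export.gz") and passing the english or hindi field
to parse_export_segment would then give the tokens of each segment, provided
the space-separation assumption holds. Lines with more than one segment in a
field would need an additional split on the segment delimiter first.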