Files in this item

Total size of all files: 66.13 MB
License category: Publicly Available
License: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
Name: README.txt
Size: 4.37 KB
Format: Text file
Description: Brief description of corpus formats
MD5: 0461668ddf034e11de3958528b64962f

File preview:
HindEnCorp 0.5 and HindMonoCorp 0.5 File Formats
================================================

This file describes the file formats of the Hindi-English and Hindi-only
corpora released in 2014 under the names HindEnCorp 0.5 and HindMonoCorp 0.5.

More details about the preparation of the corpora can be found in the paper:

  Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Aleš Tamchyna
  and Dan Zeman. HindEnCorp - Hindi-English and Hindi-only Corpus for
  Machine Translation. In Proc. of LREC 2014. Reykjavik, Iceland. ISBN
  978-2-9517408-8-4. ELRA. 2014.

or on the corpora web page:
  http://ufal.mff.cuni.cz/hindencorp

Please cite this paper if you make any use of the corpora. A BibTeX entry
is given below.


Common Properties
-----------------

All the files are plain text:

- compressed with gzip
- encoded in UTF-8
- with unix line breaks (LF)
- with tab-delimited columns

The monolingual and parallel corpora have different columns.
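
The common properties listed above can be sketched as a small generic reader.
This is only an illustration, not code shipped with the corpora: the function
name `read_rows` is hypothetical, and because the column layout differs between
the monolingual and parallel files, the sketch makes no assumption about how
many columns a line has or what they mean.

```python
import gzip
from typing import Iterator, List


def read_rows(path: str) -> Iterator[List[str]]:
    """Yield each line of a gzip-compressed, UTF-8, tab-delimited file
    as a list of column strings.

    `read_rows` is a hypothetical helper name, not part of the corpus release.
    """
    # newline="\n" makes only LF count as a line terminator, so any other
    # control characters inside a column are passed through unchanged.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            # Drop the trailing LF, then split on tabs to get the columns.
            yield line.rstrip("\n").split("\t")
```

It could be used, for instance, as
`for columns in read_rows("hindencorp05.plaintext.gz"): ...`, with the caller
deciding how to interpret each column.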

The actual corpus text is stored . . .
                                            
Name: hindencorp05.export.gz
Size: 43.34 MB
Format: application/x-gzip
Description: HindEnCorp 0.5 in sentence-parallel tokenized format with automatic morphological tags and lemmas
MD5: 192ca33c840826a78832280839ba3628

Name: hindencorp05.plaintext.gz
Size: 22.79 MB
Format: application/x-gzip
Description: HindEnCorp 0.5 in sentence-parallel plain text format
MD5: 512d754320c445bd9eb5c4912fee6844
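
The MD5 values listed for each file can be used to check that a download
arrived intact. The following is a minimal sketch, assuming the files were
saved under their listed names; `md5_of_file`, `verify` and `EXPECTED_MD5` are
illustrative names, not part of the release.

```python
import hashlib

# Checksums copied from the file listing on this page.
EXPECTED_MD5 = {
    "README.txt": "0461668ddf034e11de3958528b64962f",
    "hindencorp05.export.gz": "192ca33c840826a78832280839ba3628",
    "hindencorp05.plaintext.gz": "512d754320c445bd9eb5c4912fee6844",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks
    so that large archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the expected hexadecimal value."""
    return md5_of_file(path) == expected_md5.lower()
```

A call such as
`verify("hindencorp05.export.gz", EXPECTED_MD5["hindencorp05.export.gz"])`
would return `True` only if the local copy matches the listed checksum.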