
HindMonoCorp 0.5

Please use the following text to cite this item:
Bojar, Ondřej; et al., 2014, HindMonoCorp 0.5, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11858/00-097C-0000-0023-6260-A.
Date issued
2014-03-21
Size
365000000 tokens
Language(s)
Description
Hindi monolingual corpus. It is based primarily on web crawls performed using various tools and at various times. Since the web is a living data source, we treat these crawls as completely separate sources, even though they may overlap. To estimate the magnitude of this overlap, we compared the total number of segments obtained by concatenating the individual sources (each source deduplicated on its own) with the number of segments obtained by deduplicating all sources together. The difference is only around 1%, confirming that the various web crawls (or their subsequent processing) differ significantly.

HindMonoCorp contains data from:

- Hindi web texts: a monolingual corpus containing mainly Hindi news articles has already been collected and released by Bojar et al. (2008). We use the HTML files as crawled for this corpus in 2010, add a small crawl performed in 2013, and re-process them with the current pipeline. These sources are denoted HWT 2010 and HWT 2013 in the following.
- Hindi corpora in W2C: collected by Martin Majliš during his project to automatically collect corpora in many languages (Majliš and Žabokrtský, 2012). There are in fact two corpora of Hindi available: one from web harvest (W2C Web) and one from Wikipedia (W2C Wiki).
- SpiderLing: a web crawl carried out during November and December 2013 using SpiderLing (Suchomel and Pomikálek, 2012). The pipeline includes extraction of plain texts and deduplication at the level of documents, see below.
- CommonCrawl: a non-profit organization that regularly crawls the web and provides anyone with the data. We are grateful to Christian Buck for extracting plain text Hindi segments from the 2012 and 2013-fall crawls for us.
- Intercorp: 7 books with their translations, scanned and manually aligned per paragraph.
- RSS feeds: from Webdunia.com and the Hindi version of BBC International, followed by our custom crawler from September 2013 till January 2014.
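The overlap estimate described above can be illustrated with a minimal sketch. It assumes each source is available as an iterable of text segments and that the overlap is reported relative to the concatenated total; the function name `overlap_estimate` and the toy data are hypothetical, and the exact formula used by the authors is not given here.

```python
def overlap_estimate(sources):
    """Estimate cross-source overlap.

    sources: mapping of source name -> iterable of text segments.
    Returns the relative reduction in segment count when all sources are
    deduplicated together instead of each on its own (an assumed definition).
    """
    # Deduplicate each source on its own, then sum the sizes.
    per_source = {name: set(segments) for name, segments in sources.items()}
    separate_total = sum(len(unique) for unique in per_source.values())
    if separate_total == 0:
        return 0.0
    # Deduplicate all sources together.
    joint_total = len(set().union(*per_source.values()))
    return (separate_total - joint_total) / separate_total


# Toy data, not taken from the corpus.
toy = {
    "crawl_a": ["s1", "s2", "s2", "s3"],
    "crawl_b": ["s3", "s4"],
}
ratio = overlap_estimate(toy)
```

A value near 0.01 would correspond to the roughly 1% difference reported for HindMonoCorp.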
Acknowledgement
Subject(s)

Version History

Version    Date                   Summary
2*         2014-03-21 00:00:00
           2011-11-23 00:00:00
* Selected version
This item is Publicly Available
and licensed under:
 Files in this item
Name
hindmonocorp05.plaintext.gz
Size
2.3 GB
Format
application/x-gzip
Description
gzip Archive
MD5
c9b693573af7fcfbc99b7d4234a30838
Name
README.txt
Size
4.37 KB
Format
text/plain
Description
Text
MD5
0461668ddf034e11de3958528b64962f
Preview
  File Preview
    HindEnCorp 0.5 and HindMonoCorp 0.5 File Formats
    ================================================
    
    This file describes the file formats of the Hindi-English and Hindi-only
    corpora released in 2014 under the names HindEnCorp 0.5 and HindMonoCorp 0.5.
    
    More details about the preparation of the corpora can be found in the paper:
    
      Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Aleš Tamchyna
      and Dan Zeman. HindEnCorp - Hindi-English and Hindi-only Corpus for
      Machine Translation. In Proc. of LREC 2014. Reykjavik, Iceland. ISBN
      978-2-9517408-8-4. ELRA. 2014.
    
    or on the corpora web page:
      http://ufal.mff.cuni.cz/hindencorp
    
    Please cite this paper if you make any use of the corpora. BibTeX citation
    format below.
    
    
    Common Properties
    -----------------
    
    All the files are plain text:
    
    - compressed with gzip
    - encoded in UTF-8
    - with unix line breaks (LF)
    - with tab-delimited columns
    
    The monolingual and parallel corpora have different columns.
    
    The actual corpus text is stored in one (monolingual corpus) or two (parallel
    corpus) of the columns.
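The common properties listed above (gzip, UTF-8, LF line breaks, tab-delimited columns) suggest a simple streaming reader. The following is a minimal sketch; the function name `iter_rows` is hypothetical, and which column holds the corpus text is not assumed here, so the caller has to pick it according to the column descriptions.

```python
import gzip


def iter_rows(path):
    """Yield each line of a gzip-compressed corpus file as a list of columns."""
    # newline="\n" keeps LF as the only line terminator, so stray CR or other
    # characters inside a segment are not treated as line breaks.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")
```

Because the files are several gigabytes in size, iterating line by line avoids loading the whole corpus into memory.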
    
    
    Plaintext vs. Export File Format
    --------------------------------
    
    Both the monolingual and the parallel corpus come in a simple plain text format
    and in a tokenized, tagged and lemmatized format.
    
    The plaintext format preserves the original tokenization (as much as possible
    given the diverse sources included in our corpus).
    
    The 'export' format is tokenized and represents each token as a '|'-delimited
    triple of: the word form, the lemma, and the part-of-speech tag. If the
    character '|' occurred in the text (it is also used instead of the proper
    Devanagari Danda in some sources), we escape it as '&pipe;'.
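A token in the export format could be unpacked roughly as follows. This is a sketch under stated assumptions: tokens within a text column are assumed to be separated by single spaces (not stated in this preview), the function names are hypothetical, and the tag values in the usage example are made up for illustration.

```python
def parse_export_token(token):
    """Split one 'form|lemma|tag' triple and undo the '&pipe;' escape."""
    parts = token.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected 3 '|'-delimited fields, got {len(parts)}: {token!r}")
    # A literal '|' in the original text is stored as '&pipe;'.
    form, lemma, tag = (part.replace("&pipe;", "|") for part in parts)
    return form, lemma, tag


def parse_export_text(text):
    """Parse a whole export-format text column (assumed space-separated tokens)."""
    return [parse_export_token(token) for token in text.split(" ") if token]


# Hypothetical tokens; 'X' and 'SYM' are placeholder tags, not the corpus tagset.
example = parse_export_text("word|lemma|X &pipe;|&pipe;|SYM")
```

Unescaping is done after splitting, so an escaped pipe never gets mistaken for a field delimiter.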
    
    The plaintext and export files contain exactly the same number of lines.
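The line-count property gives a cheap sanity check after download. A minimal sketch, with hypothetical function names, that counts LF-terminated lines in each gzip file and compares them:

```python
import gzip


def count_lines(path):
    """Count lines in a gzip-compressed file without decoding the text."""
    # Iterating a binary file object splits on b"\n" only.
    with gzip.open(path, "rb") as handle:
        return sum(1 for _ in handle)


def same_line_count(plaintext_path, export_path):
    """Return True if both files have the same number of lines."""
    return count_lines(plaintext_path) == count_lines(export_path)
```

A call such as `same_line_count("hindmonocorp05.plaintext.gz", "hindmonocorp05.export.gz")` is expected to return True for intact downloads; it reads both archives in full, so it takes a while on files of this size.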
    
    
    HindEnCorp Columns
    ------------------
    
    The files hindencorp05.plaintext.gz and hindencorp05.export.gz each contain the
    parallel corpus and differ only in the processing of the corpus texts. The
    files have these columns:
    
    - sou . . .
Name
hindmonocorp05.export.gz
Size
4.56 GB
Format
application/x-gzip
Description
gzip Archive
MD5
cabcd337b2fe81792ee386e63a3060f5
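The MD5 values listed for each file can be used to verify a download. The sketch below uses Python's standard `hashlib`; the helper name `md5_of_file`, the `EXPECTED_MD5` table, and the assumption that the files sit in the current directory under their listed names are illustrative only.

```python
import hashlib

# Checksums as listed for the files in this item.
EXPECTED_MD5 = {
    "hindmonocorp05.plaintext.gz": "c9b693573af7fcfbc99b7d4234a30838",
    "README.txt": "0461668ddf034e11de3958528b64962f",
    "hindmonocorp05.export.gz": "cabcd337b2fe81792ee386e63a3060f5",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte files fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at path matches the listed checksum for name."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify("hindmonocorp05.plaintext.gz", "hindmonocorp05.plaintext.gz")` should return True for an uncorrupted download.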