HindMonoCorp 0.5

Name: HindMonoCorp 0.5
License: http://creativecommons.org/licenses/by-nc-sa/3.0/
Keywords: corpus

Bojar, Ondřej; Diatka, Vojtěch; Rychlý, Pavel; Straňák, Pavel; Suchomel, Vít; Tamchyna, Aleš; Zeman, Daniel

dc.contributor.author	Bojar, Ondřej
dc.contributor.author	Diatka, Vojtěch
dc.contributor.author	Rychlý, Pavel
dc.contributor.author	Straňák, Pavel
dc.contributor.author	Suchomel, Vít
dc.contributor.author	Tamchyna, Aleš
dc.contributor.author	Zeman, Daniel
dc.date.accessioned	2014-03-21T22:36:19Z
dc.date.available	2014-03-21T22:36:19Z
dc.date.issued	2014-03-21
dc.identifier.uri	http://hdl.handle.net/11858/00-097C-0000-0023-6260-A
dc.description	Hindi monolingual corpus. It is based primarily on web crawls performed using various tools and at various times. Since the web is a living data source, we treat these crawls as completely separate sources, even though they may overlap. To estimate the magnitude of this overlap, we compared the total number of segments obtained by concatenating the individual sources (each source deduplicated on its own) with the number of segments obtained by deduplicating all sources together. The difference is only around 1%, confirming that the various web crawls (or their subsequent processing) differ significantly. HindMonoCorp contains data from the following sources. Hindi web texts: a monolingual corpus containing mainly Hindi news articles has already been collected and released by Bojar et al. (2008). We use the HTML files as crawled for this corpus in 2010, add a small crawl performed in 2013, and re-process them with the current pipeline. These sources are denoted HWT 2010 and HWT 2013 in the following. W2C: Hindi corpora in W2C were collected by Martin Majliš during his project to automatically collect corpora in many languages (Majliš and Žabokrtský, 2012). There are in fact two Hindi corpora available, one from a web harvest (W2C Web) and one from Wikipedia (W2C Wiki). SpiderLing: a web crawl carried out during November and December 2013 using SpiderLing (Suchomel and Pomikálek, 2012). The pipeline includes extraction of plain texts and deduplication at the level of documents, see below. CommonCrawl: a non-profit organization that regularly crawls the web and provides anyone with the data. We are grateful to Christian Buck for extracting plain-text Hindi segments from the 2012 and 2013-fall crawls for us. Intercorp: 7 books with their translations, scanned and manually aligned per paragraph. RSS feeds: feeds from Webdunia.com and the Hindi version of BBC International, followed by our custom crawler from September 2013 till January 2014.
dc.description.sponsorship	LM2010013
dc.language.iso	hin
dc.publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation.replaces	http://hdl.handle.net/11858/00-097C-0000-0001-CC1E-B
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/3.0/
dc.subject	corpus
dc.title	HindMonoCorp 0.5
dc.type	corpus
metashare.ResourceInfo#ContactInfo#PersonInfo.surname	Bojar
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName	Ondřej
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName	Charles University in Prague, UFAL
metashare.ResourceInfo#DistributionInfo.availability	notAvailable
metashare.ResourceInfo#ContentInfo.mediaType	text
metashare.ResourceInfo#TextInfo#SizeInfo.size	365000000
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit	tokens
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo#CommunicationInfo.email	bojar@ufal.mff.cuni.cz
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
sponsor	Ministerstvo školství, mládeže a tělovýchovy České republiky LM2010013 LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat nationalFunds
size.info	365000000 tokens
files.size	7365223829
files.count	3
featuredService.kontext	search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=hindmonocorp_05_m
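
The overlap estimate described in dc.description can be illustrated with a minimal sketch. It assumes that deduplication means exact string matching of segments and that the difference is measured relative to the per-source total; the record states neither, so both are assumptions. The function name `overlap_estimate` and the toy data are hypothetical and not taken from the corpus pipeline.

```python
from typing import Dict, Iterable


def overlap_estimate(sources: Dict[str, Iterable[str]]) -> float:
    """Relative drop in segment count when all sources are deduplicated
    together, compared with deduplicating each source on its own.

    Assumption: segments are duplicates only if their strings are identical.
    """
    per_source_total = 0
    all_segments = set()
    for segments in sources.values():
        unique = set(segments)            # deduplicate within one source
        per_source_total += len(unique)   # size of the concatenated sources
        all_segments |= unique            # joint deduplication across sources
    if per_source_total == 0:
        return 0.0
    return (per_source_total - len(all_segments)) / per_source_total


# Toy data for illustration only (not from HindMonoCorp).
toy_sources = {
    "crawl_a": ["s1", "s2", "s2", "s3"],
    "crawl_b": ["s3", "s4"],
}
toy_overlap = overlap_estimate(toy_sources)
```

A small value from such a comparison means that few segments are shared between sources, which is the reading the description gives to its figure of about 1%.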

Files in this item

This item is Publicly Available and licensed under:
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)

Name: README.txt
Size: 4.37 KB
Format: Text file
Description: Brief description of corpus formats
MD5: 0461668ddf034e11de3958528b64962f


File Preview

HindEnCorp 0.5 and HindMonoCorp 0.5 File Formats
================================================

This file describes the file formats of the Hindi-English and Hindi-only
corpora released in 2014 under the names HindEnCorp 0.5 and HindMonoCorp 0.5.

More details about the preparation of the corpora can be found in the paper:

  Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Aleš Tamchyna
  and Dan Zeman. HindEnCorp - Hindi-English and Hindi-only Corpus for
  Machine Translation. In Proc. of LREC 2014. Reykjavik, Iceland. ISBN
  978-2-9517408-8-4. ELRA. 2014.

or on the corpora web page:
  http://ufal.mff.cuni.cz/hindencorp

Please cite this paper if you make any use of the corpora. The citation in
BibTeX format is given below.


Common Properties
-----------------

All the files are plain text:

- compressed with gzip
- encoded in UTF-8
- with unix line breaks (LF)
- with tab-delimited columns

The monolingual and parallel corpora have different columns.
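
The common properties listed above (gzip, UTF-8, LF line breaks, tab-delimited
columns) suggest a simple way to stream any of the files. The following is a
minimal sketch that uses only the Python standard library. The function name
`read_rows` is hypothetical, and the sketch makes no assumption about how many
columns a file has or what they mean.

```python
import gzip
from typing import Iterator, List


def read_rows(path: str) -> Iterator[List[str]]:
    """Yield each line of a gzip-compressed, UTF-8, tab-delimited file
    as a list of column strings."""
    # newline="\n" treats LF as the only line terminator, as the README states.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            # Strip only the line terminator so empty trailing columns survive.
            yield line.rstrip("\n").split("\t")
```

Because the files are large, iterating row by row avoids loading the whole
corpus into memory.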

The actual corpus text is stored . . .

Name: hindmonocorp05.plaintext.gz
Size: 2.3 GB
Format: application/x-gzip
Description: HindMonoCorp 0.5 segmented into sentences in plain text format
MD5: c9b693573af7fcfbc99b7d4234a30838


Name: hindmonocorp05.export.gz
Size: 4.56 GB
Format: application/x-gzip
Description: HindMonoCorp 0.5 segmented and tokenized, with automatic morphological tags and lemmas
MD5: cabcd337b2fe81792ee386e63a3060f5
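
The MD5 values listed for the files can be used to check a download. The following is a minimal sketch using Python's standard `hashlib`. The helper names `md5_of_file` and `verify` and the local paths passed to them are hypothetical, while the checksums are copied from the file entries in this record.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums as listed in this record.
EXPECTED_MD5 = {
    "README.txt": "0461668ddf034e11de3958528b64962f",
    "hindmonocorp05.plaintext.gz": "c9b693573af7fcfbc99b7d4234a30838",
    "hindmonocorp05.export.gz": "cabcd337b2fe81792ee386e63a3060f5",
}


def verify(path: str, name: str) -> bool:
    """Return True if the file at `path` matches the listed MD5 for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

A mismatch usually indicates an incomplete or corrupted download; MD5 is suitable for this kind of integrity check but not for security purposes.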

