HindEnCorp 0.5

Name: HindEnCorp 0.5
License: http://creativecommons.org/licenses/by-nc-sa/3.0/

Bojar, Ondřej; Diatka, Vojtěch; Straňák, Pavel; Tamchyna, Aleš; Zeman, Daniel

dc.contributor.author	Bojar, Ondřej
dc.contributor.author	Diatka, Vojtěch
dc.contributor.author	Straňák, Pavel
dc.contributor.author	Tamchyna, Aleš
dc.contributor.author	Zeman, Daniel
dc.date.accessioned	2014-03-21T22:24:57Z
dc.date.available	2014-03-21T22:24:57Z
dc.date.issued	2014-03-21
dc.identifier.uri	http://hdl.handle.net/11858/00-097C-0000-0023-625F-0
dc.description	HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008). Commentaries by Daniel Pipes contain 322 articles in English written by the journalist Daniel Pipes and translated into Hindi. EMILLE (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages. Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and an agriculture-domain parallel corpus. For the current release, we are extending the parallel corpus using these sources: Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp’s core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominantly short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, an English translation is available for only three of them; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.
TED talks, held in various languages, primarily English, are equipped with transcripts, and these are translated into 102 languages. There are 179 talks for which a Hindi translation is available. The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, from typesetting and punctuation through capitalization, spelling and word choice to sentence structure. Some quality control could in principle be obtained from the fact that every input sentence was translated 4 times. We used the 2012 release of the corpus. Launchpad.net is a software collaboration platform that hosts many open-source projects and also facilitates collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files. Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entity that appear on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.
dc.description.sponsorship	LM2010013
dc.language.iso	hin
dc.language.iso	eng
dc.publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation.replaces	http://hdl.handle.net/11858/00-097C-0000-0001-BD17-1
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/3.0/
dc.subject	parallel corpus
dc.subject	English-Hindi parallel corpus
dc.subject	sentence-parallel
dc.title	HindEnCorp 0.5
dc.type	corpus
metashare.ResourceInfo#ContactInfo#PersonInfo.surname	Bojar
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName	Ondřej
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName	Charles University in Prague, UFAL
metashare.ResourceInfo#DistributionInfo.availability	notAvailable
metashare.ResourceInfo#ContentInfo.mediaType	text
metashare.ResourceInfo#TextInfo#SizeInfo.size	132 300
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit	sentences
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo#CommunicationInfo.email	bojar@ufal.mff.cuni.cz
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
sponsor	Ministerstvo školství, mládeže a tělovýchovy České republiky LM2010013 LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat nationalFunds
size.info	132300 sentences
files.size	69346282
files.count	3
featuredService.kontext	English-Hindi\|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=hindencorp_05_en_m
featuredService.kontext	Hindi-English\|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=hindencorp_05_hi_m

Files in this item

This item is publicly available and licensed under:
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)

Name: README.txt
Size: 4.37 KB
Format: Text file
Description: Brief description of corpus formats
MD5: 0461668ddf034e11de3958528b64962f

HindEnCorp 0.5 and HindMonoCorp 0.5 File Formats
================================================

This file describes the file formats of the Hindi-English and Hindi-only
corpora released in 2014 under the names HindEnCorp 0.5 and HindMonoCorp 0.5.

More details about the preparation of the corpora can be found in the paper:

  Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Aleš Tamchyna
  and Dan Zeman. HindEnCorp - Hindi-English and Hindi-only Corpus for
  Machine Translation. In Proc. of LREC 2014. Reykjavik, Iceland. ISBN
  978-2-9517408-8-4. ELRA. 2014.

or on the corpora web page:
  http://ufal.mff.cuni.cz/hindencorp

Please cite this paper if you make any use of the corpora. BibTeX citation
format below.


Common Properties
-----------------

All the files are plain text:

- compressed with gzip
- encoded in UTF-8
- with unix line breaks (LF)
- with tab-delimited columns

The monolingual and parallel corpora have different columns.
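
The common properties listed above (gzip, UTF-8, LF line breaks, tab-delimited
columns) are enough to read any of the files generically. The following is a
minimal sketch in Python, not part of the original README. The function names
`read_rows` and `column_count_histogram` are hypothetical, and the sketch
deliberately assumes nothing about what the individual columns mean, since that
is described separately for each corpus.

```python
import gzip
from collections import Counter
from itertools import islice
from typing import Dict, Iterator, List


def read_rows(path: str) -> Iterator[List[str]]:
    """Yield each line of a gzip-compressed, tab-delimited UTF-8 file as a list of columns."""
    # newline="\n" makes LF the only line terminator, matching the README,
    # so a stray carriage return inside the text does not split a line.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            # Drop the trailing LF, then split on tabs; no quoting rules are applied.
            yield line.rstrip("\n").split("\t")


def column_count_histogram(path: str, limit: int = 1000) -> Dict[int, int]:
    """Count how many of the first `limit` lines have each number of columns."""
    counts = Counter(len(row) for row in islice(read_rows(path), limit))
    return dict(counts)
```

For example, `column_count_histogram("hindencorp05.plaintext.gz")` would report
how many columns the first 1000 lines of the plain text release have, which is
a quick sanity check before relying on any particular column position.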

The actual corpus text is stored . . .

Name: hindencorp05.export.gz
Size: 43.34 MB
Format: application/x-gzip
Description: HindEnCorp 0.5 in sentence-parallel tokenized format with automatic morphological tags and lemmas
MD5: 192ca33c840826a78832280839ba3628

Name: hindencorp05.plaintext.gz
Size: 22.79 MB
Format: application/x-gzip
Description: HindEnCorp 0.5 in sentence-parallel plain text format
MD5: 512d754320c445bd9eb5c4912fee6844
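
The MD5 values listed for the three files can be used to check that a download
is complete. The sketch below is one way to do this in Python. It assumes the
files were saved under the names shown in this record, and the helper names
`md5_of_file` and `verify` are hypothetical. MD5 is suitable here only as an
integrity check against accidental corruption, not as a security guarantee.

```python
import hashlib

# Checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "README.txt": "0461668ddf034e11de3958528b64962f",
    "hindencorp05.export.gz": "192ca33c840826a78832280839ba3628",
    "hindencorp05.plaintext.gz": "512d754320c445bd9eb5c4912fee6844",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, name: str) -> bool:
    """Compare the digest of the file at `path` with the listed checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

Calling `verify("hindencorp05.plaintext.gz", "hindencorp05.plaintext.gz")`
returns `True` only if the local copy matches the listed checksum.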