Show simple item record

 
dc.contributor.author Jawaid, Bushra
dc.contributor.author Zeman, Daniel
dc.date.accessioned 2018-01-05T15:38:19Z
dc.date.available 2018-01-05T15:38:19Z
dc.date.issued 2010
dc.identifier.uri http://hdl.handle.net/11234/1-2582
dc.description The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of the crawled data include, but are not limited to, the following: 1. Manually corrected sentence alignment of the corpora. 2. A fixed data split (training, development, test) so that our published experiments can be reproduced. 3. Tokenization (optional, but needed to reproduce our experiments). 4. Normalization (optional), e.g. of European vs. Urdu numerals and of European vs. Urdu punctuation, and removal of Urdu diacritics.
dc.language.iso eng
dc.language.iso urd
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation.isreferencedby https://ufal.mff.cuni.cz/pbml/95/art-jawaid-zeman.pdf
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source.uri http://ufal.mff.cuni.cz/umc/005-en-ur/
dc.subject parallel corpus
dc.subject religious text
dc.subject machine translation
dc.title English-Urdu Religious Parallel Corpus
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Bushra Jawaid bushrajd84@hotmail.com Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor Ministerstvo školství, mládeže a tělovýchovy České republiky MSM0021620838 Modern Methods, Structures and Systems of Computer Science nationalFunds
sponsor Grantová agentura České republiky GAP406/11/1499 Čeština ve věku strojového překladu nationalFunds
sponsor Charles University in Prague SVV 261314/2010 SVV 261 314 Other
size.info 14371 sentences
files.size 3683565
files.count 1
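
The normalization mentioned in dc.description (numerals, punctuation, diacritics) can be illustrated with a minimal Python sketch. This is not the authors' normalization script. The function name `normalize_urdu`, the choice to map towards European digits and punctuation, and the exact set of code points treated as diacritics are all assumptions made for illustration; the record does not specify them.

```python
# Hypothetical normalizer; the mapping direction and character sets are
# assumptions, not taken from the corpus documentation.

# Extended Arabic-Indic digits (U+06F0..U+06F9) and Arabic-Indic digits
# (U+0660..U+0669) are both mapped to European digits 0-9.
_CHAR_MAP = {}
for i in range(10):
    _CHAR_MAP[0x06F0 + i] = str(i)
    _CHAR_MAP[0x0660 + i] = str(i)

# Urdu/Arabic punctuation mapped to European counterparts.
_CHAR_MAP.update({
    0x06D4: ".",  # ARABIC FULL STOP
    0x060C: ",",  # ARABIC COMMA
    0x061B: ";",  # ARABIC SEMICOLON
    0x061F: "?",  # ARABIC QUESTION MARK
})

# Short-vowel marks and related signs (U+064B..U+0652) plus superscript alef.
_DIACRITICS = {cp: None for cp in list(range(0x064B, 0x0653)) + [0x0670]}


def normalize_urdu(text, strip_diacritics=True):
    """Return text with digits and punctuation mapped, diacritics optionally removed."""
    table = dict(_CHAR_MAP)
    if strip_diacritics:
        table.update(_DIACRITICS)  # a None value deletes the character
    return text.translate(table)
```

Because the record marks normalization as optional, the sketch keeps diacritic removal behind a flag so that the unnormalized text can still be recovered from the same input.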


 Files in this item

Name en-ur-parallel-corpus.zip
Size 3.51 MB
Format application/zip
Description Unknown
MD5 8440be07c883b4c0289961ba577a634b
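
The MD5 checksum and byte size listed in this record can be used to check a downloaded copy of the archive. The following is a minimal sketch using Python's standard `hashlib`; the function names and the local path passed to them are hypothetical, while the expected values are copied from this record (files.size and MD5).

```python
import hashlib
import os

EXPECTED_MD5 = "8440be07c883b4c0289961ba577a634b"  # MD5 listed in this record
EXPECTED_SIZE = 3683565                            # files.size listed in this record


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if the file matches the expected size and checksum."""
    if os.path.getsize(path) != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()
```

A call such as `verify_download("en-ur-parallel-corpus.zip")` would then return `True` only for a byte-identical copy of the published file.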
 File Preview  
  • bible
    • test.ur (59 kB)
    • Bible-UR (1 MB)
    • dev.en (41 kB)
    • train.ur (1 MB)
    • dev.ur (65 kB)
    • test.en (39 kB)
    • Bible-EN (956 kB)
    • Bible-UR-normalized (1 MB)
    • train.en (875 kB)
  • quran
    • test.ur (24 kB)
    • Quran-EN (1 MB)
    • dev.en (16 kB)
    • train.ur (1 MB)
    • Quran-UR (1 MB)
    • Quran-UR-normalized (1 MB)
    • dev.ur (23 kB)
    • test.en (16 kB)
    • train.en (1 MB)
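
The split files listed above come in English/Urdu pairs (for example train.en and train.ur). A minimal Python sketch for loading one such pair follows. It assumes that the two files are UTF-8 encoded and line-aligned, with one sentence per line; the record describes the corpus as sentence-aligned but does not document the file format, so this should be checked against the data. The function name `read_parallel` and the example paths are hypothetical.

```python
def _read_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]


def read_parallel(en_path, ur_path):
    """Return a list of (english, urdu) sentence pairs.

    Assumes line i of en_path is the translation of line i of ur_path.
    """
    en_lines = _read_lines(en_path)
    ur_lines = _read_lines(ur_path)
    if len(en_lines) != len(ur_lines):
        raise ValueError(
            f"line count mismatch: {len(en_lines)} English vs {len(ur_lines)} Urdu"
        )
    return list(zip(en_lines, ur_lines))
```

After unpacking the archive, a call such as `read_parallel("bible/train.en", "bible/train.ur")` would yield the training pairs for the Bible portion, provided the alignment assumption holds.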