dc.contributor.author | Jawaid, Bushra |
dc.contributor.author | Zeman, Daniel |
dc.date.accessioned | 2018-01-05T15:38:19Z |
dc.date.available | 2018-01-05T15:38:19Z |
dc.date.issued | 2010 |
dc.identifier.uri | http://hdl.handle.net/11234/1-2582 |
dc.description | The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of the crawled data include, but are not limited to, the following: 1. Manually corrected sentence alignment of the corpora. 2. Our data split (training-development-test), so that our published experiments can be reproduced. 3. Tokenization (optional, but needed to reproduce our experiments). 4. Normalization (optional), e.g. of European vs. Urdu numerals and European vs. Urdu punctuation, and removal of Urdu diacritics. |
dc.language.iso | eng |
dc.language.iso | urd |
dc.publisher | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
dc.relation.isreferencedby | https://ufal.mff.cuni.cz/pbml/95/art-jawaid-zeman.pdf |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.source.uri | http://ufal.mff.cuni.cz/umc/005-en-ur/ |
dc.subject | parallel corpus |
dc.subject | religious text |
dc.subject | machine translation |
dc.title | English-Urdu Religious Parallel Corpus |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Bushra Jawaid bushrajd84@hotmail.com Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
sponsor | Ministerstvo školství, mládeže a tělovýchovy České republiky MSM0021620838 Modern Methods, Structures and Systems of Computer Science nationalFunds |
sponsor | Grantová agentura České republiky GAP406/11/1499 Čeština ve věku strojového překladu nationalFunds |
sponsor | Charles University in Prague SVV 261314/2010 SVV 261 314 Other |
size.info | 14371 sentences |
files.size | 3683565 |
files.count | 1 |
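
The optional normalization named in dc.description (numerals, punctuation, diacritics) is not specified in detail in this record. The following is only a minimal sketch of what such a step could look like. The direction of the mapping (Urdu to European), the character sets, and the function name normalize_urdu are all assumptions made for illustration, not the authors' actual procedure.

```python
import re

# ASSUMPTION: map Extended Arabic-Indic digits (U+06F0..U+06F9) and
# Arabic-Indic digits (U+0660..U+0669) to European (ASCII) digits.
DIGIT_MAP = {0x06F0 + i: str(i) for i in range(10)}
DIGIT_MAP.update({0x0660 + i: str(i) for i in range(10)})

# ASSUMPTION: map Arabic-script punctuation to European equivalents.
PUNCT_MAP = {
    0x06D4: ".",  # ARABIC FULL STOP
    0x060C: ",",  # ARABIC COMMA
    0x061F: "?",  # ARABIC QUESTION MARK
    0x061B: ";",  # ARABIC SEMICOLON
}

# ASSUMPTION: a common subset of Arabic-script diacritics
# (U+064B..U+0652 and superscript alef U+0670); the set actually
# removed from the corpus is not stated in this record.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def normalize_urdu(text, digits=True, punctuation=True, diacritics=True):
    """Apply each optional normalization step to `text` and return the result."""
    if digits:
        text = text.translate(DIGIT_MAP)
    if punctuation:
        text = text.translate(PUNCT_MAP)
    if diacritics:
        text = DIACRITICS.sub("", text)
    return text
```

Each step is switched on by its own keyword argument, mirroring the record's statement that normalization is optional.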
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
- Name: en-ur-parallel-corpus.zip
- Size: 3.51 MB
- Format: application/zip
- Description: Unknown
- MD5: 8440be07c883b4c0289961ba577a634b
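
The MD5 value listed for the file can be used to check a downloaded copy. A minimal sketch follows, assuming the archive was saved locally under the listed name; the helper names md5_of_file and matches_expected are hypothetical and not part of the repository.

```python
import hashlib

# MD5 value copied from the file listing in this record.
EXPECTED_MD5 = "8440be07c883b4c0289961ba577a634b"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 equals `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

For example, matches_expected("en-ur-parallel-corpus.zip") should return True for an intact download placed in the current directory. MD5 is suitable here only for detecting transfer corruption, not for security purposes.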