English-Urdu Religious Parallel Corpus

LINDAT / CLARIAH-CZ

Authors: Jawaid, Bushra and Zeman, Daniel

Item identifier: http://hdl.handle.net/11234/1-2582

Project URL: http://ufal.mff.cuni.cz/umc/005-en-ur/

Referenced by: https://ufal.mff.cuni.cz/pbml/95/art-jawaid-zeman.pdf

Date issued: 2010

Type: corpus, text

Size: 14371 sentences

Language(s): English, Urdu

Description: The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of the crawled data include, but are not limited to, the following:
1. Manually corrected sentence alignment of the corpora.
2. Our data split (training, development, test), so that our published experiments can be reproduced.
3. Tokenization (optional, but needed to reproduce our experiments).
4. Normalization (optional), e.g. of European vs. Urdu numerals and European vs. Urdu punctuation, and removal of Urdu diacritics.
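
The optional normalization mentioned in the description could look roughly like the Python sketch below. It is not the authors' script. The function name `normalize_urdu`, the direction of the mapping (Urdu digits and punctuation to their European counterparts), and the exact character sets are all assumptions made for illustration; the published normalization may differ.

```python
import unicodedata

# Assumption: digits are normalized towards ASCII. Both Extended Arabic-Indic
# (U+06F0..U+06F9) and Arabic-Indic (U+0660..U+0669) digits are covered.
DIGITS = {0x06F0 + i: str(i) for i in range(10)}
DIGITS.update({0x0660 + i: str(i) for i in range(10)})

# Assumption: a small, illustrative set of punctuation marks.
PUNCTUATION = {
    0x06D4: ".",  # Arabic full stop
    0x060C: ",",  # Arabic comma
    0x061F: "?",  # Arabic question mark
    0x061B: ";",  # Arabic semicolon
}

TABLE = {**DIGITS, **PUNCTUATION}


def normalize_urdu(text):
    """Map digits and punctuation to ASCII and drop nonspacing marks."""
    text = text.translate(TABLE)
    # Arabic-script diacritics (zabar, zer, pesh, ...) have Unicode category
    # "Mn"; this drops every nonspacing mark, which is broader than a
    # hand-picked list of Urdu diacritics would be.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
```

Applied to a line of Urdu text, `normalize_urdu` returns the same line with digits and the listed punctuation in ASCII and without diacritics. Comparing its output with the provided `Bible-UR-normalized` and `Quran-UR-normalized` files would show how far this sketch is from the normalization actually used.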

Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)

Subject(s): parallel corpus, religious text, machine translation

Collection(s): LINDAT / CLARIAH-CZ Data & Tools

Files in this item

This item is Publicly Available and licensed under:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: en-ur-parallel-corpus.zip
Size: 3.51 MB
Format: application/zip
Description: Unknown
MD5: 8440be07c883b4c0289961ba577a634b
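
The MD5 value above can be used to check that a download is intact. The following is a minimal Python sketch using only the standard library. The helper names `md5_of_file` and `verify_download` are hypothetical, and the local path of the downloaded archive is whatever you saved it as.

```python
import hashlib

# Checksum as listed for en-ur-parallel-corpus.zip on this page.
EXPECTED_MD5 = "8440be07c883b4c0289961ba577a634b"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("en-ur-parallel-corpus.zip")` should return `True` for an uncorrupted copy saved under that name in the current directory.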

File Preview

bible
- test.ur (59 kB)
- Bible-UR (1 MB)
- dev.en (41 kB)
- train.ur (1 MB)
- dev.ur (65 kB)
- test.en (39 kB)
- Bible-EN (956 kB)
- Bible-UR-normalized (1 MB)
- train.en (875 kB)
quran
- test.ur (24 kB)
- Quran-EN (1 MB)
- dev.en (16 kB)
- train.ur (1 MB)
- Quran-UR (1 MB)
- Quran-UR-normalized (1 MB)
- dev.ur (23 kB)
- test.en (16 kB)
- train.en (1 MB)
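
The paired files in the listing (for example `train.en` and `train.ur`) suggest a line-aligned layout. The Python sketch below reads such a pair under several assumptions that this page does not confirm: one sentence per line, UTF-8 encoding, and line i of the English file aligned with line i of the Urdu file. The function name `read_parallel` and the extracted paths are hypothetical.

```python
from itertools import zip_longest


def read_parallel(src_path, tgt_path):
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumes one sentence per line and UTF-8 encoding in both files.
    Raises ValueError if the files have different numbers of lines.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"line counts differ at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")
```

If the archive extracts to `bible/` and `quran/` directories, `list(read_parallel("bible/train.en", "bible/train.ur"))` would give the training pairs for the Bible portion. A `ValueError` here would indicate that the line-alignment assumption does not hold for that pair of files.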