English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu language with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of crawled data include but are not limited to the following:
1- Manually corrected sentence alignment of the corpora.
2- Our data split (training-development-test) so that our published experiments can be reproduced.
3- Tokenization (optional, but needed to reproduce our experiments).
4- Normalization (optional) of e.g. European vs. Urdu numerals, European vs. Urdu punctuation, removal of Urdu diacritics.
THE LINDAT/CLARIN PROJECT (LM2015071 and CZ.02.1.01/0.0/0.0/16_013/0001781; formerly LM2010013) IS FULLY SUPPORTED BY THE MINISTRY OF EDUCATION, SPORTS AND YOUTH OF THE CZECH REPUBLIC UNDER THE PROGRAMME LM OF "LARGE INFRASTRUCTURES".
Copyright (c) 2018 UFAL MFF UK. All rights reserved.