Urdu Monolingual Corpus

Please use the following text to cite this item:
Jawaid, Bushra; Kamran, Amir and Bojar, Ondřej, 2014, Urdu Monolingual Corpus, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11858/00-097C-0000-0023-65A9-5.
Date issued
2014-03-22
Size
5464575 sentences
Language(s)
Urdu
Description
We release a sizeable monolingual Urdu corpus automatically tagged with part-of-speech tags. We extend the work of Jawaid and Bojar (2012), who use three different taggers and then apply a voting scheme to disambiguate among the choices suggested by each tagger. We run this complex ensemble on a large monolingual corpus and release both the plain and the tagged corpora.
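The voting step described above can be illustrated with a minimal sketch. It assumes plain per-token majority voting over three taggers, with a fallback to one designated tagger when all of them disagree. The exact scheme used by Jawaid and Bojar (2012) may differ, and the function name, the fallback rule, and the tag labels below are hypothetical.

```python
from collections import Counter


def vote_tags(predictions, fallback=0):
    """Combine per-token tags from several taggers by majority vote.

    predictions: list of tag sequences, one per tagger, all of equal length.
    fallback: index of the tagger to trust when no tag has a strict majority
              (an assumption of this sketch, not taken from the paper).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(tags) != length for tags in predictions):
        raise ValueError("all taggers must tag the same number of tokens")

    voted = []
    for token_tags in zip(*predictions):
        top_tag, top_count = Counter(token_tags).most_common(1)[0]
        if top_count * 2 > len(token_tags):
            # More than half of the taggers agree on this tag.
            voted.append(top_tag)
        else:
            # No strict majority: fall back to the designated tagger.
            voted.append(token_tags[fallback])
    return voted


if __name__ == "__main__":
    # Illustrative tag labels only; they are not the corpus tagset.
    tagger_a = ["NN", "VB", "JJ"]
    tagger_b = ["NN", "NN", "RB"]
    tagger_c = ["PN", "VB", "ADV"]
    print(vote_tags([tagger_a, tagger_b, tagger_c]))
```

With three taggers, a strict majority means at least two of them agree. The third token in the example has three different proposals, so the sketch keeps the tag of the fallback tagger.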
Acknowledgement
This item is Publicly Available
and licensed under:
Files in this item
Name
urdu-tagged-corpus.gz
Size
253.82 MB
Format
application/x-gzip
Description
gzip Archive
MD5
63d61d9ebae592598c41a6746ec9938b
Name
urdu-plain-text-corpus.gz
Size
213.46 MB
Format
application/x-gzip
Description
gzip Archive
MD5
100b1db9efd403ee677683b3268084d9
Name
urmono-lrec-2014.pdf
Size
152.86 KB
Format
application/pdf
Description
Adobe PDF
MD5
528b61b0dd860aff9e3fe8d9b3c31b80
Name
cleaning-tools.tar.gz
Size
748.74 KB
Format
application/x-gzip
Description
gzip Archive
MD5
469377de9bbb6f900a2322547d2566d8
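The MD5 values listed for each file can be used to check a download for corruption. The following is a minimal sketch, assuming the files were saved locally under the names shown above. The helper names md5_of_file and verify_download are hypothetical and not part of the repository or of cleaning-tools.tar.gz.

```python
import hashlib
import os

# Checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "urdu-tagged-corpus.gz": "63d61d9ebae592598c41a6746ec9938b",
    "urdu-plain-text-corpus.gz": "100b1db9efd403ee677683b3268084d9",
    "urmono-lrec-2014.pdf": "528b61b0dd860aff9e3fe8d9b3c31b80",
    "cleaning-tools.tar.gz": "469377de9bbb6f900a2322547d2566d8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=None):
    """Compare a local file against its expected MD5.

    If expected is None, the checksum is looked up by file name in
    EXPECTED_MD5 (raises KeyError for names not in the listing).
    """
    if expected is None:
        expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected.lower()
```

For example, verify_download("urdu-tagged-corpus.gz") returns True only if the local copy matches the listed checksum. Reading in chunks keeps memory use low for the larger archives.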