dc.contributor.author | Jawaid, Bushra |
dc.contributor.author | Kamran, Amir |
dc.contributor.author | Bojar, Ondřej |
dc.date.accessioned | 2014-03-27T15:41:35Z |
dc.date.available | 2014-03-27T15:41:35Z |
dc.date.issued | 2014-03-22 |
dc.identifier.uri | http://hdl.handle.net/11858/00-097C-0000-0023-65A9-5 |
dc.description | We release a sizeable monolingual Urdu corpus automatically tagged with part-of-speech tags. We extend the work of Jawaid and Bojar (2012), who use three different taggers and then apply a voting scheme to disambiguate among the choices suggested by each tagger. We run this complex ensemble on a large monolingual corpus and release both the plain and the tagged corpora. |
dc.description.sponsorship | This work is supported by the MosesCore project, sponsored by the European Commission’s Seventh Framework Programme (Grant Number 288487). |
dc.language.iso | urd |
dc.publisher | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
dc.relation | info:eu-repo/grantAgreement/EC/FP7/288487 |
dc.rights | Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/3.0/ |
dc.subject | Urdu |
dc.subject | monolingual data |
dc.subject | annotated data |
dc.subject | corpus |
dc.title | Urdu Monolingual Corpus |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContactInfo#PersonInfo.surname | Jawaid |
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName | Bushra |
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName | Charles University in Prague, UFAL |
metashare.ResourceInfo#DistributionInfo.availability | notAvailable |
metashare.ResourceInfo#ContentInfo.mediaType | text |
metashare.ResourceInfo#TextInfo#SizeInfo.size | 5464575 |
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit | sentences |
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo#CommunicationInfo.email | bushrajd84@hotmail.com |
metashare.ResourceInfo#ContentInfo.detailedType | other |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
sponsor | European Union FP7-ICT-2011-7-288487 MosesCore euFunds info:eu-repo/grantAgreement/EC/FP7/288487 |
size.info | 5464575 sentences |
files.size | 490897522 |
files.count | 4 |
featuredService.kontext | search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=urducorp_ur_m |
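The description above mentions three taggers combined by a voting scheme. The exact scheme of Jawaid and Bojar (2012) is not given in this record, so the following is only a minimal sketch of plain per-token majority voting. The function names, the tie-breaking rule (fall back to the first tagger) and the tag labels are all assumptions made for illustration.

```python
from collections import Counter


def vote(tags, fallback_index=0):
    """Pick one tag for a single token from several taggers' proposals.

    Assumption: the most frequent tag wins; if the top two tags are tied,
    the proposal of the tagger at `fallback_index` is used instead.
    """
    ranked = Counter(tags).most_common()
    top_tag, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return tags[fallback_index]
    return top_tag


def vote_sentence(tagger_outputs):
    """Combine tag sequences (one per tagger, equal length) token by token."""
    return [vote(list(column)) for column in zip(*tagger_outputs)]


# Hypothetical output of three taggers for a two-token sentence.
proposals = [
    ["NN", "VB"],
    ["NN", "JJ"],
    ["PN", "VB"],
]
print(vote_sentence(proposals))
```

A real ensemble would likely weight the taggers by their accuracy rather than treat them equally; the fallback rule here is only a placeholder for such a choice.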
Files in this item
Download all files in this item (468.16 MB)
Licence category: Publicly Available
Licence: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
- Name: urdu-tagged-corpus.gz
- Size: 253.82 MB
- Format: application/x-gzip
- Description: Urdu Monolingual Tagged Corpus
- MD5: 63d61d9ebae592598c41a6746ec9938b

- Name: urdu-plain-text-corpus.gz
- Size: 213.46 MB
- Format: application/x-gzip
- Description: Urdu Monolingual Plain Text Corpus
- MD5: 100b1db9efd403ee677683b3268084d9

- Name: urmono-lrec-2014.pdf
- Size: 152.86 KB
- Format: (not given)
- Description: Urdu data description
- MD5: 528b61b0dd860aff9e3fe8d9b3c31b80

- Name: cleaning-tools.tar.gz
- Size: 748.74 KB
- Format: application/x-gzip
- Description: Cleaning tools
- MD5: 469377de9bbb6f900a2322547d2566d8
- cleaning-tools
  - del_sentences_with_missing_spaces.pl (879 B)
  - detectLanguage.pl (1 kB)
  - filter_arabic_sentences.pl (619 B)
  - del_invalid_utf8.pl (417 B)
  - README (796 B)
  - remove_repeated_chars.pl (1 kB)
  - tok-dan.pl (1 kB)
  - remove_sindhi_sentences.pl (857 B)
  - detect_en_sentence.pl (440 B)
  - langfeatures.dat (3 MB)
  - convert-urNum-to-enNum.pl (754 B)
  - clean-corpus.sh (4 kB)
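
The MD5 values listed for the files above can be used to check a download for corruption. The sketch below is one way to do that with Python's standard `hashlib`; the helper names `md5_of_file` and `verify` and the assumption that the files sit in the current directory are illustrative and not part of the record.

```python
import hashlib

# Checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "urdu-tagged-corpus.gz": "63d61d9ebae592598c41a6746ec9938b",
    "urdu-plain-text-corpus.gz": "100b1db9efd403ee677683b3268084d9",
    "urmono-lrec-2014.pdf": "528b61b0dd860aff9e3fe8d9b3c31b80",
    "cleaning-tools.tar.gz": "469377de9bbb6f900a2322547d2566d8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected_hex.lower()


if __name__ == "__main__":
    # Assumption: the downloaded files are in the current directory.
    for name, expected in EXPECTED_MD5.items():
        try:
            status = "OK" if verify(name, expected) else "MISMATCH"
        except FileNotFoundError:
            status = "not found"
        print(f"{name}: {status}")
```

Reading in chunks keeps memory use low for the two corpus archives, which are each over 200 MB. MD5 is adequate for detecting accidental corruption but is not a defence against deliberate tampering.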