Urdu Monolingual Corpus

Name: Urdu Monolingual Corpus
License: http://creativecommons.org/licenses/by-nc-sa/3.0/

Jawaid, Bushra; Kamran, Amir; Bojar, Ondřej

dc.contributor.author	Jawaid, Bushra
dc.contributor.author	Kamran, Amir
dc.contributor.author	Bojar, Ondřej
dc.date.accessioned	2014-03-27T15:41:35Z
dc.date.available	2014-03-27T15:41:35Z
dc.date.issued	2014-03-22
dc.identifier.uri	http://hdl.handle.net/11858/00-097C-0000-0023-65A9-5
dc.description	We release a sizeable monolingual Urdu corpus automatically tagged with part-of-speech tags. We extend the work of Jawaid and Bojar (2012), who use three different taggers and then apply a voting scheme to disambiguate among the choices suggested by each tagger. We run this ensemble on a large monolingual corpus and release both the plain and the tagged corpora.
dc.description.sponsorship	This work is supported by the MosesCore project sponsored by the European Commission’s Seventh Framework Programme (Grant Number 288487).
dc.language.iso	urd
dc.publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation	info:eu-repo/grantAgreement/EC/FP7/288487
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/3.0/
dc.subject	Urdu
dc.subject	monolingual data
dc.subject	annotated data
dc.subject	corpus
dc.title	Urdu Monolingual Corpus
dc.type	lexicalConceptualResource
metashare.ResourceInfo#ContactInfo#PersonInfo.surname	Jawaid
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName	Bushra
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName	Charles University in Prague, UFAL
metashare.ResourceInfo#DistributionInfo.availability	notAvailable
metashare.ResourceInfo#ContentInfo.mediaType	text
metashare.ResourceInfo#TextInfo#SizeInfo.size	5464575
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit	sentences
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo#CommunicationInfo.email	bushrajd84@hotmail.com
metashare.ResourceInfo#ContentInfo.detailedType	other
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
sponsor	European Union FP7-ICT-2011-7-288487 MosesCore euFunds info:eu-repo/grantAgreement/EC/FP7/288487
size.info	5464575 sentences
files.size	490897522
files.count	4
featuredService.kontext	search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=urducorp_ur_m

Files in this item

License category:

Publicly Available

License: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)

Name: urdu-tagged-corpus.gz
Size: 253.82 MB
Format: application/x-gzip
Description: Urdu Monolingual Tagged Corpus
MD5: 63d61d9ebae592598c41a6746ec9938b

Name: urdu-plain-text-corpus.gz
Size: 213.46 MB
Format: application/x-gzip
Description: Urdu Monolingual Plain Text Corpus
MD5: 100b1db9efd403ee677683b3268084d9

Name: urmono-lrec-2014.pdf
Size: 152.86 KB
Format: PDF
Description: Urdu data description
MD5: 528b61b0dd860aff9e3fe8d9b3c31b80

Name: cleaning-tools.tar.gz
Size: 748.74 KB
Format: application/x-gzip
Description: Cleaning tools
MD5: 469377de9bbb6f900a2322547d2566d8

File preview

cleaning-tools
- del_sentences_with_missing_spaces.pl (879 B)
- detectLanguage.pl (1 kB)
- filter_arabic_sentences.pl (619 B)
- del_invalid_utf8.pl (417 B)
- README (796 B)
- remove_repeated_chars.pl (1 kB)
- tok-dan.pl (1 kB)
- remove_sindhi_sentences.pl (857 B)
- detect_en_sentence.pl (440 B)
- langfeatures.dat (3 MB)
- convert-urNum-to-enNum.pl (754 B)
- clean-corpus.sh (4 kB)
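
The MD5 digests listed for each file can be used to check that a download completed intact. The following is a minimal Python sketch, not part of the record or of the cleaning tools. It assumes the four files were saved under their listed names in one directory; the function names md5_of_file and verify_downloads are hypothetical helpers introduced here for illustration.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing in this record.
EXPECTED_MD5 = {
    "urdu-tagged-corpus.gz": "63d61d9ebae592598c41a6746ec9938b",
    "urdu-plain-text-corpus.gz": "100b1db9efd403ee677683b3268084d9",
    "urmono-lrec-2014.pdf": "528b61b0dd860aff9e3fe8d9b3c31b80",
    "cleaning-tools.tar.gz": "469377de9bbb6f900a2322547d2566d8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True (match), False (mismatch) or None (absent).

    Assumption: files are stored in `directory` under the names shown in the record.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, status in verify_downloads().items():
        print(f"{name}: {labels[status]}")
```

Reading in chunks keeps memory use flat, which matters for the two corpus archives of over 200 MB each. A MISMATCH most likely indicates a truncated or corrupted download, in which case the file should be fetched again.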
