
 
dc.contributor.author Jawaid, Bushra
dc.contributor.author Kamran, Amir
dc.contributor.author Bojar, Ondřej
dc.date.accessioned 2014-03-27T15:41:35Z
dc.date.available 2014-03-27T15:41:35Z
dc.date.issued 2014-03-22
dc.identifier.uri http://hdl.handle.net/11858/00-097C-0000-0023-65A9-5
dc.description We release a sizeable monolingual Urdu corpus automatically tagged with part-of-speech tags. We extend the work of Jawaid and Bojar (2012), who use three different taggers and then apply a voting scheme to disambiguate among the choices suggested by each tagger. We run this ensemble on a large monolingual corpus and release both the plain and the tagged corpora.
dc.description.sponsorship This work is supported by the MosesCore project, sponsored by the European Commission’s Seventh Framework Programme (Grant Number 288487).
dc.language.iso urd
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation info:eu-repo/grantAgreement/EC/FP7/288487
dc.rights Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/3.0/
dc.subject Urdu
dc.subject monolingual data
dc.subject annotated data
dc.subject corpus
dc.title Urdu Monolingual Corpus
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContactInfo#PersonInfo.surname Jawaid
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName Bushra
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName Charles University in Prague, UFAL
metashare.ResourceInfo#DistributionInfo.availability notAvailable
metashare.ResourceInfo#ContentInfo.mediaType text
metashare.ResourceInfo#TextInfo#SizeInfo.size 5464575
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit sentences
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo#CommunicationInfo.email bushrajd84@hotmail.com
metashare.ResourceInfo#ContentInfo.detailedType other
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
sponsor European Union FP7-ICT-2011-7-288487 MosesCore euFunds info:eu-repo/grantAgreement/EC/FP7/288487
size.info 5464575 sentences
files.size 490897522
files.count 4
featuredService.kontext search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=urducorp_ur_m
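
The description above mentions a voting scheme over the output of three taggers. The record does not spell out that scheme, so the following is only a minimal sketch of plain majority voting, not the method of Jawaid and Bojar (2012). The function names, the tag labels in the example, and the tie-breaking rule (fall back to one designated tagger when all taggers disagree) are assumptions made for illustration.

```python
from collections import Counter


def vote(tags, fallback_index=0):
    """Return the tag proposed by a strict majority of taggers.

    If no tag has a strict majority, fall back to the tagger at
    fallback_index. This tie-breaking rule is an assumption of the sketch.
    """
    tag, count = Counter(tags).most_common(1)[0]
    if count * 2 > len(tags):
        return tag
    return tags[fallback_index]


def combine(tagger_outputs, fallback_index=0):
    """Combine per-token tags from several taggers.

    tagger_outputs holds one tag sequence per tagger; all sequences are
    expected to cover the same tokens in the same order.
    """
    return [vote(column, fallback_index) for column in zip(*tagger_outputs)]


if __name__ == "__main__":
    # Hypothetical tags for a three-token sentence, one list per tagger.
    outputs = [
        ["NN", "VB", "JJ"],
        ["NN", "NN", "VB"],
        ["PN", "VB", "NN"],
    ]
    print(combine(outputs))
```

With three taggers, a strict majority means at least two taggers agree; the third token in the example has three different proposals, so the sketch keeps the first tagger's choice.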


 Files in this item

 Download all files in this item (468.16 MB)
License category:
Publicly Available

Licence: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
Distributed under Creative Commons Attribution Required Noncommercial Share Alike
Name: urdu-tagged-corpus.gz
Size: 253.82 MB
Format: application/x-gzip
Description: Urdu Monolingual Tagged Corpus
MD5: 63d61d9ebae592598c41a6746ec9938b
Name: urdu-plain-text-corpus.gz
Size: 213.46 MB
Format: application/x-gzip
Description: Urdu Monolingual Plain Text Corpus
MD5: 100b1db9efd403ee677683b3268084d9
Name: urmono-lrec-2014.pdf
Size: 152.86 KB
Format: PDF
Description: Urdu data description
MD5: 528b61b0dd860aff9e3fe8d9b3c31b80
Name: cleaning-tools.tar.gz
Size: 748.74 KB
Format: application/x-gzip
Description: Cleaning tools
MD5: 469377de9bbb6f900a2322547d2566d8
File preview:
  • cleaning-tools
    • del_sentences_with_missing_spaces.pl (879 B)
    • detectLanguage.pl (1 kB)
    • filter_arabic_sentences.pl (619 B)
    • del_invalid_utf8.pl (417 B)
    • README (796 B)
    • remove_repeated_chars.pl (1 kB)
    • tok-dan.pl (1 kB)
    • remove_sindhi_sentences.pl (857 B)
    • detect_en_sentence.pl (440 B)
    • langfeatures.dat (3 MB)
    • convert-urNum-to-enNum.pl (754 B)
    • clean-corpus.sh (4 kB)
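
The MD5 values listed for the four files can be used to check that a download arrived intact. The sketch below assumes the files were saved under their listed names in a single directory; the helper names `md5_of_file` and `verify_downloads` are hypothetical and not part of the released cleaning tools.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "urdu-tagged-corpus.gz": "63d61d9ebae592598c41a6746ec9938b",
    "urdu-plain-text-corpus.gz": "100b1db9efd403ee677683b3268084d9",
    "urmono-lrec-2014.pdf": "528b61b0dd860aff9e3fe8d9b3c31b80",
    "cleaning-tools.tar.gz": "469377de9bbb6f900a2322547d2566d8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True (match), False (mismatch),
    or None (file not found in the directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of_file(path) == expected
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, ok in verify_downloads().items():
        print(f"{name}: {labels[ok]}")
```

Reading in chunks keeps memory use small, which matters for the two corpus archives of more than 200 MB each.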
