dc.contributor.author Abdi Khojasteh, Hadi
dc.contributor.author Ansari, Ebrahim
dc.contributor.author Bohlouli, Mahdi
dc.date.accessioned 2020-03-18T10:44:43Z
dc.date.available 2020-03-18T10:44:43Z
dc.date.issued 2020-02-02
dc.identifier.uri http://hdl.handle.net/11234/1-3195
dc.description "Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets, annotated with syntactic dependency relations, part-of-speech tags, and sentiment polarity, together with automatic translations of the original Persian sentences into five languages (EN, CS, DE, IT, HI).
dc.language.iso fas
dc.language.iso eng
dc.language.iso deu
dc.language.iso ces
dc.language.iso ita
dc.language.iso hin
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.publisher Institute for Advanced Studies in Basic Sciences (IASBS)
dc.relation.isreferencedby https://arxiv.org/abs/2003.06499
dc.rights Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.source.uri https://iasbs.ac.ir/~ansari/lscp/
dc.subject PoS tagging
dc.subject corpus
dc.subject annotated corpus
dc.subject multilingual
dc.subject derivation
dc.subject dependency parser
dc.subject machine translation
dc.subject informal language
dc.subject spoken language
dc.subject monolingual corpus
dc.subject bilingual corpus annotation
dc.title Large-Scale Colloquial Persian 0.5
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Ebrahim Ansari ansari@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
contact.person Ebrahim Ansari ansari@iasbs.ac.ir Institute for Advanced Studies in Basic Sciences (IASBS)
contact.person Hadi Abdi Khojasteh hadiabdikhojasteh@gmail.com Institute for Advanced Studies in Basic Sciences (IASBS)
sponsor Czech Science Foundation 19-26934X Neural Representations in Multi-modal and Multi-lingual Modelling nationalFunds
size.info 120000000 sentences
size.info 19.6 GB
files.size 2669457416
files.count 9


 Files in this item

Name                            Size       Format   Description                              MD5
README.md                       6.04 KB    Unknown  readme                                   7cdb63dc4bf1038fbe132fd3234b0efd
lscp-0.5-fa.7z                  378.04 MB  Unknown  Persian - Monolingual Corpus             5eba07bcf2b644a41f2c52e00d1fd61c
lscp-0.5-fa-normalized.7z       328.79 MB  Unknown  Persian - Normalized Monolingual Corpus  77efc113f976acd561fa363b2ec676c8
lscp-0.5-fa-derivation-tree.7z  502.63 MB  Unknown  Persian - Derivation Tree                641ce73575dd03c361e9be61ba909a29
lscp-0.5-fa-cs.7z               256.74 MB  Unknown  Persian-Czech - Bilingual Corpus         aa17ea609dd9f953632ab12baca868a7
lscp-0.5-fa-en.7z               229.74 MB  Unknown  Persian-English - Bilingual Corpus       ecdd2400df014b3f7cc6671567fdb93a
lscp-0.5-fa-de.7z               277.34 MB  Unknown  Persian-German - Bilingual Corpus        6d773a7b71420715d9d22d49d1e9b671
lscp-0.5-fa-it.7z               269.64 MB  Unknown  Persian-Italian - Bilingual Corpus       9d0b506edd6bc03d0b2f95c5142145e3
lscp-0.5-fa-hi.7z               302.86 MB  Unknown  Persian-Hindi - Bilingual Corpus         2c522644a47f2b1c6eaff1edc8730ec4
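
The MD5 values listed for each file can be used to check that a download arrived intact. The following is a minimal sketch, not part of the record itself: it assumes the files were saved under their listed names in a single local directory, and the names `EXPECTED_MD5`, `md5_of_file`, and `verify_downloads` are hypothetical helpers introduced only for illustration.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing of this record.
EXPECTED_MD5 = {
    "README.md": "7cdb63dc4bf1038fbe132fd3234b0efd",
    "lscp-0.5-fa.7z": "5eba07bcf2b644a41f2c52e00d1fd61c",
    "lscp-0.5-fa-normalized.7z": "77efc113f976acd561fa363b2ec676c8",
    "lscp-0.5-fa-derivation-tree.7z": "641ce73575dd03c361e9be61ba909a29",
    "lscp-0.5-fa-cs.7z": "aa17ea609dd9f953632ab12baca868a7",
    "lscp-0.5-fa-en.7z": "ecdd2400df014b3f7cc6671567fdb93a",
    "lscp-0.5-fa-de.7z": "6d773a7b71420715d9d22d49d1e9b671",
    "lscp-0.5-fa-it.7z": "9d0b506edd6bc03d0b2f95c5142145e3",
    "lscp-0.5-fa-hi.7z": "2c522644a47f2b1c6eaff1edc8730ec4",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Map each listed file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Assumption: the archives sit in the current working directory.
    for name, status in verify_downloads(".").items():
        print(f"{name}: {status}")
```

A file reported as `mismatch` was most likely truncated or corrupted in transit and should be downloaded again before extraction. MD5 is adequate here as an integrity check against accidental corruption; it is not a defence against deliberate tampering.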
