dc.contributor.author | Abdi Khojasteh, Hadi |
dc.contributor.author | Ansari, Ebrahim |
dc.contributor.author | Bohlouli, Mahdi |
dc.date.accessioned | 2020-03-18T10:44:43Z |
dc.date.available | 2020-03-18T10:44:43Z |
dc.date.issued | 2020-02-02 |
dc.identifier.uri | http://hdl.handle.net/11234/1-3195 |
dc.description | "Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in a semantic taxonomy that treats multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets, annotated with syntactic dependency relations, part-of-speech tags, and sentiment polarity, together with automatic translations of the original Persian sentences into five languages (EN, CS, DE, IT, HI). |
dc.language.iso | fas |
dc.language.iso | eng |
dc.language.iso | deu |
dc.language.iso | ces |
dc.language.iso | ita |
dc.language.iso | hin |
dc.publisher | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
dc.publisher | Institute for Advanced Studies in Basic Sciences (IASBS) |
dc.relation.isreferencedby | https://arxiv.org/abs/2003.06499 |
dc.rights | Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
dc.source.uri | https://iasbs.ac.ir/~ansari/lscp/ |
dc.subject | PoS tagging |
dc.subject | corpus |
dc.subject | annotated corpus |
dc.subject | multilingual |
dc.subject | derivation |
dc.subject | dependency parser |
dc.subject | machine translation |
dc.subject | informal language |
dc.subject | spoken language |
dc.subject | monolingual corpus |
dc.subject | bilingual corpus annotation |
dc.title | Large-Scale Colloquial Persian 0.5 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Ebrahim Ansari ansari@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
contact.person | Ebrahim Ansari ansari@iasbs.ac.ir Institute for Advanced Studies in Basic Sciences (IASBS) |
contact.person | Hadi Abdi Khojasteh hadiabdikhojasteh@gmail.com Institute for Advanced Studies in Basic Sciences (IASBS) |
sponsor | Czech Science Foundation 19-26934X Neural Representations in Multi-modal and Multi-lingual Modelling nationalFunds |
size.info | 120000000 sentences |
size.info | 19.6 GB |
files.size | 2669457416 |
files.count | 9 |
Files in this item
License category:
License: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Publicly Available
Name | Size | Format | Description | MD5 |
README.md | 6.04 KB | Unknown | readme | 7cdb63dc4bf1038fbe132fd3234b0efd |
lscp-0.5-fa.7z | 378.04 MB | Unknown | Persian - Monolingual Corpus | 5eba07bcf2b644a41f2c52e00d1fd61c |
lscp-0.5-fa-normalized.7z | 328.79 MB | Unknown | Persian - Normalized Monolingual Corpus | 77efc113f976acd561fa363b2ec676c8 |
lscp-0.5-fa-derivation-tree.7z | 502.63 MB | Unknown | Persian - Derivation Tree | 641ce73575dd03c361e9be61ba909a29 |
lscp-0.5-fa-cs.7z | 256.74 MB | Unknown | Persian-Czech - Bilingual Corpus | aa17ea609dd9f953632ab12baca868a7 |
lscp-0.5-fa-en.7z | 229.74 MB | Unknown | Persian-English - Bilingual Corpus | ecdd2400df014b3f7cc6671567fdb93a |
lscp-0.5-fa-de.7z | 277.34 MB | Unknown | Persian-German - Bilingual Corpus | 6d773a7b71420715d9d22d49d1e9b671 |
lscp-0.5-fa-it.7z | 269.64 MB | Unknown | Persian-Italian - Bilingual Corpus | 9d0b506edd6bc03d0b2f95c5142145e3 |
lscp-0.5-fa-hi.7z | 302.86 MB | Unknown | Persian-Hindi - Bilingual Corpus | 2c522644a47f2b1c6eaff1edc8730ec4 |
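
The MD5 values in the file listing above can be used to check that a download arrived intact. The following is a minimal sketch in Python, not part of the record itself: the function names `md5_of_file` and `verify_downloads` and the `downloads` directory are hypothetical, and it assumes the archives were saved locally under the file names shown in the listing. Only two of the nine checksums are copied in for brevity.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing above (two of nine shown; extend as needed).
EXPECTED_MD5 = {
    "README.md": "7cdb63dc4bf1038fbe132fd3234b0efd",
    "lscp-0.5-fa.7z": "5eba07bcf2b644a41f2c52e00d1fd61c",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-hundred-MB archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == checksum
    return results


if __name__ == "__main__":
    # "downloads" is a hypothetical local directory holding the fetched files.
    for name, ok in verify_downloads("downloads").items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

A file reported as missing or mismatched is best downloaded again before extraction; MD5 here only guards against transfer corruption, not deliberate tampering.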