
PARSEME corpora annotated for verbal multiword expressions (version 1.3)

Please use the following text to cite this item:
Savary, Agata; et al., 2023, PARSEME corpora annotated for verbal multiword expressions (version 1.3), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11372/LRT-5124.
Date issued
2023-05-10
Size
455629 sentences,
9264811 tokens,
127498 multiWordUnits
Description
This multilingual resource contains corpora in which verbal multiword expressions (VMWEs) have been manually annotated. VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do). This is the first release of the corpora without an associated shared task. The previous version (1.2) was associated with the PARSEME Shared Task on Semi-supervised Identification of Verbal MWEs (2020). The data covers 26 languages, combining the corpora from the three previous editions (1.0, 1.1 and 1.2). VMWEs were annotated according to the universal guidelines. The corpora are provided in the cupt format, inspired by the CoNLL-U format. Morphological and syntactic information, including parts of speech, lemmas, morphological features and/or syntactic dependencies, is also provided. Depending on the language, this information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). All corpora are split into training, development and test data, following the splitting strategy adopted for the PARSEME Shared Task 1.2. The annotation guidelines are available online: https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3 The .cupt format is detailed here: https://multiword.sourceforge.net/cupt-format/
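As a rough illustration of how the cupt files described above might be read, the following Python sketch splits a file into sentences and groups tokens by VMWE. It rests on several assumptions that should be checked against the format description linked above: columns are tab-separated, the VMWE annotation is in the 11th column (PARSEME:MWE), `*` marks a token outside any VMWE, `_` marks an unannotated token, and annotated tokens carry codes such as `1:LVC.full` on the first token of an expression and `1` on later ones, with several codes joined by `;`. The function names and the sample sentence are invented for illustration and do not come from the corpora.

```python
from collections import defaultdict


def read_cupt_sentences(lines):
    """Yield one sentence at a time as a list of token rows (lists of columns)."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment/metadata lines are skipped in this sketch.
            continue
        else:
            sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def extract_vmwes(sentence, mwe_col=10, form_col=1):
    """Return {vmwe_id: (category, [token forms])} for one sentence.

    Assumes the VMWE column uses '*' or '_' for unmarked tokens and
    ';'-separated codes such as '1:LVC.full' or '1' otherwise.
    """
    categories = {}
    members = defaultdict(list)
    for row in sentence:
        tag = row[mwe_col]
        if tag in ("*", "_"):
            continue
        for part in tag.split(";"):
            mwe_id, _, category = part.partition(":")
            if category:
                categories[mwe_id] = category
            members[mwe_id].append(row[form_col])
    return {i: (categories.get(i), forms) for i, forms in members.items()}


# Invented sample sentence, not taken from the corpora.
SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME:MWE",
    "# text = She made a decision",
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\t*",
    "2\tmade\tmake\tVERB\t_\t_\t0\troot\t_\t_\t1:LVC.full",
    "3\ta\ta\tDET\t_\t_\t4\tdet\t_\t_\t*",
    "4\tdecision\tdecision\tNOUN\t_\t_\t2\tobj\t_\t_\t1",
    "",
])

if __name__ == "__main__":
    for sent in read_cupt_sentences(SAMPLE.splitlines()):
        print(extract_vmwes(sent))
```

For a real file, the same generator could be fed an open file handle, for example `read_cupt_sentences(open(path, encoding="utf-8"))`, where `path` points at a .cupt file unpacked from one of the language archives.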
Publisher
This item is Publicly Available
and licensed under:
 Files in this item
Name
HI.tgz
Size
469.3 KB
Format
application/x-gzip
Description
Hindi files
MD5
1d8dbf79b80326f797d517f3f993d04d
Preview
  • HI.tgz (3 MB)
Name
ES.tgz
Size
2.09 MB
Format
application/x-gzip
Description
Spanish files
MD5
588b050f3cd655d1dd6df000b0d702da
Preview
  • ES.tgz (11 MB)
Name
EL.tgz
Size
10.6 MB
Format
application/x-gzip
Description
Greek files
MD5
125789048de3a0ee764cc3d9f34bc854
Preview
  • EL.tgz (63 MB)
Name
AR.tgz
Size
10.78 MB
Format
application/x-gzip
Description
Arabic files
MD5
73fe213c348928f5eb49a635a6f02a01
Preview
  • AR.tgz (48 MB)
Name
GA.tgz
Size
494.12 KB
Format
application/x-gzip
Description
Irish files
MD5
cb2b193f7ce5bd60a77ba55efbd8232f
Preview
  • GA.tgz (2 MB)
Name
EU.tgz
Size
2.02 MB
Format
application/x-gzip
Description
Basque files
MD5
5b9d3da6fcdce7e800b1c1ea07eb6ef1
Preview
  • EU.tgz (11 MB)
Name
FA.tgz
Size
703.09 KB
Format
application/x-gzip
Description
Farsi files
MD5
d0459becd9d685241b241384ec79ad57
Preview
  • FA.tgz (3 MB)
Name
DE.tgz
Size
2.25 MB
Format
application/x-gzip
Description
German files
MD5
eaee4a615ce4abd74aab58ea72d5c12e
Preview
  • DE.tgz (12 MB)
Name
BG.tgz
Size
6.48 MB
Format
application/x-gzip
Description
Bulgarian files
MD5
7ccee1056d5621a9b509cf727a678525
Preview
  • BG.tgz (41 MB)
Name
EN.tgz
Size
1.59 MB
Format
application/x-gzip
Description
English files
MD5
b8c356eefeb174e0984f6c7b1188dba9
Preview
  • EN.tgz (6 MB)
Name
CS.tgz
Size
12.86 MB
Format
application/x-gzip
Description
Czech files
MD5
9fe9764dc970e2c646049533a81ccda6
Preview
  • CS.tgz (81 MB)
Name
HU.tgz
Size
1.88 MB
Format
application/x-gzip
Description
Hungarian files
MD5
a1153a044795ee7a9151e0ad2f9e25c1
Preview
  • HU.tgz (11 MB)
Name
FR.tgz
Size
6.12 MB
Format
application/x-gzip
Description
French files
MD5
755009c7e5ba96e74cedc14ec802eb2b
Preview
  • FR.tgz (33 MB)
Name
HE.tgz
Size
5.26 MB
Format
application/x-gzip
Description
Hebrew files
MD5
f2e883e1a108a3888fb2628d769b9c3c
Preview
  • HE.tgz (28 MB)
Name
HR.tgz
Size
1.98 MB
Format
application/x-gzip
Description
Croatian files
MD5
951cd6b5948ee8e1aa6a9a4a8bf41336
Preview
  • HR.tgz (10 MB)
Name
IT.tgz
Size
4.67 MB
Format
application/x-gzip
Description
Italian files
MD5
565fb5c73667b4ac55e8aacf20680501
Preview
  • IT.tgz (23 MB)
Name
LT.tgz
Size
2.98 MB
Format
application/x-gzip
Description
Lithuanian files
MD5
8f94517eebae1216e80ea6effc97a91a
Preview
  • LT.tgz (17 MB)
Name
MT.tgz
Size
2.78 MB
Format
application/x-gzip
Description
Maltese files
MD5
12ee7b2105eeac324386c859a7ef7816
Preview
  • MT.tgz (12 MB)
Name
PL.tgz
Size
6.99 MB
Format
application/x-gzip
Description
Polish files
MD5
bdae0922e513f36c000b47360980ffc9
Preview
  • PL.tgz (40 MB)
Name
PT.tgz
Size
7.59 MB
Format
application/x-gzip
Description
Portuguese files
MD5
2c96f436546787f976e20a2022abf516
Preview
  • PT.tgz (37 MB)
Name
RO.tgz
Size
12.33 MB
Format
application/x-gzip
Description
Romanian files
MD5
7efcbd0b9902d925c11f014b6ccd3c18
Preview
  • RO.tgz (74 MB)
Name
TR.tgz
Size
4.55 MB
Format
application/x-gzip
Description
Turkish files
MD5
1c36bfd64fba1d93f9deca35e3272ed1
Preview
  • TR.tgz (27 MB)
Name
SV.tgz
Size
1.44 MB
Format
application/x-gzip
Description
Swedish files
MD5
5c71eb09a2bb773b21141a13e8e40a88
Preview
  • SV.tgz (8 MB)
Name
ZH.tgz
Size
9.61 MB
Format
application/x-gzip
Description
Chinese files
MD5
362b4150e0fda49a0915130bc85a6712
Preview
  • ZH.tgz (38 MB)
Name
SR.tgz
Size
1.11 MB
Format
application/x-gzip
Description
Serbian files
MD5
0ad8cad8ca462ea837445d2166bc722a
Preview
  • SR.tgz (6 MB)
Name
SL.tgz
Size
8.35 MB
Format
application/x-gzip
Description
Slovenian files
MD5
6933ab467e6bef5e52d0656075e42618
Preview
  • SL.tgz (45 MB)
Name
README.md
Size
7.08 KB
Format
application/octet-stream
Description
General README file
MD5
5902de46b35f82c79183b20d67ab13de
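The MD5 values listed for each file can be used to check that a download arrived intact. The following Python sketch shows one way to do this with the standard `hashlib` module. The helper names (`md5_of_file`, `verify`) and the local file paths are hypothetical, and only three of the checksums from the listing are copied into the table to keep the example short.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums copied from the file listing (subset only).
EXPECTED_MD5 = {
    "HI.tgz": "1d8dbf79b80326f797d517f3f993d04d",
    "EN.tgz": "b8c356eefeb174e0984f6c7b1188dba9",
    "README.md": "5902de46b35f82c79183b20d67ab13de",
}


def verify(path, name):
    """Return True if the file at `path` matches the listed MD5 for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

A call such as `verify("downloads/EN.tgz", "EN.tgz")` (with a path of your choosing) would return `True` only if the local file matches the listed checksum. MD5 is suitable here as an integrity check against corrupted downloads, not as protection against deliberate tampering.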