
Annotated corpora and tools of the PARSEME Shared Task : Subtask 1, MWE Identification (edition 2.0)

Please use the following text to cite this item or export to a predefined format:
Savary, Agata; et al., 2026, Annotated corpora and tools of the PARSEME Shared Task : Subtask 1, MWE Identification (edition 2.0), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11372/LRT-6123.
Authors
Date issued
2026-03-06
Size
4,884,449 tokens,
253,643 sentences,
141,570 multiWordUnits
Description
This multilingual resource contains corpora in which multiword expressions (MWEs) of all syntactic categories have been manually annotated. While previous editions covered only verbal MWEs, this version extends the annotation scope to all syntactic MWE categories: verbal, nominal, adjectival, adverbial and functional. This release of the corpora is associated with the PARSEME 2.0 Multilingual Shared Task on Identification and Paraphrasing of Multiword Expressions. It corresponds to the data used in subtask 1, on MWE identification. The data covers 17 languages, of which 7 are new with respect to previous PARSEME editions. The annotation process is based on cross-lingually unified guidelines, phrased as decision diagrams over linguistic tests, and a typology of 18 MWE categories. The corpus contains almost 5 million tokens, over 250,000 sentences and over 140,000 MWE annotations. The corpora are provided in the CUPT format, inspired by the CoNLL-U format. Morphological and syntactic information, including parts of speech, lemmas, morphological features and/or syntactic dependencies, is also provided. Depending on the language, this information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). All corpora are split into training, development and test data, following the splitting strategy adopted for the PARSEME 2.0 Shared Task, subtask 1. This release includes the test data, which was kept secret from participants during the evaluation phase.
The annotation guidelines are available online: https://parsemefr.lis-lab.fr/parseme-st-guidelines/2.0
The CUPT format is detailed here: https://gitlab.com/parseme/corpora/-/wikis/CUPT-format

Reference
-------
When referring to this release, please cite:
* Agata Savary, Manon Scholivet, Carlos Ramisch, Takuya Nakamura, Eric Bilinski, Sara Stymne, Voula Giouli, Stella Markantonatou, Vasile Păiș, Maria Mitrofan, Louis Estève, Bruno Guillaume, Verginica Barbu Mititelu, Jaka Čibej, Roberto A. Díaz Hernández, Victoria Fendel, Polona Gantar, Olha Kanishcheva, Cvetana Krstev, Chaya Liebeskind, Irina Lobzhanidze, Aleksandra Marković, Gunta Nešpore-Bērzkalne, Adriana Pagano, Mehrnoush Shamsfard, Ranka Stanković, Vahide Tajalli, Carole Tiberius, Aakanksha Padhye (2026), _PARSEME 2.0 multilingual corpus of multiword expressions_, in Proceedings of LREC 2026, ELRA, Palma de Mallorca, Spain.
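As an illustration of how the `.cupt` files might be consumed, here is a minimal Python sketch of a reader. It assumes the layout used in earlier PARSEME releases: tab-separated columns, an optional `# global.columns = ...` header line, blank lines between sentences, and a final `PARSEME:MWE` column holding `*` (no MWE), `_` (unspecified), or `;`-separated codes such as `1:VID` on the first token of an MWE and a bare `1` on its other tokens. The function names, the sample sentence and the `VID` label are hypothetical; the CUPT wiki linked above remains the authoritative description.

```python
# Minimal CUPT reader sketch (assumed layout, see the lead-in above).
DEFAULT_COLUMNS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS",
    "HEAD", "DEPREL", "DEPS", "MISC", "PARSEME:MWE",
]

# Hypothetical sample; real files come from the <LANG>.zip archives.
SAMPLE = """# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME:MWE
# text = He kicked the bucket
1\tHe\the\tPRON\t_\t_\t2\tnsubj\t_\t_\t*
2\tkicked\tkick\tVERB\t_\t_\t0\troot\t_\t_\t1:VID
3\tthe\tthe\tDET\t_\t_\t4\tdet\t_\t_\t1
4\tbucket\tbucket\tNOUN\t_\t_\t2\tobj\t_\t_\t1
"""


def parse_mwe_column(value):
    """Turn one PARSEME:MWE cell into a list of (mwe_id, category_or_None)."""
    if value in ("*", "_", ""):
        return []
    codes = []
    for part in value.split(";"):
        mwe_id, _, category = part.partition(":")
        codes.append((int(mwe_id), category or None))
    return codes


def read_cupt(lines):
    """Yield one dict per sentence with its tokens and grouped MWEs."""
    columns = DEFAULT_COLUMNS
    tokens, mwes = [], {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence id, text, ...)
        elif not line.strip():
            if tokens:  # a blank line closes the current sentence
                yield {"tokens": tokens, "mwes": mwes}
            tokens, mwes = [], {}
        else:
            token = dict(zip(columns, line.split("\t")))
            tokens.append(token)
            for mwe_id, category in parse_mwe_column(token.get("PARSEME:MWE", "*")):
                entry = mwes.setdefault(mwe_id, {"category": None, "token_ids": []})
                if category:
                    entry["category"] = category
                entry["token_ids"].append(token["ID"])
    if tokens:  # file without a trailing blank line
        yield {"tokens": tokens, "mwes": mwes}


if __name__ == "__main__":
    for sentence in read_cupt(SAMPLE.splitlines()):
        for mwe_id, mwe in sentence["mwes"].items():
            print(mwe_id, mwe["category"], mwe["token_ids"])
```

To read a real file, pass an open file handle instead of `SAMPLE.splitlines()`, for example `read_cupt(open("FR/train.cupt", encoding="utf-8"))`, assuming the archive has been unpacked into the current directory.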
Publisher
Acknowledgement
This item is Publicly Available
and licensed under:
 Files in this item
Name
EGY.zip
Size
364.47 KB
Format
application/zip
Description
MD5
f73fc659add9982731375f85b9ad837c
Preview
  File Preview
  • EGY
    • total-stats.md (334 B)
    • dev-stats.md (190 B)
    • README.md (5 kB)
    • train-stats.md (310 B)
    • dev.cupt (52 kB)
    • test-stats.md (338 B)
    • train.cupt (469 kB)
    • test.cupt (1 MB)
Name
EL.zip
Size
1005.1 KB
Format
application/zip
Description
MD5
3aafe0bcc02a02fb5f27968e999afee4
Preview
  File Preview
  • EL
    • total-stats.md (329 B)
    • dev-stats.md (222 B)
    • README.md (9 kB)
    • train-stats.md (285 B)
    • dev.cupt (365 kB)
    • test-stats.md (291 B)
    • train.cupt (3 MB)
    • test.cupt (2 MB)
Name
FA.zip
Size
875.67 KB
Format
application/zip
Description
MD5
dea48c77035c6b6c6ee529c3a1328516
Preview
  File Preview
  • FA
    • total-stats.md (482 B)
    • dev-stats.md (386 B)
    • README.md (4 kB)
    • train-stats.md (487 B)
    • dev.cupt (414 kB)
    • test-stats.md (369 B)
    • train.cupt (3 MB)
    • test.cupt (479 kB)
Name
FR.zip
Size
1.05 MB
Format
application/zip
Description
MD5
8ba924bfb5b1f528eec55c91fbee683a
Preview
  File Preview
  • FR
    • total-stats.md (364 B)
    • dev-stats.md (308 B)
    • README.md (6 kB)
    • train-stats.md (368 B)
    • dev.cupt (608 kB)
    • test-stats.md (267 B)
    • train.cupt (5 MB)
    • test.cupt (561 kB)
Name
GRC.zip
Size
286.83 KB
Format
application/zip
Description
MD5
40ee6036459bb6587433d763d78abf12
Preview
  File Preview
  • GRC
    • total-stats.md (252 B)
    • README.md (2 kB)
    • test-stats.md (256 B)
    • test.cupt (1 MB)
Name
HE.zip
Size
5.06 MB
Format
application/zip
Description
MD5
8aa3e6767439b67732dae8607e19e159
Preview
  File Preview
  • HE
    • total-stats.md (432 B)
    • dev-stats.md (352 B)
    • README.md (7 kB)
    • train-stats.md (437 B)
    • dev.cupt (2 MB)
    • test-stats.md (311 B)
    • train.cupt (24 MB)
    • test.cupt (1 MB)
Name
JA.zip
Size
999.38 KB
Format
application/zip
Description
MD5
7bb9047255c3b4ad0d057b68bd453e7a
Preview
  File Preview
  • JA
    • total-stats.md (346 B)
    • dev-stats.md (248 B)
    • README.md (2 kB)
    • train-stats.md (351 B)
    • dev.cupt (624 kB)
    • test-stats.md (283 B)
    • train.cupt (5 MB)
    • test.cupt (1 MB)
Name
KA.zip
Size
19.34 MB
Format
application/zip
Description
MD5
3d91441ae32dc7a4703c9ea04f5b2db7
Preview
  File Preview
  • KA
    • total-stats.md (239 B)
    • dev-stats.md (202 B)
    • README.md (6 kB)
    • train-stats.md (212 B)
    • dev.cupt (6 MB)
    • test-stats.md (237 B)
    • train.cupt (56 MB)
    • test.cupt (80 MB)
Name
LV.zip
Size
4.33 MB
Format
application/zip
Description
MD5
c247f72e924b6efa1ca642c0adf91ec0
Preview
  File Preview
  • LV
    • total-stats.md (436 B)
    • dev-stats.md (346 B)
    • README.md (3 kB)
    • train-stats.md (419 B)
    • dev.cupt (2 MB)
    • test-stats.md (386 B)
    • train.cupt (20 MB)
    • test.cupt (3 MB)
Name
NL.zip
Size
118.92 KB
Format
application/zip
Description
MD5
60ad031333c7e79a569439a4d49e1dd2
Preview
  File Preview
  • NL
    • total-stats.md (439 B)
    • dev-stats.md (178 B)
    • README.md (3 kB)
    • train-stats.md (323 B)
    • dev.cupt (10 kB)
    • test-stats.md (442 B)
    • train.cupt (112 kB)
    • test.cupt (544 kB)
Name
PL.zip
Size
7.34 MB
Format
application/zip
Description
MD5
2e4e1fadc19054d336c571109c9f35e3
Preview
  File Preview
  • PL
    • total-stats.md (395 B)
    • dev-stats.md (313 B)
    • README.md (13 kB)
    • train-stats.md (400 B)
    • dev.cupt (4 MB)
    • test-stats.md (310 B)
    • train.cupt (39 MB)
    • test.cupt (1 MB)
Name
PT.zip
Size
385.63 KB
Format
application/zip
Description
MD5
2a76074c66e75d58e3fe8c96520ef211
Preview
  File Preview
  • PT
    • total-stats.md (367 B)
    • dev-stats.md (194 B)
    • README.md (8 kB)
    • train-stats.md (328 B)
    • dev.cupt (60 kB)
    • test-stats.md (338 B)
    • train.cupt (585 kB)
    • test.cupt (1 MB)
Name
RO.zip
Size
17.44 MB
Format
application/zip
Description
MD5
5676b9f37b4017e13c2eaffa6ecae7d6
Preview
  File Preview
  • RO
    • total-stats.md (499 B)
    • dev-stats.md (444 B)
    • README.md (5 kB)
    • train-stats.md (504 B)
    • dev.cupt (10 MB)
    • test-stats.md (317 B)
    • train.cupt (94 MB)
    • test.cupt (776 kB)
Name
SL.zip
Size
2.82 MB
Format
application/zip
Description
MD5
4d5ce20742a22381b18d8ba4cac10720
Preview
  File Preview
  • SL
    • total-stats.md (517 B)
    • dev-stats.md (382 B)
    • README.md (5 kB)
    • train-stats.md (520 B)
    • dev.cupt (1 MB)
    • test-stats.md (404 B)
    • train.cupt (13 MB)
    • test.cupt (1 MB)
Name
SR.zip
Size
2.62 MB
Format
application/zip
Description
MD5
635421eb6b47262c37d442a14e4535a4
Preview
  File Preview
  • SR
    • total-stats.md (383 B)
    • dev-stats.md (315 B)
    • README.md (3 kB)
    • train-stats.md (387 B)
    • dev.cupt (1 MB)
    • test-stats.md (290 B)
    • train.cupt (13 MB)
    • test.cupt (826 kB)
Name
SV.zip
Size
1.34 MB
Format
application/zip
Description
MD5
f88f2d348c7ad7efab41be468caa6a29
Preview
  File Preview
  • SV
    • total-stats.md (498 B)
    • dev-stats.md (448 B)
    • README.md (6 kB)
    • train-stats.md (501 B)
    • dev.cupt (721 kB)
    • test-stats.md (468 B)
    • train.cupt (6 MB)
    • test.cupt (1 MB)
Name
UK.zip
Size
2.82 MB
Format
application/zip
Description
MD5
51a5334d07f099ede1b7f19b4fea0ffa
Preview
  File Preview
  • UK
    • total-stats.md (498 B)
    • dev-stats.md (395 B)
    • README.md (3 kB)
    • train-stats.md (502 B)
    • dev.cupt (1 MB)
    • test-stats.md (360 B)
    • train.cupt (14 MB)
    • test.cupt (1 MB)
Name
trial.zip
Size
28.21 KB
Format
application/zip
Description
MD5
a5c7cc9f41b1f1dcb4205989c1f4763e
Preview
  File Preview
  • trial
    • README.md (2 kB)
    • FR
      • total-stats.md (236 B)
      • trial.train.cupt (12 kB)
      • trial.test.system.cupt (12 kB)
      • trial.test.cupt (12 kB)
      • trial.train-stats.md (213 B)
      • trial.test.blind.cupt (12 kB)
      • trial.test-stats.md (228 B)
    • EN
      • total-stats.md (267 B)
      • trial.train.cupt (8 kB)
      • trial.test.system.cupt (9 kB)
      • trial.test.cupt (9 kB)
      • trial.train-stats.md (249 B)
      • trial.test.blind.cupt (9 kB)
      • trial.test-stats.md (204 B)
Name
tools.zip
Size
620.02 KB
Format
application/zip
Description
MD5
57b8fc20503fd94e7ac9e0ed9448f193
Preview
  File Preview
  • tools
    • parseme_evaluate.py (36 kB)
    • LICENSE (34 kB)
    • bmc_munkres
      • LICENSE (561 B)
      • README.md (1 kB)
      • munkres.py (23 kB)
    • README.md (6 kB)
    • valid_parseme
      • mwe.json (543 B)
      • languages.code (54 B)
    • average_of_evaluations.py (7 kB)
    • parseme_validate.py (32 kB)
    • valid_ud
      • data
        • deprel.shopen (311 B)
        • edeprel.ta (2 kB)
        • tokens_w_space.ud (27 B)
        • edeprel.ar (27 kB)
        • tokens_w_space.br (443 B)
        • tokens_w_space.koi (86 B)
        • docdeps.json (259 kB)
        • tokens_w_space.kk (859 B)
        • tokens_w_space.nl (592 B)
        • cpos.ud (79 B)
        • edeprels.json (833 kB)
        • tokens_w_space.hit (72 B)
        • edeprel.lt (2 kB)
        • tokens_w_space.am (82 B)
        • deprels.json (783 kB)
        • tokens_w_space.sv (131 B)
        • tokens_w_space.myv (82 B)
        • tokens_w_space.mdf (82 B)
        • tokens_w_space.pl (366 B)
        • feats.json (1 MB)
        • edeprel.uk (5 kB)
        • tokens_w_space.akk (246 B)
        • tokens_w_space.ja (844 B)
        • tokens_w_space.vi (3 B)
        • deprel.ud (237 B)
        • data.json (463 kB)
        • tokens_w_space.fro (10 B)
        • tokens_w_space.kpv (86 B)
        • tokens_w_space.shopen (9 B)
        • tokens_w_space.kmr (277 B)
        • feat_val.shopen (2 kB)
        • docfeats.json (1 MB)
        • tokens_w_space.fr (10 B)
        • tokens_w_space.lv (178 B)
        • tokens_w_space.sms (151 B)
        • tokens_w_space.lt (1 kB)
        • tokens_w_space.sjo (2 kB)
        • README.md (808 B)
        • tokens_w_space.apu (123 B)
        • tokens_w_space.fi (78 B)
        • tokens_w_space.sga (86 B)
      • validate.py (184 kB)
Name
README
Size
5.15 KB
Format
application/octet-stream
Description
MD5
1328c1f1abd4134c1595893d068ed24c
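
The MD5 values listed with each file can be used to check that a download arrived intact. The following is a minimal sketch using only the Python standard library. The helper names (`md5_of_file`, `verify`) and the `EXPECTED_MD5` table are hypothetical, the table holds just a few of the checksums copied from the listing above, and the script assumes the archives sit in the current directory under their listed names.

```python
import hashlib
from pathlib import Path

# A few checksums copied from the file listing; extend as needed.
EXPECTED_MD5 = {
    "EGY.zip": "f73fc659add9982731375f85b9ad837c",
    "EL.zip": "3aafe0bcc02a02fb5f27968e999afee4",
    "FR.zip": "8ba924bfb5b1f528eec55c91fbee683a",
    "tools.zip": "57b8fc20503fd94e7ac9e0ed9448f193",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True when the file's MD5 matches the expected hex string."""
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        archive = Path(name)
        if not archive.exists():
            print(f"{name}: not found, skipped")
        elif verify(archive, expected):
            print(f"{name}: OK")
        else:
            print(f"{name}: checksum mismatch")
```

MD5 is adequate here for detecting transfer corruption, which is what the repository listing supports; it is not a safeguard against deliberate tampering.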