Annotated corpora and tools of the PARSEME Shared Task : Subtask 1, MWE Identification (edition 2.0)
Please use the following text to cite this item:
Savary, Agata; et al., 2026,
Annotated corpora and tools of the PARSEME Shared Task : Subtask 1, MWE Identification (edition 2.0), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11372/LRT-6123.
Authors
Savary, Agata ; et al.
Date issued
2026-03-06
Size
4,884,449 tokens,
253,643 sentences,
141,570 multiWordUnits
Description
This multilingual resource contains corpora in which multiword expressions (MWEs) of all syntactic categories have been manually annotated. While previous editions covered only verbal MWEs, this version extends the annotation scope to all syntactic MWE categories: verbal, nominal, adjectival, adverbial and functional. This release of the corpora is associated with the PARSEME 2.0 Multilingual Shared Task on Identification and Paraphrasing of Multiword Expressions and corresponds to the data used in subtask 1, on MWE identification.

The data covers 17 languages, of which 7 are new with respect to previous PARSEME editions. The annotation process is based on cross-lingually unified guidelines, phrased as decision diagrams over linguistic tests, and a typology of 18 MWE categories. The corpus contains almost 5 million tokens, over 250,000 sentences and more than 140,000 MWE annotations.

The corpora are provided in the CUPT format, inspired by the CoNLL-U format. Morphological and syntactic information, including parts of speech, lemmas, morphological features and/or syntactic dependencies, is also provided. Depending on the language, this information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).

All corpora are split into training, development and test data, following the splitting strategy adopted for the PARSEME 2.0 Shared Task, subtask 1. This release includes the test data, which was kept secret from participants during the evaluation phase.

The annotation guidelines are available online: https://parsemefr.lis-lab.fr/parseme-st-guidelines/2.0
The CUPT format is detailed here: https://gitlab.com/parseme/corpora/-/wikis/CUPT-format
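To make the CUPT description above more concrete, here is a minimal Python sketch of how the MWE annotations might be read from a `.cupt` file. It assumes the layout described on the CUPT wiki page: a `# global.columns = ...` header naming the columns, tab-separated token lines, blank lines between sentences, and a `PARSEME:MWE` column holding `*` (no MWE), `_` (underspecified) or `;`-separated codes such as `1:VID` on the first token of an MWE and `1` on its other tokens. The sample sentence is invented for illustration and is not taken from the corpus, the `VID` label is only an example category, and the function names `read_cupt` and `extract_mwes` are hypothetical helpers, not part of the released tools.

```python
from collections import defaultdict

# Invented toy sentence (not corpus data); fields are joined with tabs below.
HEADER = "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME:MWE"
ROWS = [
    "1 He he PRON _ _ 2 nsubj _ _ *",
    "2 kicked kick VERB _ _ 0 root _ _ 1:VID",
    "3 the the DET _ _ 4 det _ _ 1",
    "4 bucket bucket NOUN _ _ 2 obj _ _ 1",
    "5 . . PUNCT _ _ 2 punct _ _ *",
]
SAMPLE = HEADER + "\n" + "\n".join("\t".join(r.split()) for r in ROWS) + "\n\n"


def read_cupt(lines):
    """Yield one list of token dicts per sentence (assumed CUPT layout)."""
    columns = None
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # Column names are declared once in the file header.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence id, text, ...)
        elif not line.strip():
            if sentence:  # a blank line closes the current sentence
                yield sentence
                sentence = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


def extract_mwes(sentence, mwe_col="PARSEME:MWE"):
    """Return {mwe_id: (category, [token forms])} for one sentence."""
    categories = {}
    members = defaultdict(list)
    for token in sentence:
        tag = token.get(mwe_col, "*")
        if tag in ("*", "_"):
            continue  # token belongs to no MWE, or annotation is underspecified
        for part in tag.split(";"):  # a token may belong to several MWEs
            mwe_id, _, category = part.partition(":")
            if category:  # the category is assumed to appear on the first token only
                categories[mwe_id] = category
            members[mwe_id].append(token["FORM"])
    return {i: (categories.get(i), forms) for i, forms in members.items()}


result = [extract_mwes(s) for s in read_cupt(SAMPLE.splitlines())]
print(result)
```

For real data, the same functions could be applied to an opened file, for example `read_cupt(open("train.cupt", encoding="utf-8"))`. The released `tools.zip` contains the official validation and evaluation scripts, which should be preferred for anything beyond quick inspection.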
Reference
-------
When referring to this release, please cite:
* Agata Savary, Manon Scholivet, Carlos Ramisch, Takuya Nakamura, Eric Bilinski, Sara Stymne, Voula Giouli, Stella Markantonatou, Vasile Păiș, Maria Mitrofan, Louis Estève, Bruno Guillaume, Verginica Barbu Mititelu, Jaka Čibej, Roberto A. Díaz Hernández, Victoria Fendel, Polona Gantar, Olha Kanishcheva, Cvetana Krstev, Chaya Liebeskind, Irina Lobzhanidze, Aleksandra Marković, Gunta Nešpore-Bērzkalne, Adriana Pagano, Mehrnoush Shamsfard, Ranka Stanković, Vahide Tajalli, Carole Tiberius, Aakanksha Padhye (2026), _PARSEME 2.0 multilingual corpus of multiword expressions_, in Proceedings of LREC 2026, ELRA, Palma de Mallorca, Spain.
Acknowledgement
COST
Project code: CA21167
Project name: UniDive
Files in this item
- Name: EGY.zip
- Size: 364.47 KB
- Format: application/zip
- MD5: f73fc659add9982731375f85b9ad837c

- EGY
- total-stats.md (334 B)
- dev-stats.md (190 B)
- README.md (5 kB)
- train-stats.md (310 B)
- dev.cupt (52 kB)
- test-stats.md (338 B)
- train.cupt (469 kB)
- test.cupt (1 MB)
- Name: EL.zip
- Size: 1005.1 KB
- Format: application/zip
- MD5: 3aafe0bcc02a02fb5f27968e999afee4

- EL
- total-stats.md (329 B)
- dev-stats.md (222 B)
- README.md (9 kB)
- train-stats.md (285 B)
- dev.cupt (365 kB)
- test-stats.md (291 B)
- train.cupt (3 MB)
- test.cupt (2 MB)
- Name: FA.zip
- Size: 875.67 KB
- Format: application/zip
- MD5: dea48c77035c6b6c6ee529c3a1328516

- FA
- total-stats.md (482 B)
- dev-stats.md (386 B)
- README.md (4 kB)
- train-stats.md (487 B)
- dev.cupt (414 kB)
- test-stats.md (369 B)
- train.cupt (3 MB)
- test.cupt (479 kB)
- Name: FR.zip
- Size: 1.05 MB
- Format: application/zip
- MD5: 8ba924bfb5b1f528eec55c91fbee683a

- FR
- total-stats.md (364 B)
- dev-stats.md (308 B)
- README.md (6 kB)
- train-stats.md (368 B)
- dev.cupt (608 kB)
- test-stats.md (267 B)
- train.cupt (5 MB)
- test.cupt (561 kB)
- Name: GRC.zip
- Size: 286.83 KB
- Format: application/zip
- MD5: 40ee6036459bb6587433d763d78abf12

- GRC
- total-stats.md (252 B)
- README.md (2 kB)
- test-stats.md (256 B)
- test.cupt (1 MB)
- Name: HE.zip
- Size: 5.06 MB
- Format: application/zip
- MD5: 8aa3e6767439b67732dae8607e19e159

- HE
- total-stats.md (432 B)
- dev-stats.md (352 B)
- README.md (7 kB)
- train-stats.md (437 B)
- dev.cupt (2 MB)
- test-stats.md (311 B)
- train.cupt (24 MB)
- test.cupt (1 MB)
- Name: JA.zip
- Size: 999.38 KB
- Format: application/zip
- MD5: 7bb9047255c3b4ad0d057b68bd453e7a

- JA
- total-stats.md (346 B)
- dev-stats.md (248 B)
- README.md (2 kB)
- train-stats.md (351 B)
- dev.cupt (624 kB)
- test-stats.md (283 B)
- train.cupt (5 MB)
- test.cupt (1 MB)
- Name: KA.zip
- Size: 19.34 MB
- Format: application/zip
- MD5: 3d91441ae32dc7a4703c9ea04f5b2db7

- KA
- total-stats.md (239 B)
- dev-stats.md (202 B)
- README.md (6 kB)
- train-stats.md (212 B)
- dev.cupt (6 MB)
- test-stats.md (237 B)
- train.cupt (56 MB)
- test.cupt (80 MB)
- Name: LV.zip
- Size: 4.33 MB
- Format: application/zip
- MD5: c247f72e924b6efa1ca642c0adf91ec0

- LV
- total-stats.md (436 B)
- dev-stats.md (346 B)
- README.md (3 kB)
- train-stats.md (419 B)
- dev.cupt (2 MB)
- test-stats.md (386 B)
- train.cupt (20 MB)
- test.cupt (3 MB)
- Name: NL.zip
- Size: 118.92 KB
- Format: application/zip
- MD5: 60ad031333c7e79a569439a4d49e1dd2

- NL
- total-stats.md (439 B)
- dev-stats.md (178 B)
- README.md (3 kB)
- train-stats.md (323 B)
- dev.cupt (10 kB)
- test-stats.md (442 B)
- train.cupt (112 kB)
- test.cupt (544 kB)
- Name: PL.zip
- Size: 7.34 MB
- Format: application/zip
- MD5: 2e4e1fadc19054d336c571109c9f35e3

- PL
- total-stats.md (395 B)
- dev-stats.md (313 B)
- README.md (13 kB)
- train-stats.md (400 B)
- dev.cupt (4 MB)
- test-stats.md (310 B)
- train.cupt (39 MB)
- test.cupt (1 MB)
- Name: PT.zip
- Size: 385.63 KB
- Format: application/zip
- MD5: 2a76074c66e75d58e3fe8c96520ef211

- PT
- total-stats.md (367 B)
- dev-stats.md (194 B)
- README.md (8 kB)
- train-stats.md (328 B)
- dev.cupt (60 kB)
- test-stats.md (338 B)
- train.cupt (585 kB)
- test.cupt (1 MB)
- Name: RO.zip
- Size: 17.44 MB
- Format: application/zip
- MD5: 5676b9f37b4017e13c2eaffa6ecae7d6

- RO
- total-stats.md (499 B)
- dev-stats.md (444 B)
- README.md (5 kB)
- train-stats.md (504 B)
- dev.cupt (10 MB)
- test-stats.md (317 B)
- train.cupt (94 MB)
- test.cupt (776 kB)
- Name: SL.zip
- Size: 2.82 MB
- Format: application/zip
- MD5: 4d5ce20742a22381b18d8ba4cac10720

- SL
- total-stats.md (517 B)
- dev-stats.md (382 B)
- README.md (5 kB)
- train-stats.md (520 B)
- dev.cupt (1 MB)
- test-stats.md (404 B)
- train.cupt (13 MB)
- test.cupt (1 MB)
- Name: SR.zip
- Size: 2.62 MB
- Format: application/zip
- MD5: 635421eb6b47262c37d442a14e4535a4

- SR
- total-stats.md (383 B)
- dev-stats.md (315 B)
- README.md (3 kB)
- train-stats.md (387 B)
- dev.cupt (1 MB)
- test-stats.md (290 B)
- train.cupt (13 MB)
- test.cupt (826 kB)
- Name: SV.zip
- Size: 1.34 MB
- Format: application/zip
- MD5: f88f2d348c7ad7efab41be468caa6a29

- SV
- total-stats.md (498 B)
- dev-stats.md (448 B)
- README.md (6 kB)
- train-stats.md (501 B)
- dev.cupt (721 kB)
- test-stats.md (468 B)
- train.cupt (6 MB)
- test.cupt (1 MB)
- Name: UK.zip
- Size: 2.82 MB
- Format: application/zip
- MD5: 51a5334d07f099ede1b7f19b4fea0ffa

- UK
- total-stats.md (498 B)
- dev-stats.md (395 B)
- README.md (3 kB)
- train-stats.md (502 B)
- dev.cupt (1 MB)
- test-stats.md (360 B)
- train.cupt (14 MB)
- test.cupt (1 MB)
- Name: trial.zip
- Size: 28.21 KB
- Format: application/zip
- MD5: a5c7cc9f41b1f1dcb4205989c1f4763e

- trial
- README.md (2 kB)
- FR
- total-stats.md (236 B)
- trial.train.cupt (12 kB)
- trial.test.system.cupt (12 kB)
- trial.test.cupt (12 kB)
- trial.train-stats.md (213 B)
- trial.test.blind.cupt (12 kB)
- trial.test-stats.md (228 B)
- EN
- total-stats.md (267 B)
- trial.train.cupt (8 kB)
- trial.test.system.cupt (9 kB)
- trial.test.cupt (9 kB)
- trial.train-stats.md (249 B)
- trial.test.blind.cupt (9 kB)
- trial.test-stats.md (204 B)
- Name: tools.zip
- Size: 620.02 KB
- Format: application/zip
- MD5: 57b8fc20503fd94e7ac9e0ed9448f193

- tools
- parseme_evaluate.py (36 kB)
- LICENSE (34 kB)
- bmc_munkres
- LICENSE (561 B)
- README.md (1 kB)
- munkres.py (23 kB)
- README.md (6 kB)
- valid_parseme
- mwe.json (543 B)
- languages.code (54 B)
- average_of_evaluations.py (7 kB)
- parseme_validate.py (32 kB)
- valid_ud
- data
- deprel.shopen (311 B)
- edeprel.ta (2 kB)
- tokens_w_space.ud (27 B)
- edeprel.ar (27 kB)
- tokens_w_space.br (443 B)
- tokens_w_space.koi (86 B)
- docdeps.json (259 kB)
- tokens_w_space.kk (859 B)
- tokens_w_space.nl (592 B)
- cpos.ud (79 B)
- edeprels.json (833 kB)
- tokens_w_space.hit (72 B)
- edeprel.lt (2 kB)
- tokens_w_space.am (82 B)
- deprels.json (783 kB)
- tokens_w_space.sv (131 B)
- tokens_w_space.myv (82 B)
- tokens_w_space.mdf (82 B)
- tokens_w_space.pl (366 B)
- feats.json (1 MB)
- edeprel.uk (5 kB)
- tokens_w_space.akk (246 B)
- tokens_w_space.ja (844 B)
- tokens_w_space.vi (3 B)
- deprel.ud (237 B)
- data.json (463 kB)
- tokens_w_space.fro (10 B)
- tokens_w_space.kpv (86 B)
- tokens_w_space.shopen (9 B)
- tokens_w_space.kmr (277 B)
- feat_val.shopen (2 kB)
- docfeats.json (1 MB)
- tokens_w_space.fr (10 B)
- tokens_w_space.lv (178 B)
- tokens_w_space.sms (151 B)
- tokens_w_space.lt (1 kB)
- tokens_w_space.sjo (2 kB)
- README.md (808 B)
- tokens_w_space.apu (123 B)
- tokens_w_space.fi (78 B)
- tokens_w_space.sga (86 B)
- validate.py (184 kB)
- data
- Name: README
- Size: 5.15 KB
- Format: application/octet-stream
- MD5: 1328c1f1abd4134c1595893d068ed24c
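
The MD5 values listed for each file can be used to check that a download is intact. The following is a minimal Python sketch of such a check, using only the standard library. The `EXPECTED` table copies three checksums from the listing as an example and would need to be extended for the other archives. The helper names `md5_of` and `verify`, and the assumption that the archives sit in the current directory, are illustrative only.

```python
import hashlib
from pathlib import Path

# Subset of the checksums listed on this page; extend as needed.
EXPECTED = {
    "EGY.zip": "f73fc659add9982731375f85b9ad837c",
    "EL.zip": "3aafe0bcc02a02fb5f27968e999afee4",
    "tools.zip": "57b8fc20503fd94e7ac9e0ed9448f193",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED):
    """Map each file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == checksum) if path.is_file() else None
    return results


if __name__ == "__main__":
    # Assumes the archives were downloaded into the current directory.
    for name, status in verify(".").items():
        label = {True: "OK", False: "MISMATCH", None: "not found"}[status]
        print(f"{name}: {label}")
```

Reading the file in chunks keeps memory use low for the larger archives. MD5 is adequate here for detecting corrupted downloads, but it is not a defence against deliberate tampering.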


