README
------

This resource contains the annotated corpora and evaluation scripts for the [PARSEME Shared Task on Verbal Multi-Word Expressions Identification - edition 1.2 (2020)](http://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_02_MWE-LEX_2020___lb__COLING__rb__&subpage=CONF_40_Shared_Task). The corpora cover multiple languages and were annotated by human annotators with occurrences of verbal multiword expressions (VMWEs) according to common [annotation guidelines](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.2/).

This archival version was generated from our [gitlab repository](https://gitlab.com/parseme/sharedtask-data/) - [commit a8e210c3](https://gitlab.com/parseme/sharedtask-data/commit/a8e210c33891fbd89fb2ce6bc441b5fce8a95836).

Corpora
-------

Annotations are provided in language-specific directories, such as `FR` for French, `DE` for German, `PL` for Polish, etc. Inside each language directory, you may find these files:

* `README.md`: A description of the available data for the given language.
* `train.cupt`: Training data in cupt format.
* `dev.cupt`: Development data in cupt format.
* `test.blind.cupt`: Blind test data in cupt format (released to participants for the evaluation phase).
* `test.cupt`: Gold test data in cupt format (released to participants after the shared task evaluation phase).
* `{train,test,dev}-stats.md`: Number of sentences, tokens and annotated VMWEs in each part of the corpus.

Note: For some languages, some fields may contain data that does not use the [Universal Dependencies](http://universaldependencies.org/) tagsets.

Format
------

The [cupt format](http://multiword.sourceforge.net/cupt-format) is an extension of the [CoNLL-U](http://universaldependencies.org/format.html) format: it keeps the original 10 columns from CoNLL-U and adds an 11th column, `PARSEME:MWE`, containing the VMWE annotations. Depending on the language, different types of information are available in the first 10 columns (check the language-specific `README.md` files).

The `PARSEME:MWE` column encodes information about the VMWEs present in a sentence. It is very similar to the fourth column of the 2017 [parsemetsv format](https://typo.uni-konstanz.de/parseme/index.php/2-general/184-parseme-shared-task-format-of-the-final-annotation):

* It contains a star `*` if the word in the current line is not part of a VMWE, or if the current line describes a multiword token (e.g. _2-3 don't_).
* It contains an underscore `_` if this information is underspecified (e.g. in the blind test corpus).
* It contains a list of semicolon-separated _VMWE codes_ if the current word is part of one or more VMWEs. VMWE codes are only assigned to the lexicalized components of a VMWE (see [Lexicalized components and open slots](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=lexicalized) in the annotation guidelines).
  * If the current line contains the **first** lexicalized component of a VMWE in the sentence, the VMWE code consists of a _VMWE identifier_ followed by a colon `:` and a _VMWE category label_, for example: `1:VID`.
    * VMWE identifiers are integers starting from 1 for each new sentence, and increased by 1 for each new VMWE.
    * VMWE category labels are strings corresponding to the category of the VMWE (see [VMWE categories](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=categ) in the annotation guidelines). The following VMWE category labels are allowed in this edition of the shared task:
      * `LVC.full`: light-verb constructions in which the verb only adds meaning expressed as morphological features, for example: _to **give** a **lecture**_ (previously LVC).
      * `LVC.cause`: light-verb constructions in which the verb adds a causative meaning to the noun, for example: _to **grant rights**_ (new).
      * `VID`: verbal idioms, for example: _to **go bananas**_ (previously ID).
      * `IRV`: inherently reflexive verbs, for example: _to **help oneself** to the cookies_ (previously IReflV).
      * `VPC.full`: fully non-compositional verb-particle constructions in which the particle totally changes the meaning of the verb, for example: _to **do in**_ (previously VPC).
      * `VPC.semi`: semi-compositional verb-particle constructions in which the particle adds a partly predictable but non-spatial meaning to the verb, for example: _to **eat up**_ (new).
      * `MVC`: multi-verb constructions, for example: _to **make do**_ (new).
      * `IAV`: inherently adpositional verbs, for example: _to **come across**_ (new).
  * If the current line contains a lexicalized component of a VMWE which is **not the first** one in the sentence, the VMWE code contains only the VMWE identifier, with no category label.
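To make the encoding above concrete, here is a minimal Python sketch that reads a `.cupt` file and groups the lexicalized components of each VMWE by its identifier. It is only an illustration under simplifying assumptions: it parses the file directly instead of using the official `tsvlib.py` library shipped in `bin` (see the Scripts section below), and the path `FR/train.cupt` is merely an example.

```python
def sentences(path):
    """Yield each sentence of a .cupt file as a list of token lines split into columns."""
    sent = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line ends a sentence
                if sent:
                    yield sent
                sent = []
            elif not line.startswith("#"):    # skip metadata/comment lines
                sent.append(line.split("\t"))
    if sent:                                  # file may not end with a blank line
        yield sent


def vmwes_in_sentence(sent):
    """Group lexicalized components by VMWE identifier using the PARSEME:MWE column."""
    vmwes = {}  # identifier -> {"category": ..., "tokens": [...]}
    for cols in sent:
        token_id, mwe_col = cols[0], cols[-1]        # PARSEME:MWE is the 11th (last) column
        if mwe_col in ("*", "_"):                    # not part of a VMWE / underspecified
            continue
        for code in mwe_col.split(";"):              # e.g. "1:VID" or "2"
            if ":" in code:                          # first lexicalized component: id + category
                ident, category = code.split(":", 1)
                vmwes.setdefault(ident, {"category": None, "tokens": []})["category"] = category
            else:                                    # subsequent component: identifier only
                ident = code
            vmwes.setdefault(ident, {"category": None, "tokens": []})["tokens"].append(token_id)
    return vmwes


# Example usage (the path is only a placeholder):
for n, sent in enumerate(sentences("FR/train.cupt"), start=1):
    for ident, vmwe in vmwes_in_sentence(sent).items():
        print(f"sentence {n}: VMWE {ident} ({vmwe['category']}) -> tokens {vmwe['tokens']}")
```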
Scripts
-------

The `bin` directory contains useful scripts to evaluate and verify system predictions.

* `bin/evaluate.py`: script that assesses system predictions against the gold annotations by calculating the [evaluation metrics](http://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_04_LAW-MWE-CxG_2018___lb__COLING__rb__&subpage=CONF_50_Evaluation_metrics).
* `bin/validate_cupt.py`: script for checking that a prediction file is in the proper CUPT format.
* `bin/average_of_evaluations.py`: script used to calculate cross-lingual macro-averages for general and phenomenon-specific metrics.
* `bin/tsvlib_usage_example.py`: example script for those who want to use `tsvlib.py` to develop software manipulating CUPT files.

Trial data
----------

The `trial` directory contains trial data. All files have toy sizes (the true corpus files are much larger) and all but the last are provided in [CUPT format](http://multiword.sourceforge.net/cupt-format). Note that, while these trial files are in English, English is not part of the shared task edition 1.2.

* `EN-trial.train.cupt`: Example of a training corpus in English with manually annotated VMWEs.
* `EN-trial.test.blind.cupt`: Example of a blind version of the test data; systems should take such a file as input and provide automatic predictions of VMWEs in column 11.
* `EN-trial.test.pred.cupt`: Sample system predictions of VMWEs in English.
* `EN-trial.test.cupt`: Example of a gold version of the test data; system predictions will be compared to this gold version.
* `EN-trial.raw.conllu`: Example of a raw parsed corpus for extracting unknown VMWEs, in [CoNLL-U format](https://universaldependencies.org/format.html).
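For a small end-to-end illustration of the trial data, the following Python sketch computes the kind of counts reported in the `{train,test,dev}-stats.md` files (sentences, tokens and annotated VMWEs) for the English trial training corpus. It makes the same simplifying assumptions as the sketch in the Format section (direct parsing rather than `tsvlib.py`), and the exact counting conventions of the released statistics files may differ, so treat its output as approximate.

```python
def cupt_stats(path):
    """Rough counts of sentences, tokens and annotated VMWEs in a .cupt file.
    Assumption: each code of the form "<identifier>:<category>" marks one VMWE."""
    n_sentences = n_tokens = n_vmwes = 0
    in_sentence = False
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                                    # blank line ends a sentence
                if in_sentence:
                    n_sentences += 1
                in_sentence = False
                continue
            if line.startswith("#"):                        # metadata/comment lines
                continue
            in_sentence = True
            cols = line.split("\t")
            if "-" not in cols[0] and "." not in cols[0]:   # skip multiword-token ranges and empty nodes
                n_tokens += 1
            mwe_col = cols[-1]                              # PARSEME:MWE column
            if mwe_col not in ("*", "_"):
                n_vmwes += sum(1 for code in mwe_col.split(";") if ":" in code)
    if in_sentence:                                         # file may not end with a blank line
        n_sentences += 1
    return n_sentences, n_tokens, n_vmwes


# Hypothetical usage on the English trial training corpus:
print("sentences, tokens, VMWEs:", cupt_stats("trial/EN-trial.train.cupt"))
```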