README ------ This resource contains the annotated corpora and evaluation scripts for the [PARSEME Shared Task on Verbal Multi-Word Expressions Identification - edition 1.1 (2018)](http://multiword.ef.net/sharedtask2018). This resource contains corpora in multiple languages. Corpora were annotated by human annotators with occurrences of verbal multiword expressions (VMWEs) according to common [annotation guidelines](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/). This archival version was generated from our [gitlab repository](https://gitlab.com/parseme/sharedtask-data/) - [commit 94bc0a1c](https://gitlab.com/parseme/sharedtask-data/commit/94bc0a1c72be08fbc5f0a5e88693d069aa04961e) Corpora ------- Annotations can be found for multiple language directories, such as `FR` for French, `DE` for German, `PL` for Polish, etc. Inside each language directory, you may find these files: * `README.md`: A description of the available data for the given language. * `train.cupt`: Training data in cupt format. * `dev.cupt`: Development data in cupt format. * `test.blind.cupt`: Blind test data in cupt format (released to participants on April 30, 2018) * `test.cupt`: Gold test data in cupt format (released to participants after the shared task evaluation phase, on May 11, 2018) * `{train,test,dev}-stats.md`: Number of sentences, tokens and annotated VMWEs in each part of the corpus. Note: For some languages, some fields may contain data that does not use the [Universal Dependencies](http://universaldependencies.org/) tagsets. Format ------- The [cupt format](http://multiword.sourceforge.net/cupt-format) is an extension of the [CoNLL-U](http://universaldependencies.org/format.html) format containing the original 10 columns from CoNLL-U + 1 additional column `PARSEME:MWE` containing VMWE annotations. Depending on the language, different types of information are available in the first 10 columns (check the language-specific `README.md` files). The `PARSEME:MWE` column encodes information about VMWEs present in a sentence. It is very similar to the fourth column of the 2017 [parsemetsv format](https://typo.uni-konstanz.de/parseme/index.php/2-general/184-parseme-shared-task-format-of-the-final-annotation): * It contains a star `*` if the word in the current line is not part of a VMWE, or if the current line describes a multiword tokens (e.g. _2-3 don't_). * It contains an underscore `_` if this information is underspecified (e.g. in the blind test corpus). * It contains a list of semicolon-separated _VMWE codes_ if the current word is part of one or more VMWEs. VMWE codes are only assigned to the lexicalized components of a VMWE (see [Lexicalized components and open slots](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=lexicalized) in the annotation guidelines). * If the current line contains the **first** lexicalized component of the VMWE in the sentence, the VMWE code consists of a _VMWE identifier_ followed by a colon `:` and a _VMWE category label_, for example: `1:VID` * VMWE identifiers are integers starting from 1 for each new sentence, and increased by 1 for each new VMWE. * VMWE category labels are strings corresponding to the category of the VMWE (see [VMWE categories](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=categ) in the annotation guidelines). The following VMWE category labels are allowed in shared task 1.1: * `LVC.full`: light-verb constructions in which the verb only adds meaning expressed as morphological features, for example: _to **give** a **lecture**_ (previously LVC). * `LVC.cause`: light-verb constructions in which the verb adds a causative meaning to the noun, for example: _to **grant rights**_ (new) * `VID`: verbal idioms, for example: _to **go bananas**_ (previously ID). * `IRV`: inherently reflexive verbs, for example: _to **help oneself** to the cookies_ (previously IReflV). * `VPC.full`: fully non-compositional verb-particle constructions in which the particle totally changes the meaning of the verb, for example: _to **do in**_ (previously VPC). * `VPC.semi`: semi-compositional verb-particle constructions, in which the particle adds a partly predictable but non-spatial meaning to the verb, for example: _to **eat up**_ (new). * `MVC`: multi-verb constructions, for example: _to **make do**_ (new). * `IAV`: inherently adpositional verbs, for example: _to **come across**_ (new). * If the current line contains a lexicalized component of the VMWE which is **not the first** one in the sentence, the VMWE code contains the VMWE identifier only, as described above, and no VMWE category label. Scripts ------- The `bin` directory contains useful scripts: * `bin/evaluate.py`: script that assesses system predictions according to gold annotations by calculating the [evaluation metrics](http://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_04_LAW-MWE-CxG_2018&subpage=CONF_50_Evaluation_metrics). * `bin/validate_cupt.py`: script for checking if the prediction file is in the proper format. * `bin/parsemetsv2cupt.py`: script that converts the old PARSEME-TSV format of shared task 1.0 (2017) into the new cupt format. * `bin/average_of_evaluations.py`: script used to calculate cross-lingual macro-averages for general and phenomenon-specific metrics. Trial data ---------- The `trial` directory contains trial data: * `trial/EN_trial-test_pred.cupt`: Example of file containing predicted VMWEs in English. * `trial/EN_trial-test_gold.cupt`: Example of file containing gold VMWE annotations in English. * `trial/EN_trial-train.cupt`: Example of file containing a training corpus in English.