Czech OOV Inflection Dataset is a Czech inflection dataset of nouns, focused on evaluation in out-of-vocabulary (OOV) conditions. It consists of two parts: a standard lemma-disjoint train-dev-test split of a subset of noun paradigms of existing morphological dictionary Czech MorfFlex 2.0 (files train, dev and test-MorfFlex); and small set of neologisms from Čeština 2.0, annotated for inflected forms (file test-neologisms).
The datasets described in Droganova, Kira, and Daniel Zeman. "Towards a Unified Taxonomy of Deep Syntactic Relations." Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024.
Four languages are included in this release. English PropBank is omitted due to its license terms.
This dataset comprises a corpus of 50 text contexts, each about 60 words in length, sourced from five distinct domains. Each context has been evaluated by multiple annotators who identified and ranked the most important words—up to 10% of each text—according to their perceived significance. The annotators followed specific guidelines to ensure consistency in word selection and ranking. For further details, please refer to the cited source.
---
rankings_task.csv
- This csv contains information about the contexts which are to be annotated:
- id: A unique identifier for each task.
- content: The context to be ranked.
---
rankings_ranking.csv
- This csv includes ranking information for various assignments. It contains four columns:
- id: A unique identifier for each ranking entry.
- score: The score assigned to the entry.
- word_order: A JSON detailing the order of words positions. It is essentially the selected word positions and their ordering from an annotator.
- assignment_id: A reference ID linking to the assignments.
---
rankings_assignment.csv
- This csv tracks the completion status of tasks by users. It includes four columns:
- id: A unique identifier for each assignment entry.
- is_completed: A binary indicator (1 for completed, 0 for not completed).
- task_id: A reference ID linking to the tasks.
- user_id: The identifier for the user who should complete the task (rank the words).
---
Known Issues:
Please note that each annotator was intended to rank each context only once. However, due to a bug in the deployment of the annotation tool, some entries may be duplicated. Users of this dataset should be cautious of this issue and verify the uniqueness of the annotations where necessary.
---
This dataset is a part of work from a bachelor thesis:
OSUSKÝ, Adam. Predicting Word Importance Using Pre-Trained Language Models. Bachelor thesis, supervisor Javorský, Dávid. Prague: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, 2024.