Word Importance Dataset
Please use the following text to cite this item or export to a predefined format:
Osuský, Adam and Javorský, Dávid, 2024,
Word Importance Dataset, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-5520.
Authors
Item identifier
Date issued
2024
Size
2861 tokens
Language(s)
Description
This dataset comprises a corpus of 50 text contexts, each about 60 words in length, sourced from five distinct domains. Each context has been evaluated by multiple annotators, who identified the most important words (up to 10% of each text) and ranked them by perceived significance. The annotators followed specific guidelines to ensure consistency in word selection and ranking. For further details, please refer to the cited source.
---
rankings_task.csv
- This CSV file contains the contexts to be annotated. It has two columns:
- id: A unique identifier for each task.
- content: The context to be ranked.
---
rankings_ranking.csv
- This csv includes ranking information for various assignments. It contains four columns:
- id: A unique identifier for each ranking entry.
- score: The score assigned to the entry.
- word_order: A JSON value listing the positions of the words an annotator selected, in the order the annotator ranked them.
- assignment_id: A reference ID linking to the assignments.
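
As a rough illustration of how a word_order value might be turned back into words, here is a minimal sketch. It rests on assumptions the description does not confirm: that word_order is a JSON array of integers, that positions are 0-based, and that they index the context split on whitespace. The function name selected_words and the example sentence are hypothetical; check a few rows of the real data before relying on any of this.

```python
import json


def selected_words(content, word_order_json):
    """Return the selected words in the annotator's ranking order.

    Assumptions (not confirmed by the dataset description):
    - word_order_json is a JSON array of integer word positions;
    - positions are 0-based indices into the whitespace-split context.
    """
    tokens = content.split()
    positions = json.loads(word_order_json)
    return [tokens[p] for p in positions]


# Hypothetical example, not taken from the dataset.
context = "The quick brown fox jumps over the lazy dog"
print(selected_words(context, "[3, 8]"))  # prints ['fox', 'dog']
```

If the positions turn out to be 1-based, or the tokenization differs from a plain whitespace split, only the indexing line needs to change.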
---
rankings_assignment.csv
- This csv tracks the completion status of tasks by users. It includes four columns:
- id: A unique identifier for each assignment entry.
- is_completed: A binary indicator (1 for completed, 0 for not completed).
- task_id: A reference ID linking to the tasks.
- user_id: The identifier for the user who should complete the task (rank the words).
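
The three files reference each other through assignment_id and task_id, so a ranking can be traced back to its annotator and context. The following is a sketch of that join using only the Python standard library. It assumes each CSV has a header row with exactly the column names listed above and is comma-delimited UTF-8; load_rows and join_rankings are hypothetical helper names, and the demo rows are invented.

```python
import csv


def load_rows(path):
    """Read a CSV file with a header row into a list of dicts (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def join_rankings(tasks, assignments, rankings):
    """Attach user_id, task_id and context text to every ranking row.

    Rankings whose assignment or task cannot be found are skipped.
    """
    tasks_by_id = {t["id"]: t for t in tasks}
    assignments_by_id = {a["id"]: a for a in assignments}
    joined = []
    for r in rankings:
        a = assignments_by_id.get(r["assignment_id"])
        if a is None:
            continue
        t = tasks_by_id.get(a["task_id"])
        if t is None:
            continue
        joined.append({
            "ranking_id": r["id"],
            "user_id": a["user_id"],
            "task_id": a["task_id"],
            "content": t["content"],
            "word_order": r["word_order"],
            "score": r["score"],
        })
    return joined


# Invented in-memory rows standing in for the real files.
demo_tasks = [{"id": "1", "content": "a b c"}]
demo_assignments = [{"id": "10", "is_completed": "1", "task_id": "1", "user_id": "7"}]
demo_rankings = [{"id": "100", "score": "0.5", "word_order": "[2]", "assignment_id": "10"}]
print(join_rankings(demo_tasks, demo_assignments, demo_rankings))
```

With the real files, the inputs would come from load_rows("rankings_task.csv"), load_rows("rankings_assignment.csv") and load_rows("rankings_ranking.csv").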
---
Known Issues:
Each annotator was meant to rank each context only once. However, due to a bug in the deployment of the annotation tool, some entries may be duplicated. Users of this dataset should be aware of this and check that each annotator's ranking of a context is unique where it matters for their analysis.
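
One possible way to check for such duplicates is to group rankings by the (user_id, task_id) pair they resolve to and flag any pair with more than one ranking. The sketch below assumes rows have already been read into dicts keyed by the column names described above; find_duplicates is a hypothetical helper, and the description does not say whether the duplication occurs among assignments, rankings, or both, so the grouping is chosen to catch either case.

```python
from collections import defaultdict


def find_duplicates(assignments, rankings):
    """Map each (user_id, task_id) pair with more than one ranking to its ranking ids."""
    pair_by_assignment = {
        a["id"]: (a["user_id"], a["task_id"]) for a in assignments
    }
    groups = defaultdict(list)
    for r in rankings:
        pair = pair_by_assignment.get(r["assignment_id"])
        if pair is not None:
            groups[pair].append(r["id"])
    return {pair: ids for pair, ids in groups.items() if len(ids) > 1}


# Invented rows: user 7 has two assignments, each ranked once, for task 1.
demo_assignments = [
    {"id": "1", "is_completed": "1", "task_id": "1", "user_id": "7"},
    {"id": "2", "is_completed": "1", "task_id": "1", "user_id": "7"},
    {"id": "3", "is_completed": "1", "task_id": "2", "user_id": "7"},
]
demo_rankings = [
    {"id": "100", "score": "1", "word_order": "[0]", "assignment_id": "1"},
    {"id": "101", "score": "1", "word_order": "[0]", "assignment_id": "2"},
    {"id": "102", "score": "1", "word_order": "[1]", "assignment_id": "3"},
]
print(find_duplicates(demo_assignments, demo_rankings))
# prints {('7', '1'): ['100', '101']}
```

Which copy to keep (the first, the last, or neither) is a judgment call that the dataset description leaves open.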
---
This dataset is a part of work from a bachelor thesis:
OSUSKÝ, Adam. Predicting Word Importance Using Pre-Trained Language Models. Bachelor thesis, supervisor Javorský, Dávid. Prague: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, 2024.
Subject(s)
Collections
This item is Publicly Available
and licensed under:
Files in this item
- Name: rankings_task.csv
- Size: 13.88 KB
- Format: application/octet-stream
- Description: Unknown
- MD5: ea77936a8a4d8a13e9e272044abc8dcf

- Name: rankings_assignment.csv
- Size: 6.48 KB
- Format: application/octet-stream
- Description: Unknown
- MD5: d6bcd5b307765fe814ad854a2f18cb43

- Name: rankings_ranking.csv
- Size: 58.17 KB
- Format: application/octet-stream
- Description: Unknown
- MD5: c48cd70ec0f3365f785dc647221604bb
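
The MD5 values listed for the three files can be used to check that a download is intact. A minimal sketch with Python's hashlib follows; md5_of_file and verify are hypothetical helper names, and the files are assumed to sit in a local directory under their original names.

```python
import hashlib
import os

# Checksums as listed for the files in this item.
EXPECTED_MD5 = {
    "rankings_task.csv": "ea77936a8a4d8a13e9e272044abc8dcf",
    "rankings_assignment.csv": "d6bcd5b307765fe814ad854a2f18cb43",
    "rankings_ranking.csv": "c48cd70ec0f3365f785dc647221604bb",
}


def md5_of_file(path, chunk_size=65536):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Return {filename: True/False} for each listed file; missing files map to False."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        results[name] = os.path.isfile(path) and md5_of_file(path) == expected
    return results
```

Calling verify("path/to/download") returns a dict that shows at a glance which files match their listed checksum.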


