Supplementary files for a comparative study of word-formation without the addition of derivational affixes (conversion) in English and Czech.
The two .csv files contain 300 verb-noun conversion pairs in English and 300 verb-noun conversion pairs in Czech, i.e. pairs where either the noun is created from the verb or the verb is created from the noun without the use of derivational affixes. In English, the noun and verb in the conversion pair have the same form. In Czech, the noun and verb in the conversion pair differ in inflectional affixes.
The pairs are supplied with manual semantic annotation based on cognitive event schemata.
A file with the Appendix includes a list of dictionary definition phrases used as a basis for the semantic annotation.
This dataset can serve as a training and evaluation corpus for the task of training keyword detection with speaker direction estimation (keyword direction of arrival - KWDOA).
It was created by processing the existing Speech Commands dataset [1] with the PyroomAcoustics library so that the resulting speech recordings simulate the usage of a circular microphone array with 4 microphones having a distance of 57 mm between adjacent microphones. Such design of a simulated microphone array was chosen in order to match the existing physical microphone array from the Seeeduino series.
[1] Warden, Pete. “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition.” ArXiv.org, 2018, arxiv.org/abs/1804.03209
Data from a questionnaire survey conducted from 2022-08-25 to 2022-11-15 and exploring the use of machine translation by Ukrainian refugees in the Czech Republic. The presented spreadsheet contains minimally processed data exported from the two questionnaires that were created in Google Forms in the Ukrainian and the Russian language. The links to these questionnaires were distributed by three methods: direct email to particular refugees whose contact details the authors obtained while volunteering; through a non-profit organisation helping refugees (Vesna women’s education institution) and on social networks by posting links to the survey in groups associating the Ukrainian community across Czech regions and towns.
Since we asked potential respondents to spread the questionnaire further, we could not prevent it from reaching Ukrainians who had arrived in Czechia previously, or received temporary protection in other countries. Due to this fact, the textual answers to the question 1.5 "Which country are you in right now?" were replaced in the dataset by numbers (1 for the Czech Republic, 2 for other countries) in order for us to be able to separate the data of respondents not located in the Czech Republic, which were irrelevant for our survey. Also, in this version of the dataset, the textual answers to the question 1.6 "How many months have you been to this country?" were replaced by numbers, so that we could separate the data of respondents who arrived in the Czech Republic in February 2022 or later from the other data (0 for those staying in Czechia before February 2022, 1 for those staying in Czechia since February 2022 or later, 2 for those staying in other countries).
Data from a questionnaire survey conducted from 2022-08-25 to 2022-11-15 and exploring the use of machine translation by Ukrainian refugees in the Czech Republic. The presented spreadsheet contains minimally processed data exported from the two questionnaires that were created in Google Forms in the Ukrainian and the Russian language. The links to these questionnaires were distributed by three methods: direct email to particular refugees whose contact details the authors obtained while volunteering; through a non-profit organisation helping refugees (Vesna women’s education institution) and on social networks by posting links to the survey in groups associating the Ukrainian community across Czech regions and towns.
Since we asked potential respondents to spread the questionnaire further, we could not prevent it from reaching Ukrainians who had arrived in Czechia previously, or received temporary protection in other countries. Due to this fact, the textual answers to the question 1.5 "Which country are you in right now?" were replaced in the dataset by numbers (1 for Czech Republic, 2 for other countries) in order for us to be able to separate the data of respondents not located in the Czech Republic, which were irrelevant for our survey.
VPS-GradeUp is a collection of triple manual annotations of 29 English verbs based on the Pattern Dictionary of English Verbs (PDEV) and comprising the following lemmas: abolish, act, adjust, advance, answer, approve, bid, cancel, conceive, cultivate, cure, distinguish, embrace, execute, hire, last, manage, murder, need, pack, plan, point, praise, prescribe, sail, seal, see, talk, urge . It contains results from two different tasks:
1. Graded decisions
2. Best-fit pattern (WSD) .
In both tasks, the annotators were matching verb senses defined by the PDEV patterns with 50 actual uses of each verb (using concordances from the BNC [2]). The verbs were randomly selected from a list of completed PDEV lemmas with at least 3 patterns and at least 100 BNC concordances not previously annotated by PDEV’s own annotators. Also, the selection excluded verbs contained in VPS-30-En[3], a data set we developed earlier. This data set was built within the project Reviving Zellig S. Harris: more linguistic information for distributional lexical analysis of English and Czech and in connection with the SemEval-2015 CPA-related task.
This dataset comprises a corpus of 50 text contexts, each about 60 words in length, sourced from five distinct domains. Each context has been evaluated by multiple annotators who identified and ranked the most important words—up to 10% of each text—according to their perceived significance. The annotators followed specific guidelines to ensure consistency in word selection and ranking. For further details, please refer to the cited source.
---
rankings_task.csv
- This csv contains information about the contexts which are to be annotated:
- id: A unique identifier for each task.
- content: The context to be ranked.
---
rankings_ranking.csv
- This csv includes ranking information for various assignments. It contains four columns:
- id: A unique identifier for each ranking entry.
- score: The score assigned to the entry.
- word_order: A JSON detailing the order of words positions. It is essentially the selected word positions and their ordering from an annotator.
- assignment_id: A reference ID linking to the assignments.
---
rankings_assignment.csv
- This csv tracks the completion status of tasks by users. It includes four columns:
- id: A unique identifier for each assignment entry.
- is_completed: A binary indicator (1 for completed, 0 for not completed).
- task_id: A reference ID linking to the tasks.
- user_id: The identifier for the user who should complete the task (rank the words).
---
Known Issues:
Please note that each annotator was intended to rank each context only once. However, due to a bug in the deployment of the annotation tool, some entries may be duplicated. Users of this dataset should be cautious of this issue and verify the uniqueness of the annotations where necessary.
---
This dataset is a part of work from a bachelor thesis:
OSUSKÝ, Adam. Predicting Word Importance Using Pre-Trained Language Models. Bachelor thesis, supervisor Javorský, Dávid. Prague: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, 2024.
Czech translation of WordSim353. The Czech translation of English WordSim353 word pairs were obtained from four translators. All translation variants were scored according to the lexical similarity/relatedness annotation instructions for WordSim353 annotators, by 25 Czech annotators. The resulting data set consists of two annotation files: "WordSim353-cs.csv" and "WordSim-cs-Multi.csv". Both files are encoded in UTF-8, have a header, text is enclosed in double quotes, and columns are separated by commas. The rows are numbered. The WordSim-cs-Multi data set has rows numbered from 1 to 634, whereas the row indices in the WordSim353-cs data set reflect the corresponding row numbers in the WordSim-cs-Multi data set.
The WordSim353-cs file contains a one-to-one mapping selection of 353 Czech equivalent pairs whose judgments have proven to be most similar to the judgments of their corresponding English originals (compared by the absolute value of the difference between the means over all annotators in each language counterpart). In one case ("psychology-cognition"), two Czech equivalent pairs had identical means as well as confidence intervals, so we randomly selected one.
The "WordSim-cs-Multi.csv" file contains human judgments for all translation variants.
In both data sets, we preserved all 25 individual scores. In the WordSim353-cs data set, we added a column with their Czech means as well as a column containing the original English means and 95% confidence intervals in separate columns for each mean (computed by the CI function in the Rmisc R package). The WordSim-cs-Multi data set contains only the Czech means and confidence intervals. For the most convenient lexical search, we provided separate columns with the respective Czech and English single words, entire word pairs, and eventually an English-Czech quadruple in both data sets.
The data set also contains an xls table with the four translations and a preliminary selection of the best variants performed by an adjudicator.