HeCz: Large Scale Self-Paced Reading Corpus Newspaper Headlines in Czech
Please use the following text to cite this item:
Chromý, Jan; Ceháková, Markéta and Brand, James, 2025,
HeCz: Large Scale Self-Paced Reading Corpus Newspaper Headlines in Czech, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-6121.
Authors
Chromý, Jan; Ceháková, Markéta; Brand, James
Item identifier
http://hdl.handle.net/11234/1-6121
Date issued
2025-11-25
Size
4476489 entries
Language(s)
Czech
Description
The HeCz corpus comprises self-paced reading data for 1919 newspaper headlines (23,634 words) in Czech. Each headline is accompanied by a yes–no comprehension question, yielding reading times for each individual word together with comprehension accuracy. The corpus is novel in the scale of its data collection: 1872 native Czech speakers each read approximately 120 headlines, and 1162 of those participants completed the experiment again in a re-testing session with the same stimuli approximately 1 month later. Participant-level metadata are also available, covering basic demographic information, reading habits, and a profile of participants' mood state prior to completing the experiment. Beyond the behavioral and demographic data, we also include linguistic annotations for several variables, e.g., frequency, surprisal, and morphological tagging.
Acknowledgement
Grantová agentura České republiky
Project code:23-06796S
Project name:Cze-Lex: Kvantifikace českého lexikonu
Collections
This item is Publicly Available
and licensed under:
Files in this item
- Name
- HeCz_stimuli_annotation.csv
- Size
- 6.93 MB
- Format
- text/csv
- Description
- MD5
- a76153b0a355bc3c3053fae948db8190

- Name
- HeCz_stimuli_annotation_description.txt
- Size
- 6.03 KB
- Format
- text/plain
- Description
- MD5
- 3390e5eaf57a4f600701ede83ddb8d6a

############################################
This is the description of variables used in the data file HeCz_stimuli_annotation.csv.

### ItemID ###
Item ID.

### List ###
Factor indicating which of the 16 randomized headline lists (119–120 items each) the participant was assigned before the reading task.

### Sentence ###
The wording of the headline used in the data collection.

### WordNumber ###
The position of the word in the headline (1st word = 1, 2nd word = 2 etc.).

### WordPresented ###
The exact form of the word presented (including punctuation etc.).

### WordClean_lc ###
Word presented in the task, but in lower case and cleaned of punctuation and other symbols such as quotation marks, brackets etc.

### Lemma ###
Lemma of the word.

### Lemma_lc ###
Lemma of the word in lowercase.

### Multiword ###
If "yes", then the word is a part of a multiword sequence. Typically, these are multiword proper nouns (e.g. Von Der Leyen, Washington Post, O2 arena, Premier League).

### LemmaMultiword ###
The lemma of the multiword sequence.

### WordPresentedCharN ###
Number of characters of WordPresented.

### HeadlineWordN ###
Length of the headline in words.

### POS ###
Part of speech or, more precisely, type of word. 13 values: (i) abbreviation = abbreviation or acronym, (ii) adjective, (iii) adverb, (iv) conjunction, (v) interjection, (vi) multiple = part of a multiword sequence, (vii) noun, (viii) numeral = numeral or number, (ix) particle, (x) preposition, (xi) pronoun, (xii) symbol = specific character(s) such as %, dash etc., (xiii) verb.

### GramGender ###
Grammatical gender of the word. Coded for adjectives, nouns, verb forms which express gender, pronouns which express gender, and numerals which express gender. Four values: (i) masc = masculine, (ii) fem = feminine, (iii) neut = neuter, (iv) indeclinable = used for a specific type of borrowed adjectives which do not carry a gender feature (such as kardio, online, fake).

### Number ###
Grammatical . . .
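Because both HeCz_stimuli_annotation.csv and HeCz_reaction_times.csv document ItemID and WordNumber columns, word-level annotations can plausibly be attached to reading times by joining on that pair. The following is a minimal sketch under stated assumptions, not the authors' procedure: it assumes the files are comma-delimited with a header row, that (ItemID, WordNumber) identifies a word uniquely, and it uses made-up sample rows (the ID formats `h1` and `p1` are hypothetical).

```python
import csv
import io


def index_annotations(rows):
    """Map (ItemID, WordNumber) to the annotation row for that word."""
    return {(r["ItemID"], int(r["WordNumber"])): r for r in rows}


def attach_annotations(rt_rows, annotations, columns):
    """Yield reading-time rows extended with the selected annotation columns."""
    for r in rt_rows:
        key = (r["ItemID"], int(r["WordNumber"]))
        ann = annotations.get(key, {})  # missing annotation -> columns set to None
        yield {**r, **{c: ann.get(c) for c in columns}}


# Made-up sample rows standing in for the two real CSV files (assumption).
stimuli_csv = io.StringIO(
    "ItemID,WordNumber,WordPresented,POS,WordPresentedCharN\n"
    "h1,1,Vláda,noun,5\n"
    "h1,2,schválila,verb,9\n"
)
rt_csv = io.StringIO(
    "Participant,Round,ItemID,WordNumber,WordPresented,RTWord\n"
    "p1,round1,h1,1,Vláda,312\n"
    "p1,round1,h1,2,schválila,405\n"
)

annotations = index_annotations(csv.DictReader(stimuli_csv))
merged = list(
    attach_annotations(csv.DictReader(rt_csv), annotations, ["POS", "WordPresentedCharN"])
)
for row in merged:
    print(row["WordPresented"], row["RTWord"], row["POS"])
```

Holding the annotation file in a dictionary and streaming the reading-time rows keeps memory use tied to the smaller file, which may matter given the listed file sizes.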
- Name
- HeCz_demographics.csv
- Size
- 411.22 KB
- Format
- text/csv
- Description
- MD5
- 6a63e6255d1d52291c01624fbd86b981

- Name
- HeCz_demographics_description.txt
- Size
- 3.07 KB
- Format
- text/plain
- Description
- MD5
- b40ec70b22922670fa4354aa1d3284b9

############################################
This is the description of variables used in the data file "HeCz_demographics.csv".

### Participant ###
Participant ID. If the participant took part in both rounds, the same ID is used.

### Round ###
The round of testing with two values: (i) round1 = first testing round, (ii) round2 = second testing round (approximately one month after Round1).

### List ###
The list of items to which the participant was assigned. Altogether, 16 lists were used.

### Gender ###
Participant's gender. Four values were used: (i) female, (ii) male, (iii) nonbinary, (iv) notdisclosed (= participant refused to disclose their gender).

### Age ###
Participant's age in years.

### L2_language ###
Participant's L2 language (the second language which they know the best).

### L2_language_level ###
The reported L2 level, lowest to highest: A1/A2/B1/B2/C1/C2 (values are based on the European reference levels).

### foreign_language_exposure ###
The extent to which the participant is exposed to foreign languages in everyday life: 0 = not at all, 10 = all the time.

### Dyslexia ###
Two values: (i) no = no reported serious problems with reading, (ii) yes = reported serious problems with reading (such as dyslexia).

### VisionCorrection ###
Whether the participant should use vision correction whilst reading, e.g. glasses, contact lenses.

### VisionCorrectionOn ###
Whether the vision correction was actually used during the experiment.

### ReadingFocus ###
Response to the question "When reading, to what extent do you focus on how the text is written and how individual wording affects its meaning?" 1 = not at all, 7 = all the time.

### ReadingImmersion ###
Response to the question "When reading, I experience the text in all its details and try to understand it in detail." 1 = strongly disagree, 7 = strongly agree.

### Noise ###
Response to the question "How noisy is your current environment?" 1 = very quiet, 7 = very loud.

### Alcohol24 ###
Response to the question "Have you drunk alco . . .
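Since the same Participant ID is reused across rounds, the re-tested participants can be found by collecting the Round values seen for each ID. The sketch below is illustrative only: it assumes one row per participant and round in a comma-delimited file with a header, and the sample rows and ID format are invented.

```python
import csv
import io
from collections import defaultdict


def rounds_by_participant(rows):
    """Collect the set of Round values observed for each Participant ID."""
    rounds = defaultdict(set)
    for r in rows:
        rounds[r["Participant"]].add(r["Round"])
    return rounds


def retested(rows):
    """Return sorted IDs of participants who appear in both round1 and round2."""
    both = {"round1", "round2"}
    return sorted(p for p, seen in rounds_by_participant(rows).items() if both <= seen)


# Invented sample standing in for HeCz_demographics.csv (assumption).
demographics_csv = io.StringIO(
    "Participant,Round,List,Age\n"
    "p1,round1,3,24\n"
    "p1,round2,3,24\n"
    "p2,round1,7,31\n"
)
retest_ids = retested(csv.DictReader(demographics_csv))
print(retest_ids)
```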
- Name
- HeCz_response_accuracy_description.txt
- Size
- 1.52 KB
- Format
- text/plain
- Description
- MD5
- ea5b60d28982b3c878eb60d42ce7633e

############################################
This is the description of variables used in the data file HeCz_response_accuracy.csv.

### Participant ###
Participant ID. If the participant took part in both rounds, the same ID is used.

### Round ###
The round of testing with two values: (i) round1 = first testing round, (ii) round2 = second testing round (approximately one month after Round1).

### List ###
The list of items to which the participant was assigned. Altogether, 16 lists were used.

### ItemID ###
Item ID.

### Sentence ###
The wording of the headline used in the data collection.

### Question ###
The wording of the comprehension question used.

### QuesType ###
The comprehension question type. Seven values used: (i) adv = question targeted adverbial information, (ii) attr = question targeted an attribute, (iii) loc = question targeted a location, (iv) obj = question targeted an object, (v) subj = question targeted a subject, (vi) temp = question targeted temporal information, (vii) verb = question targeted a verb.

### AnswerTime ###
Time in ms it took the participant to respond to the question.

### Position ###
Position of the targeted information in the headline (i.e. word number).

### Response ###
Response given to the comprehension question. Three values: (i) 0 = yes, (ii) 1 = no, (iii) 3 = "I do not know".

### CorrectAnswer ###
Correct answer to the comprehension question: (i) yes, (ii) no.

### Correct ###
Response correctness: (i) yes = correct answer, (ii) no = incorrect answer.
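The Response codes (0 = yes, 1 = no, 3 = "I do not know") can be compared against CorrectAnswer to derive per-participant accuracy. The sketch below rests on an assumption the description does not confirm: that an "I do not know" response counts as incorrect. The dataset's own Correct column should be treated as authoritative. Sample rows and IDs are invented.

```python
import csv
import io
from collections import defaultdict

# Response codes as documented; the label for code 3 is a name chosen for this sketch.
RESPONSE_LABELS = {"0": "yes", "1": "no", "3": "dont_know"}


def is_correct(row):
    """True when the decoded response equals CorrectAnswer.

    Assumption: "I do not know" (code 3) never matches, so it is scored as incorrect.
    """
    return RESPONSE_LABELS.get(row["Response"]) == row["CorrectAnswer"]


def accuracy_by_participant(rows):
    """Return proportion correct per (Participant, Round)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in rows:
        key = (r["Participant"], r["Round"])
        totals[key] += 1
        hits[key] += is_correct(r)
    return {k: hits[k] / totals[k] for k in totals}


# Invented sample standing in for HeCz_response_accuracy.csv (assumption).
accuracy_csv = io.StringIO(
    "Participant,Round,ItemID,Response,CorrectAnswer\n"
    "p1,round1,h1,0,yes\n"
    "p1,round1,h2,1,yes\n"
    "p1,round1,h3,3,no\n"
    "p1,round1,h4,1,no\n"
)
accuracy = accuracy_by_participant(csv.DictReader(accuracy_csv))
print(accuracy)
```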
- Name
- HeCz_response_accuracy.csv
- Size
- 63.02 MB
- Format
- text/csv
- Description
- MD5
- 3ae7961fe46e7fdbe0db80b5cbdaa90e

- Name
- HeCz_reaction_times_description.txt
- Size
- 760 B
- Format
- text/plain
- Description
- MD5
- 7220bc90fd847c291d43c8bead38cd9f

############################################
This is the description of variables used in the data file "HeCz_reaction_times.csv".

### Participant ###
Participant ID. If the participant took part in both rounds, the same ID is used.

### Round ###
The round of testing with two values: (i) round1 = first testing round, (ii) round2 = second testing round (approximately one month after Round1).

### ItemID ###
Item ID.

### Sentence ###
The wording of the headline used in the data collection.

### WordNumber ###
The position of the word in the headline (1st word = 1, 2nd word = 2 etc.).

### WordPresented ###
The exact form of the word presented (including punctuation etc.).

### RTWord ###
Reaction time on the given word in ms.
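Word-level RTWord values can be summed into a total reading time per headline presentation. The following sketch streams rows one at a time rather than loading the whole file, which may be worthwhile for a file of several hundred megabytes. It assumes a comma-delimited file with a header and numeric RTWord values; the sample rows and ID formats are invented.

```python
import csv
import io
from collections import defaultdict


def headline_totals(rows):
    """Sum RTWord (ms) over all words of each (Participant, Round, ItemID)."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["Participant"], r["Round"], r["ItemID"])] += float(r["RTWord"])
    return dict(totals)


# Invented sample standing in for HeCz_reaction_times.csv (assumption).
rt_sample = io.StringIO(
    "Participant,Round,ItemID,Sentence,WordNumber,WordPresented,RTWord\n"
    "p1,round1,h1,Vláda schválila,1,Vláda,312\n"
    "p1,round1,h1,Vláda schválila,2,schválila,405\n"
    "p1,round2,h1,Vláda schválila,1,Vláda,280\n"
)
# csv.DictReader yields one row at a time, so only the running totals stay in memory.
totals = headline_totals(csv.DictReader(rt_sample))
print(totals)
```

With the real file, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`; the UTF-8 encoding is itself an assumption.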
- Name
- HeCz_reaction_times.csv
- Size
- 591.34 MB
- Format
- text/csv
- Description
- MD5
- 77fab28974c6c4510d7e7cd7b079dbf2
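
The MD5 values listed for each file can be used to check that a download completed intact. The sketch below is one way to do this with Python's standard `hashlib`, reading in chunks so that large files are not loaded into memory at once. The helper names and the assumption that the files sit in the current directory under their listed names are illustrative.

```python
import hashlib
import os

# Checksums copied from the file listing above (CSV files only).
EXPECTED_MD5 = {
    "HeCz_stimuli_annotation.csv": "a76153b0a355bc3c3053fae948db8190",
    "HeCz_demographics.csv": "6a63e6255d1d52291c01624fbd86b981",
    "HeCz_response_accuracy.csv": "3ae7961fe46e7fdbe0db80b5cbdaa90e",
    "HeCz_reaction_times.csv": "77fab28974c6c4510d7e7cd7b079dbf2",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks by default."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """True when the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        if os.path.exists(name):  # assumed location: current directory
            print(name, "OK" if verify(name, expected) else "MISMATCH")
        else:
            print(name, "not found")
```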


