HeCz: Large Scale Self-Paced Reading Corpus Newspaper Headlines in Czech

Please use the following text to cite this item or export to a predefined format:
Chromý, Jan; Ceháková, Markéta and Brand, James, 2025, HeCz: Large Scale Self-Paced Reading Corpus Newspaper Headlines in Czech, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-6121.
Date issued
2025-11-25
Size
4476489 entries
Language(s)
Description
The HeCz corpus comprises self-paced reading data for 1919 newspaper headlines (23,634 words) in Czech. Each headline is accompanied by a yes–no comprehension question, yielding reading times for each individual word as well as comprehension accuracy. The corpus is novel in the scale of its data collection: 1872 native Czech speakers each read approximately 120 headlines, and 1162 of those participants completed the experiment again in a re-testing session with the same stimuli approximately 1 month later. Participant-level metadata is also available, covering basic demographic information, reading habits, and a profile of participants' mood state prior to completing the experiment. Beyond the behavioral and demographic data, we also include linguistic annotations for several variables, e.g., frequency, surprisal, and morphological tagging.
Acknowledgement
This item is Publicly Available
and licensed under:
 Files in this item
Name
HeCz_stimuli_annotation.csv
Size
6.93 MB
Format
text/csv
Description
MD5
a76153b0a355bc3c3053fae948db8190
Name
HeCz_stimuli_annotation_description.txt
Size
6.03 KB
Format
text/plain
Description
MD5
3390e5eaf57a4f600701ede83ddb8d6a
Preview
  File Preview
    ############################################
    This is the description of variables used in the data file HeCz_stimuli_annotation.csv.
    
    ### ItemID ###
    Item ID.
    
    ### List ###
    Factor indicating which of the 16 randomized headline lists (119–120 items each) the participant was assigned before the reading task.
    
    ### Sentence ###
    The wording of the headline used in the data collection.
    
    ### WordNumber ###
    The position of the word in the headline (1st word = 1, 2nd word = 2 etc.).
    
    ### WordPresented ###
    The exact form of the word presented (including punctuation etc.). 
    
    ### WordClean_lc ###
    Word presented in the task, but in lower case and cleaned from punctuation and other symbols such as quotation marks, brackets etc.
    
    ### Lemma ###
    Lemma of the word.
    
    ### Lemma_lc ###
    Lemma of the word in lowercase.
    
    ### Multiword ###
    If "yes", then the word is part of a multiword sequence. Typically, these are multiword proper nouns (e.g. Von Der Leyen, Washington Post, O2 arena, Premier League).
    
    ### LemmaMultiword ###
    The lemma of the multiword sequence.
    
    ### WordPresentedCharN ###
    Number of characters of WordPresented.
    
    ### HeadlineWordN ###
    Length of the headline in words.
    
    ### POS ###
    Part of speech or more precisely type of word. 13 values: (i) abbreviation = abbreviation or acronym, (ii) adjective, (iii) adverb, (iv) conjunction, (v) interjection, (vi) multiple = part of a multiword sequence, (vii) noun, (viii) numeral = numeral or number, (ix) particle, (x) preposition, (xi) pronoun, (xii) symbol = specific character(s) such as %, dash etc., (xiii) verb.
    
    ### GramGender ###
    Grammatical gender of the word. Coded for adjectives, nouns, verb forms which express gender, pronouns which express gender, and numerals which express gender. Four values: (i) masc = masculine, (ii) fem = feminine, (iii) neut = neuter, (iv) indeclinable = used for a specific type of borrowed adjective which does not carry a gender feature (such as kardio, online, fake).
    
    ### Number ###
    Grammatical . . .
Name
HeCz_demographics.csv
Size
411.22 KB
Format
text/csv
Description
MD5
6a63e6255d1d52291c01624fbd86b981
Name
HeCz_demographics_description.txt
Size
3.07 KB
Format
text/plain
Description
MD5
b40ec70b22922670fa4354aa1d3284b9
Preview
  File Preview
    ############################################
    This is the description of variables used in the data file "HeCz_demographics.csv". 
    
    ### Participant ###
    Participant ID. If the participant took part in both rounds, the same ID is used.
    
    ### Round ###
    The round of testing with two values: (i) round1 = first testing round, (ii) round2 = second testing round (approximately one month after Round1).
    
    ### List ###
    The list of items to which the participant was assigned. Altogether, 16 lists were used.
    
    ### Gender ###
    Participant's gender. Four values were used: (i) female, (ii) male, (iii) nonbinary, (iv) notdisclosed (= participant refused to disclose their gender).
    
    ### Age ###
    Participant's age in years.
    
    ### L2_language ###
    Participant's L2 language (the second language which they know the best).
    
    ### L2_language_level ###
    The reported L2 level, from lowest to highest: A1/A2/B1/B2/C1/C2 (values are based on the European reference levels).
    
    ### foreign_language_exposure ###
    The extent to which the participant is exposed to foreign languages in everyday life, from 0 = not at all to 10 = all the time.
    
    ### Dyslexia ###
    Two values: (i) no = no reported serious problems with reading, (ii) yes = reported serious problems with reading (such as dyslexia).
    
    ### VisionCorrection ###
    Whether the participant should use vision correction whilst reading, e.g. glasses, contact lenses.
    
    ### VisionCorrectionOn ###
    Whether the vision correction was actually used during the experiment.
    
    ### ReadingFocus ###
    Response to question "When reading, to what extent do you focus on how the text is written and how individual wording affects its meaning?" 1 = not at all, 7 = all the time.
    
    ### ReadingImmersion ###
    Response to question "When reading, I experience the text in all its details and try to understand it in detail." 1 = strongly disagree, 7 = strongly agree.
    
    ### Noise ###
    Response to question "How noisy is your current environment?" 1 = very quiet, 7 = very loud.
    
    ### Alcohol24 ###
    Response to question "Have you drunk alco . . .
Name
HeCz_response_accuracy_description.txt
Size
1.52 KB
Format
text/plain
Description
MD5
ea5b60d28982b3c878eb60d42ce7633e
Preview
  File Preview
    ############################################
    This is the description of variables used in the data file HeCz_response_accuracy.csv. 
    
    ### Participant ###
    Participant ID. If the participant took part in both rounds, the same ID is used.
    
    ### Round ###
    The round of testing with two values: (i) round1 = first testing round, (ii) round2 = second testing round (approximately one month after Round1).
    
    ### List ###
    The list of items to which the participant was assigned. Altogether, 16 lists were used.
    
    ### ItemID ###
    Item ID.
    
    ### Sentence ###
    The wording of the headline used in the data collection.
    
    ### Question ###
    The wording of the comprehension question used.
    
    ### QuesType ###
    The comprehension question type. Seven values used: (i) adv = question targeted adverbial information, (ii) attr = question targeted attribute, (iii) loc = question targeted a location, (iv) obj = question targeted an object, (v) subj = question targeted a subject, (vi) temp = question targeted temporal information, (vii) verb = question targeted a verb.
    
    ### AnswerTime ###
    Time in ms it took the participant to respond to the question.
    
    ### Position ###
    Position of the targeted information in the headline (i.e. word number).
    
    ### Response ###
    Response given to the comprehension question. Three values: (i) 0 = yes, (ii) 1 = no, (iii) 3 = "I do not know".
    
    ### CorrectAnswer ###
    Correct answer to the comprehension question: (i) yes, (ii) no.
    
    ### Correct ###
    Response correctness: (i) yes = correct answer, (ii) no = incorrect answer.
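
The variables described above are enough to compute comprehension accuracy per participant and round, and to cross-check the Correct column against Response and CorrectAnswer. The following is a minimal Python sketch, not the authors' analysis code. It assumes that HeCz_response_accuracy.csv is a comma-delimited UTF-8 file with a header row using the variable names above, and that the values are stored literally as described ("yes"/"no" for Correct and CorrectAnswer, 0/1/3 for Response). The function names are hypothetical.

```python
import csv
from collections import defaultdict

# Response codes as given in the variable description above.
RESPONSE_LABELS = {"0": "yes", "1": "no", "3": "I do not know"}


def accuracy_by_participant(rows):
    """Return {(Participant, Round): proportion of rows with Correct == 'yes'}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [n_correct, n_total]
    for row in rows:
        key = (row["Participant"], row["Round"])
        counts[key][1] += 1
        if row["Correct"].strip().lower() == "yes":
            counts[key][0] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


def inconsistent_rows(rows):
    """Yield rows whose Correct flag disagrees with Response vs. CorrectAnswer."""
    for row in rows:
        given = RESPONSE_LABELS.get(row["Response"].strip())
        # "I do not know" (or an unknown code) never matches yes/no.
        expected = "yes" if given == row["CorrectAnswer"].strip().lower() else "no"
        if expected != row["Correct"].strip().lower():
            yield row


def load_rows(path):
    """Read the CSV into a list of dicts (assumed: UTF-8, comma-delimited, header row)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `accuracy_by_participant(load_rows("HeCz_response_accuracy.csv"))` would give one accuracy value per participant and testing round, provided the file layout matches the assumptions above.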
    
Name
HeCz_response_accuracy.csv
Size
63.02 MB
Format
text/csv
Description
MD5
3ae7961fe46e7fdbe0db80b5cbdaa90e
Name
HeCz_reaction_times_description.txt
Size
760 B
Format
text/plain
Description
MD5
7220bc90fd847c291d43c8bead38cd9f
Preview
  File Preview
    ############################################
    This is the description of variables used in the data file "HeCz_reaction_times.csv". 
    
    ### Participant ###
    Participant ID. If the participant took part in both rounds, the same ID is used.
    
    ### Round ###
    The round of testing with two values: (i) round1 = first testing round, (ii) round2 = second testing round (approximately one month after Round1).
    
    ### ItemID ###
    Item ID.
    
    ### Sentence ###
    The wording of the headline used in the data collection.
    
    ### WordNumber ###
    The position of the word in the headline (1st word = 1, 2nd word = 2 etc.).
    
    ### WordPresented ###
    The exact form of the word presented (including punctuation etc.). 
    
    ### RTWord ###
    Reaction time on the given word in ms.
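
Because HeCz_reaction_times.csv is large (591.34 MB), it can be convenient to stream it row by row rather than load it whole. The sketch below, in Python, averages RTWord per word and attaches the result to the rows of HeCz_stimuli_annotation.csv via the ItemID and WordNumber variables that both descriptions list. It assumes comma-delimited UTF-8 files with header rows named as in the descriptions, and that ItemID is written identically in both files. The function names and the added MeanRT field are hypothetical.

```python
import csv
import math
from collections import defaultdict


def mean_rt_per_word(rows, round_label=None):
    """Return {(ItemID, WordNumber): mean RTWord in ms}, optionally for one round."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of RTs, number of RTs]
    for row in rows:
        if round_label is not None and row["Round"] != round_label:
            continue
        try:
            key = (row["ItemID"], int(row["WordNumber"]))
            rt = float(row["RTWord"])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        if not math.isfinite(rt):
            continue
        totals[key][0] += rt
        totals[key][1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


def attach_mean_rt(annotation_rows, mean_rt):
    """Yield annotation rows with an added MeanRT field (None if no data)."""
    for row in annotation_rows:
        key = (row["ItemID"], int(row["WordNumber"]))
        yield {**row, "MeanRT": mean_rt.get(key)}


def stream_csv(path):
    """Yield rows one at a time so the large file is never fully loaded."""
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)
```

A possible call would be `mean_rt_per_word(stream_csv("HeCz_reaction_times.csv"), round_label="round1")`, followed by `attach_mean_rt(stream_csv("HeCz_stimuli_annotation.csv"), mean_rt)` to pair the averages with the linguistic annotations.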
    
Name
HeCz_reaction_times.csv
Size
591.34 MB
Format
text/csv
Description
MD5
77fab28974c6c4510d7e7cd7b079dbf2
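
The MD5 values listed for each file can be used to confirm that a download is complete and unmodified. The following Python sketch hashes each file in chunks, which matters for the larger CSV files, and compares the result with the values from the listing above. It assumes the files were saved under their listed names in a single directory; the function names are hypothetical.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing above.
EXPECTED_MD5 = {
    "HeCz_stimuli_annotation.csv": "a76153b0a355bc3c3053fae948db8190",
    "HeCz_stimuli_annotation_description.txt": "3390e5eaf57a4f600701ede83ddb8d6a",
    "HeCz_demographics.csv": "6a63e6255d1d52291c01624fbd86b981",
    "HeCz_demographics_description.txt": "b40ec70b22922670fa4354aa1d3284b9",
    "HeCz_response_accuracy_description.txt": "ea5b60d28982b3c878eb60d42ce7633e",
    "HeCz_response_accuracy.csv": "3ae7961fe46e7fdbe0db80b5cbdaa90e",
    "HeCz_reaction_times_description.txt": "7220bc90fd847c291d43c8bead38cd9f",
    "HeCz_reaction_times.csv": "77fab28974c6c4510d7e7cd7b079dbf2",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Return {filename: 'ok' | 'mismatch' | 'missing'} for each expected file."""
    results = {}
    for name, md5 in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == md5.lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `verify_downloads(".")` in the download directory returns a status per file; any "mismatch" suggests the file should be downloaded again.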