In this paper, we deal with coreference in Czech complex predicates with light verbs. The syntactic structure of these complex predicates is typically formed by valency complementations of both the light verb and the predicative noun, and its analysis poses a serious challenge for all syntactic theories. We demonstrate that coreference between verbal and nominal valency complementations plays a substantial role in forming both the deep and the surface syntactic structure of these predicates. A description of the individual types of coreference within complex predicates is thus essential for a comprehensive analysis of their syntactic behavior. In an extensive data annotation project, we identify, delimit and thoroughly describe the individual types of coreferential relations between valency complementations of Czech complex predicates.
LiFR-Law is a corpus of Czech legal and administrative texts with measured reading comprehension and a subjective expert annotation of diverse textual properties based on the Hamburg Comprehensibility Concept (Langer, Schulz von Thun, Tausch, 1974). It has been built as a pilot data set to explore the Linguistic Factors of Readability (hence the LiFR acronym) in Czech administrative and legal texts by modeling their correlation with actually observed reading comprehension. The corpus comprises 18 documents in total: six different texts from the legal/administrative domain, each in three versions (the original and two paraphrases). Each such document triple shares one reading-comprehension test, administered to at least thirty readers of random gender, educational background, and age. The data set also captures basic demographic information about each reader, their familiarity with the topic, and their subjective assessment of the stylistic properties of the given document, roughly corresponding to the key text properties identified by the Hamburg Comprehensibility Concept.
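The corpus layout described above (six texts, three versions each, one shared test per text) can be sketched as a small grid. This is only an illustration: the identifiers `text1` to `text6`, the version labels, and the `test_…` names are hypothetical placeholders, not the corpus's actual file names.

```python
from itertools import product

# Hypothetical identifiers; the real file names in the corpus may differ.
TEXTS = [f"text{i}" for i in range(1, 7)]                # six source texts
VERSIONS = ["original", "paraphrase1", "paraphrase2"]    # three versions each

# Every (text, version) pair is one document in the corpus.
documents = list(product(TEXTS, VERSIONS))

# All three versions of a text share the same reading-comprehension test.
test_for = {(text, version): f"test_{text}" for text, version in documents}

print(len(documents), "documents,", len(set(test_for.values())), "tests")
```

Keying the test by the source text rather than by the individual document mirrors the design in which a document triple is evaluated with one common test, so results for the original and its paraphrases are directly comparable.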
Changes to the previous version and helpful comments
• File names of the comprehension test results (self-explanatory)
• Corrected one erroneous automatic evaluation rule in the multiple-choice evaluation (zahradnici_3,
TRUE and FALSE had been swapped)
• Evaluation protocols for both question types added to the folder lifr_formr_study_design
• Data has been cleaned: empty responses to multiple-choice questions were re-inserted. A survey is now
considered complete if the reader’s subjective text evaluation is complete (these questions were placed
at the very end of each survey).
• Only complete surveys (all 7 content questions answered) are represented. We dropped the replies of
six users who did not complete their surveys.
• A few missing responses to open questions have been detected and re-inserted.
• The demographic data contain all respondents who filled in the informed consent and the demographic
details, with respondents who did not complete any test survey (but provided their demographic
details) in a separate file. All other data have been cleaned to contain only responses by the regular
respondents (at least one completed survey).
Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features based on the Hamburg Comprehensibility Concept
ParaDi 2.0 is a dictionary of single-verb paraphrases of Czech verbal multiword expressions: light verb constructions and idiomatic verb constructions. Moreover, it provides an elaborate set of morphological, syntactic and semantic features, including information on aspectual counterparts of verbs and on the paraphrasability conditions of given verbs.
The format of ParaDi has been designed with both human and machine readability in mind: the dictionary is represented as a plain table in TSV format, a flexible and language-independent data format.
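Because the dictionary is a plain TSV table, it can be read with standard tooling. The following is a minimal sketch using Python's `csv` module. The column names (`mwe`, `paraphrase`, `type`) and the two sample rows are made up for illustration and are not taken from the actual ParaDi header or data.

```python
import csv
import io

# Hypothetical sample: column names and rows are illustrative only,
# not the real ParaDi 2.0 header or entries.
SAMPLE_TSV = (
    "mwe\tparaphrase\ttype\n"
    "dát polibek\tpolíbit\tLVC\n"
    "mít radost\tradovat se\tLVC\n"
)

def load_tsv(handle):
    """Read a tab-separated table with a header row into a list of dicts."""
    # QUOTE_NONE treats quote characters as ordinary text, which suits plain TSV.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

rows = load_tsv(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row["mwe"], "->", row["paraphrase"])
```

For a real file, the `io.StringIO` object would be replaced by `open(path, encoding="utf-8", newline="")`, with `path` pointing at the downloaded table.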
A richly annotated and genre-diversified language resource, the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, or PDT-C for short) is a consolidated release of the existing PDT corpora of Czech data, uniformly annotated using the standard PDT scheme. The PDT corpora included in PDT-C are: the Prague Dependency Treebank (the original PDT contents, written newspaper and journal texts from three genres); the Czech part of the Prague Czech-English Dependency Treebank (financial texts translated from English); the Prague Dependency Treebank of Spoken Czech (spoken data, including audio, transcripts and multiple speech reconstruction annotation); and PDT-Faust (user-generated texts). The differences from the separately published original treebanks can be briefly described as follows: the data is published in one package, to allow easier handling of all the datasets; the data is enhanced with manual linguistic annotation at the morphological layer, and a new version of the morphological dictionary is enclosed; and a common valency lexicon for all four original parts is enclosed. The documentation describes two desktop tools for browsing and editing the data (TrEd and MEd), and the corpus is also available online for searching with PML-TQ.
PDT 3.0 is a new version of the Prague Dependency Treebank. It contains a large amount of Czech text with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and semantic (0.8 MW) annotation; in addition, certain properties of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations are annotated at the semantic level.
Supported by the Grant Agency of the Czech Republic: grants P406/12/0658 "Coreference, discourse relations and information structure in a contrastive perspective", P406/2010/0875 "Computational Linguistics: Explicit description of language and annotated data focused on Czech", 405/09/0729 "From the structure of a sentence to textual relationships", and GPP406/12/P175 "Selected derivational relations for automatic processing of Czech";
the Ministry of Education, Youth and Sports of the Czech Republic: the KONTAKT project ME10018 "Towards a computational analysis of text structure" and the LINDAT-Clarin project LM2010013;
the Grant Agency of Charles University in Prague: GAUK 103609 "Textual (Inter-sentential) Relations and their Representation in a Language Corpus" and GAUK 4383/2009 "Methods of coreference resolution".