FicTree 1.0

Please use the following text to cite this item:
Jelínek, Tomáš; Hnátková, Milena and Skoumalová, Hana, 2017, FicTree 1.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-2517.
Date issued
2017-11-15
Size
12760 sentences
Language(s)
Czech
Description
FicTree is a dependency treebank of Czech fiction, manually annotated in the format of the analytical layer of the Prague Dependency Treebank (PDT). The treebank consists of 12,760 sentences (166,432 tokens). The texts come from eight literary works published in the Czech Republic between 1991 and 2007. The syntactic annotation was first produced automatically by two distinct parsers (MSTParser and MaltParser) trained on the PDT training data, then manually corrected. Any differences between the two versions were resolved manually by another annotator. The corpus is provided in a vertical format, in which sentence boundaries are marked with a blank line. Every word form is written on a separate line, followed by five tab-separated attributes: lemma, tag, ID (word index in the sentence), head and deprel (analytical function, afun in the PDT formalism). The texts are shuffled in random chunks of at most 100 words (respecting sentence boundaries). Each chunk is provided as a separate file, with the suggested division into train, dev and test sets given as a file name prefix.
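As an illustration of the vertical format described above, the following is a minimal sketch of a reader, not an official tool. It assumes six tab-separated columns per token line (word form, lemma, tag, ID, head, deprel), that ID and head are integers, and that a head of 0 marks the sentence root. The names `Token` and `read_sentences` are hypothetical, and the sample tokens are made up, with `TAG` standing in for a real morphological tag.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    form: str    # word form
    lemma: str
    tag: str     # morphological tag
    idx: int     # word index in the sentence
    head: int    # index of the governing word (0 assumed to mean root)
    deprel: str  # analytical function (afun)


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence; blank lines separate sentences."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}: {line!r}")
        form, lemma, tag, idx, head, deprel = fields
        sentence.append(Token(form, lemma, tag, int(idx), int(head), deprel))
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence


# Made-up example input; "TAG" is a placeholder, not a real PDT tag.
SAMPLE = "Pes\tpes\tTAG\t1\t2\tSb\nspí\tspát\tTAG\t2\t0\tPred\n\n"
for sent in read_sentences(SAMPLE.splitlines()):
    print([(tok.form, tok.head, tok.deprel) for tok in sent])
```

For a real chunk file, the same function can be applied to an open file handle, for example `read_sentences(open(path, encoding="utf-8"))`, assuming the files are UTF-8 encoded.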
 Files in this item
Name
FicTree.tgz
Size
1.44 MB
Format
application/x-gzip
Description
gzip Archive
MD5
41a9c8782f14df23477f7a1caa67a74e
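The MD5 value above can be used to check a downloaded copy of the archive. The following is a minimal sketch using Python's standard `hashlib`; the helper name `md5_of_file` is hypothetical, and the local path `FicTree.tgz` assumes the archive was saved under its listed name in the current directory.

```python
import hashlib

# Checksum as listed for FicTree.tgz in this record.
EXPECTED_MD5 = "41a9c8782f14df23477f7a1caa67a74e"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A download can then be checked with `md5_of_file("FicTree.tgz") == EXPECTED_MD5`; a mismatch suggests an incomplete or corrupted download.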