FicTree 1.0
Please use the following text to cite this item:
Jelínek, Tomáš; Hnátková, Milena and Skoumalová, Hana, 2017,
FicTree 1.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-2517.
Date issued
2017-11-15
Size
12760 sentences
Description
FicTree is a dependency treebank of Czech fiction, manually annotated in the format of the analytical layer of the Prague Dependency Treebank (PDT). The treebank consists of 12,760 sentences (166,432 tokens). The texts come from eight literary works published in the Czech Republic between 1991 and 2007. The syntactic annotation was first produced automatically by two distinct parsers (MSTParser and MaltParser) trained on the PDT training data, and the parser output was then manually corrected. Any differences between the two versions were resolved manually by another annotator.
The corpus is provided in a vertical format, in which sentence boundaries are marked with a blank line. Every word form is written on a separate line, followed by five tab-separated attributes: lemma, tag, ID (word index in the sentence), head, and deprel (the analytical function, called afun in the PDT formalism). The texts are shuffled in random chunks of at most 100 words (respecting sentence boundaries). Each chunk is provided as a separate file, and the suggested division into train, dev and test sets is indicated by the file name prefix.
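The vertical format described above can be read with a few lines of code. The following Python sketch is only an illustration under stated assumptions: that each token line has exactly six tab-separated columns (word form, lemma, tag, ID, head, deprel), that ID and head are integers, and that the files are UTF-8 text. The names `Token` and `parse_vertical` are hypothetical, and the sample sentences and the `TAG` placeholder are invented for the example rather than taken from the corpus.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    """One token line of the vertical format (column order assumed)."""
    form: str
    lemma: str
    tag: str
    id: int      # word index in the sentence
    head: int    # index of the governing word
    deprel: str  # analytical function (afun)


def parse_vertical(lines: Iterable[str]) -> List[List[Token]]:
    """Group token lines into sentences; a blank line ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        form, lemma, tag, tok_id, head, deprel = line.split("\t")
        current.append(Token(form, lemma, tag, int(tok_id), int(head), deprel))
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


# Invented sample data; "TAG" stands in for a real morphological tag.
SAMPLE = (
    "Pes\tpes\tTAG\t1\t2\tSb\n"
    "spí\tspát\tTAG\t2\t0\tPred\n"
    ".\t.\tTAG\t3\t0\tAuxK\n"
    "\n"
    "Prší\tpršet\tTAG\t1\t0\tPred\n"
)

sentences = parse_vertical(SAMPLE.splitlines())
for sentence in sentences:
    print(" ".join(token.form for token in sentence))
```

To read a real chunk, the same function could be given an open file object, for example `parse_vertical(open(path, encoding="utf-8"))`, assuming the encoding above holds.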
Acknowledgement
Ministerstvo školství, mládeže a tělovýchovy
Project code: LM2015044
Project name: Český národní korpus
This item is publicly available.
Files in this item
- Name: FicTree.tgz
- Size: 1.44 MB
- Format: application/x-gzip
- Description: gzip Archive
- MD5: 41a9c8782f14df23477f7a1caa67a74e
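
The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. The Python sketch below is one possible way to do so, assuming the archive has been saved locally as `FicTree.tgz` in the current directory; the helper name `md5_of_file` and the local path are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Checksum as given in the file listing for FicTree.tgz.
EXPECTED_MD5 = "41a9c8782f14df23477f7a1caa67a74e"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


archive = Path("FicTree.tgz")  # assumed local download location
if archive.exists():
    if md5_of_file(archive) == EXPECTED_MD5:
        print("checksum OK")
    else:
        print("checksum mismatch")
```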


