Czech RST Discourse Treebank 1.0
Please use the following text to cite this item:
Poláková, Lucie; Zikánová, Šárka; Mírovský, Jiří and Hajičová, Eva, 2023,
Czech RST Discourse Treebank 1.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-5174.
Date issued
2023-06-30
Size
54 articles,
901 sentences,
14514 tokens
Description
The Czech RST Discourse Treebank 1.0 (CzRST-DT 1.0) is a dataset of 54 Czech journalistic texts manually annotated within the framework of Rhetorical Structure Theory (RST). Each text document in the treebank is represented as a single tree-like structure whose nodes (discourse units) are interconnected through hierarchical rhetorical relations.
The dataset also contains concurrent annotations of five double-annotated documents.
The original texts are a part of the data annotated in the Prague Dependency Treebank, although the two projects are independent.
Acknowledgement
The Grant Agency of the Czech Republic
Project code: 20-09853S
Project name: Global Coherence of Czech Texts in the Corpus-Based Perspective
This item is Publicly Available.
Files in this item
- Name: README.TXT
- Size: 3.9 KB
- Format: text/plain
- Description: Text
- MD5: 04ec03e6206fb2ba96141b2c1967eabc

===============================================
Czech RST Discourse Treebank 1.0 (CzRST-DT 1.0)
===============================================
Authors
=======
Lucie Poláková (Charles University, Faculty of Mathematics and Physics),
Šárka Zikánová (Charles University, Faculty of Mathematics and Physics),
Jiří Mírovský (Charles University, Faculty of Mathematics and Physics),
Eva Hajičová (Charles University, Faculty of Mathematics and Physics)
Introduction
============
The Czech RST Discourse Treebank 1.0 (CzRST-DT 1.0, Poláková et al., 2023)
is a dataset of 54 Czech journalistic texts manually annotated using
the Rhetorical Structure Theory (RST; Mann and Thompson, 1988).
Each text document in the treebank is represented as a single tree-like
structure whose nodes (discourse units) are interconnected through
hierarchical rhetorical relations.
The dataset also contains concurrent annotations of five double-annotated
documents.
The original texts are a part of the data annotated in the Prague Dependency
Treebank (Hajič et al., 2020), although the two projects are independent.
Please visit https://ufal.mff.cuni.cz/czrst-dt1.0 for detailed and
updated information about the corpus.
Data Format
===========
The data are located in the directory "data", in the
following subdirectories:
TXT - original texts
RS3 - RST annotations of the texts in RS3 format
IAA - double-annotated documents in two versions:
  - pre-curated (note: the curated version is in directory RS3)
  - pre-curated and modified to one tree (for IAA measurement)
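
The directory layout above suggests that each document has one file in TXT and one in RS3 sharing the same base name. The following is a minimal sketch, under that assumption, of how the two directories could be paired after unpacking the archive. The function name pair_documents and the path in the usage note are hypothetical and not part of the corpus distribution.

```python
from pathlib import Path


def pair_documents(data_dir):
    """Map each document ID to its (TXT path, RS3 path) pair.

    Assumes the layout described in the README: data_dir contains
    the subdirectories TXT and RS3, and files share a base name.
    """
    data_dir = Path(data_dir)
    txt = {p.stem: p for p in (data_dir / "TXT").glob("*.txt")}
    rs3 = {p.stem: p for p in (data_dir / "RS3").glob("*.rs3")}

    # Report any document present in only one of the two directories.
    unmatched = sorted(set(txt) ^ set(rs3))
    if unmatched:
        raise ValueError(f"documents missing a counterpart: {unmatched}")

    return {doc_id: (txt[doc_id], rs3[doc_id]) for doc_id in sorted(txt)}
```

A call such as pair_documents("CzRST-DT_1.0/data") (path assumed, adjust to where the archive was unpacked) would return one entry per document, or raise an error if the two directories do not match.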
How to get and browse the data
==============================
The data can be downloaded from the LINDAT/CLARIAH-CZ
repository: http://hdl.handle.net/11234/1-5174,
see the licence below.
The data can be opened using the RSTWeb annotation
tool (Gessler et al., 2019):
https://gucorpling.org/rstweb/info/
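
For programmatic inspection, as an alternative to opening the files in RSTWeb, an RS3 file can also be read as XML. The sketch below assumes the common RS3 layout (a "body" element containing "segment" and "group" elements with "id", "parent" and "relname" attributes); whether every file in this corpus follows exactly this layout has not been verified here. The SAMPLE fragment and its relation name are invented for illustration and are not taken from the corpus.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical RS3 fragment for illustration only (not corpus data).
SAMPLE = """<rst>
  <header>
    <relations>
      <rel name="elaboration" type="rst"/>
    </relations>
  </header>
  <body>
    <segment id="1" parent="3" relname="span">First unit.</segment>
    <segment id="2" parent="1" relname="elaboration">Second unit.</segment>
    <group id="3" type="span"/>
  </body>
</rst>"""


def read_rs3(source):
    """Return the discourse nodes of an RS3 document as a list of dicts.

    `source` is a file name or a binary file object, as accepted by
    xml.etree.ElementTree.parse.
    """
    root = ET.parse(source).getroot()
    body = root.find("body")
    if body is None:
        raise ValueError("no <body> element found; not the assumed RS3 layout")

    nodes = []
    for el in body:
        if el.tag not in ("segment", "group"):
            continue
        nodes.append({
            "id": el.get("id"),
            "kind": el.tag,                 # "segment" (leaf) or "group" (span)
            "parent": el.get("parent"),     # None for the root node
            "relname": el.get("relname"),
            "text": (el.text or "").strip() if el.tag == "segment" else None,
        })
    return nodes


nodes = read_rs3(io.BytesIO(SAMPLE.encode("utf-8")))
for node in nodes:
    print(node["id"], node["kind"], node["parent"], node["relname"])
```

The same function can be given the path of a file from the RS3 directory instead of the in-memory sample.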
Citation
========
Please cite CzRST-DT 1.0 when using the corpus for your research:
Lucie Poláková, Šárka Zikánová, Jiří M . . .

- Name: CzRST-DT_1.0.zip
- Size: 203.38 KB
- Format: application/zip
- Description: Zip
- MD5: 93b2a2beab1ff13f7dd652fa5de74bfb
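
The MD5 value listed for the archive can be used to check that a download is complete. The following is a minimal sketch using only the Python standard library; the helper names md5_of_file and verify are hypothetical, and the local file name in the usage note is an assumption about where the archive was saved.

```python
import hashlib

# MD5 of CzRST-DT_1.0.zip as given in the file listing of this item.
EXPECTED_MD5 = "93b2a2beab1ff13f7dd652fa5de74bfb"


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, verify("CzRST-DT_1.0.zip") should return True for an intact download. MD5 is suitable here as an integrity check against transfer errors, not as protection against deliberate tampering.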

- CzRST-DT_1.0
  - README.TXT (3 kB)
  - data
    - IAA
    - RS3
      - ln94207_39.rs3 (3 kB)
      - mf920925_021.rs3 (4 kB)
      - lnd94103_003.rs3 (2 kB)
      - cmpr9413_017.rs3 (7 kB)
      - lnd94103_063.rs3 (11 kB)
      - ln95049_086.rs3 (5 kB)
      - ln95048_056.rs3 (8 kB)
      - ln94202_49.rs3 (3 kB)
      - ln94200_8.rs3 (2 kB)
      - mf930713_099.rs3 (5 kB)
      - ln94207_83.rs3 (11 kB)
      - mf930713_055.rs3 (6 kB)
      - ln95047_134.rs3 (5 kB)
      - ln94200_112.rs3 (5 kB)
      - ln95048_140.rs3 (4 kB)
      - ln94203_145.rs3 (7 kB)
      - ln94202_135.rs3 (4 kB)
      - mf920922_138.rs3 (3 kB)
      - cmpr9415_032.rs3 (4 kB)
      - cmpr9410_047.rs3 (12 kB)
      - ln94200_84.rs3 (3 kB)
      - ln95048_055.rs3 (4 kB)
      - ln94207_54.rs3 (11 kB)
      - cmpr9413_004.rs3 (4 kB)
      - lnd94103_129.rs3 (3 kB)
      - ln94200_167.rs3 (3 kB)
      - mf920925_087.rs3 (4 kB)
      - mf930713_110.rs3 (11 kB)
      - lnd94103_013.rs3 (4 kB)
      - lnd94103_145.rs3 (6 kB)
      - lnd94103_033.rs3 (3 kB)
      - ln95048_058.rs3 (5 kB)
      - mf930709_087.rs3 (8 kB)
      - mf920925_018.rs3 (5 kB)
      - mf920922_105.rs3 (9 kB)
      - ln94203_100.rs3 (5 kB)
      - lnd94103_053.rs3 (3 kB)
      - ln95049_100.rs3 (8 kB)
      - ln94210_147.rs3 (7 kB)
      - mf930709_083.rs3 (3 kB)
      - mf920925_114.rs3 (4 kB)
      - ln95049_019.rs3 (3 kB)
      - ln95048_122.rs3 (3 kB)
      - mf930713_013.rs3 (4 kB)
      - mf920922_133.rs3 (3 kB)
      - ln94209_45.rs3 (9 kB)
      - mf930709_058.rs3 (4 kB)
      - ln94207_16.rs3 (8 kB)
      - ln94203_43.rs3 (4 kB)
      - ln94200_170.rs3 (11 kB)
      - cmpr9413_026.rs3 (3 kB)
      - ln94208_143.rs3 (3 kB)
      - cmpr9413_034.rs3 (8 kB)
      - ln94206_47.rs3 (7 kB)
    - TXT
      - lnd94103_013.txt (1 kB)
      - lnd94103_145.txt (1 kB)
      - lnd94103_033.txt (733 B)
      - mf930709_087.txt (2 kB)
      - ln95048_058.txt (1 kB)
      - ln94203_100.txt (1 kB)
      - mf920925_018.txt (1 kB)
      - mf920922_105.txt (2 kB)
      - lnd94103_053.txt (1008 B)
      - ln95049_100.txt (1 kB)
      - ln94210_147.txt (2 kB)
      - mf930709_083.txt (944 B)
      - mf920925_114.txt (997 B)
      - ln95048_122.txt (622 B)
      - ln95049_019.txt (821 B)
      - mf920922_133.txt (595 B)
      - mf930713_013.txt (1 kB)
      - ln94209_45.txt (3 kB)
      - mf930709_058.txt (883 B)
      - ln94207_16.txt (2 kB)
      - ln94203_43.txt (1 kB)
      - ln94200_170.txt (3 kB)
      - cmpr9413_026.txt (709 B)
      - ln94208_143.txt (1 kB)
      - cmpr9413_034.txt (2 kB)
      - ln94206_47.txt (1 kB)
      - ln94207_39.txt (940 B)
      - lnd94103_003.txt (684 B)
      - mf920925_021.txt (1 kB)
      - lnd94103_063.txt (4 kB)
      - cmpr9413_017.txt (2 kB)
      - ln95049_086.txt (1 kB)
      - ln94202_49.txt (909 B)
      - ln95048_056.txt (2 kB)
      - ln94200_8.txt (735 B)
      - mf930713_099.txt (967 B)
      - ln94207_83.txt (3 kB)
      - mf930713_055.txt (2 kB)
      - ln95047_134.txt (1 kB)
      - ln94200_112.txt (1 kB)
      - ln95048_140.txt (1 kB)
      - ln94203_145.txt (2 kB)
      - ln94202_135.txt (1 kB)
      - mf920922_138.txt (801 B)
      - cmpr9415_032.txt (1 kB)
      - cmpr9410_047.txt (4 kB)
      - ln94200_84.txt (1 kB)
      - ln94207_54.txt (3 kB)
      - ln95048_055.txt (960 B)
      - cmpr9413_004.txt (1 kB)
      - lnd94103_129.txt (568 B)
      - mf920925_087.txt (1 kB)
      - ln94200_167.txt (1 kB)
      - mf930713_110.txt (3 kB)

