PDiT-EDA 1.0 (Zikánová et al., 2018) is a treebank with rich annotation of discourse phenomena, developed in 2017–2018 within the project Implicitní vztahy v textové koherenci (Implicit relations in text coherence), i.e. project GA17-03461S of the Grant Agency of the Czech Republic.
The corpus contains extended annotation of discourse relations for a subset of the Prague Discourse Treebank 2.0 (Rysová et al., 2016), a large corpus manually annotated with explicit discourse relations. It newly adds implicit relations, entity-based relations, question-answer relations and other discourse-structuring phenomena.
PDiT-EDA 1.0 was published on December 20, 2018 in the Lindat/Clarin repository.
PDiT-EDA 1.0 can be downloaded as a single zip archive from the LINDAT-Clarin repository. It is publicly available under the Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
After unzipping the downloaded archive, the data can be found in the directory data, where they are further divided into fifteen subdirectories representing individual genres (advice column, collection, comment, critical review, description, invitation, letters from readers, news report, overview, personality-focused interview, readers' survey, reflective essay, sports news, topical interview, weather forecast). The annotation of each document is stored in four interlinked files, one per annotation layer: word layer (files *.w.gz), morphological layer (*.m.gz), analytical layer (*.a.gz), and tectogrammatical layer (*.t.gz).
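The directory layout described above can be checked with a short script. The following is a minimal sketch, not part of the PDiT-EDA distribution: it assumes that the four files of one document share a base name and differ only in the layer letter (<document>.<layer>.gz), and the function names group_by_document and missing_layers are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Layer letters taken from the file extensions *.w.gz, *.m.gz, *.a.gz, *.t.gz.
LAYERS = ("w", "m", "a", "t")


def group_by_document(data_dir):
    """Map (genre, document id) to {layer: path}.

    Assumption: genre subdirectories sit directly under data_dir and files
    are named <document>.<layer>.gz.
    """
    documents = defaultdict(dict)
    for path in sorted(Path(data_dir).glob("*/*.gz")):
        parts = path.name.split(".")
        # Skip anything that does not look like <document>.<layer>.gz.
        if len(parts) < 3 or parts[-2] not in LAYERS:
            continue
        doc_id = ".".join(parts[:-2])
        documents[(path.parent.name, doc_id)][parts[-2]] = path
    return dict(documents)


def missing_layers(documents):
    """Return the documents that lack one or more of the four layer files."""
    return {
        key: sorted(set(LAYERS) - set(layers))
        for key, layers in documents.items()
        if set(layers) != set(LAYERS)
    }
```

Calling group_by_document on the unpacked data directory (its location depends on where the archive was extracted) and then missing_layers on the result should give an empty dictionary if every document has all four files.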
The data are stored in the Prague Markup Language format (PML; Pajas and Štěpánek, 2008), an XML-based format for linguistic annotations (especially treebanks). For the sake of completeness, the PML schemata of the files can be found in the directory resources (the schemata are XML files that describe the structure of the annotated files).
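Because the files are gzip-compressed XML, they can also be inspected without TrEd using a generic XML parser. The sketch below uses only the Python standard library and makes no assumptions about the PML schema: it simply counts how often each element name occurs in one file. The function name count_elements is hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(path):
    """Count element names in a gzip-compressed XML file, ignoring namespaces."""
    counts = Counter()
    with gzip.open(path, "rb") as handle:
        # iterparse streams the file, so large documents need not fit in memory.
        for _, element in ET.iterparse(handle, events=("end",)):
            # Namespaced tags look like "{namespace}name"; keep only the name.
            counts[element.tag.rsplit("}", 1)[-1]] += 1
            element.clear()
    return counts
```

Running count_elements on a *.t.gz file gives a quick overview of which elements the tectogrammatical layer uses; their meaning is defined by the PML schemata in the directory resources.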
The tree editor TrEd (Pajas and Štěpánek, 2008) can be used to open, browse and modify the data. The editor can be downloaded for various platforms from its home page. Please follow the installation instructions given on that page for your operating system.
After installation, TrEd is able to open the data of PDiT-EDA 1.0. To see the annotation of a document on the tectogrammatical layer, open the respective file with the extension .t.gz and switch Mode: (top right corner) to PML_T_Discourse.
If you have trouble installing TrEd or browsing the data, please contact the authors (tred at ufal.mff.cuni.cz).
If you use the corpus data or wish to refer to them for any other reason, please cite the data publication:
Šárka Zikánová, Pavlína Synková, Jiří Mírovský: Enriched Discourse Annotation of PDiT Subset 1.0 (PDiT-EDA 1.0). Data/software, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-2906, Dec 2018
For documentation and more information about PDiT-EDA 1.0, please go to the PDiT-EDA 1.0 home page.
Petr Pajas and Jan Štěpánek: Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, The Coling 2008 Organizing Committee, Manchester, UK, ISBN 978-1-905593-45-3, pp. 673-680, 2008.
Magdaléna Rysová, Pavlína Synková, Jiří Mírovský, Eva Hajičová, Anna Nedoluzhko, Radek Ocelák, Jiří Pergler, Lucie Poláková, Veronika Pavlíková, Jana Zdeňková, Šárka Zikánová: Prague Discourse Treebank 2.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, http://hdl.handle.net/11234/1-1905, Dec 2016.
Šárka Zikánová, Pavlína Synková, Jiří Mírovský: Enriched Discourse Annotation of PDiT Subset 1.0 (PDiT-EDA 1.0). Data/software, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-2906, Dec 2018.