Enriched Discourse Annotation of PDiT Subset 1.0 (PDiT-EDA 1.0)

Introduction

PDiT-EDA 1.0 (Zikánová et al., 2018) is a treebank with rich annotation of discourse phenomena developed (2017 – 2018) within the project Implicitní vztahy v textové koherenci (Implicit relations in text coherence), i.e. project GA17-03461S of the Grant Agency of the Czech Republic.

The corpus contains extended annotation of discourse relations for a subset of the Prague Discourse Treebank 2.0 (Rysová et al., 2016), a large corpus manually annotated with explicit discourse relations. It newly adds implicit relations, entity-based relations, question-answer relations and other discourse-structuring phenomena.

PDiT-EDA 1.0 was published on December 20, 2018 in the Lindat/Clarin repository.

Data, License and Availability

PDiT-EDA 1.0 can be downloaded as a single zip archive from the LINDAT-Clarin repository. It is publicly available under the Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

After unzipping the downloaded archive, the data can be found in the directory data, where they are further divided into fifteen subdirectories representing individual genres (advice column, collection, comment, critical review, description, invitation, letters from readers, news report, overview, personality-focused interview, readers’ survey, reflective essay, sports news, topical interview, weather forecast). The annotation of each document is captured in four interlinked files, one for each layer of annotation: word layer (files *.w.gz), morphological layer (*.m.gz), analytical layer (*.a.gz), and tectogrammatical layer (*.t.gz).
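The layout described above can be explored with a short script. The following is only a minimal sketch: it assumes that the four layer files of one document share a common base name and differ only in their suffix, and that the genre subdirectories sit directly under data. The names group_documents and LAYER_SUFFIXES are hypothetical helpers, not part of the distribution.

```python
from collections import defaultdict
from pathlib import Path

# Suffixes of the four annotation layers named in this README.
LAYER_SUFFIXES = (".w.gz", ".m.gz", ".a.gz", ".t.gz")


def group_documents(data_dir):
    """Return a mapping genre -> document name -> {layer suffix: path}.

    Assumption: the layer files of one document share a base name,
    e.g. <name>.w.gz, <name>.m.gz, <name>.a.gz and <name>.t.gz.
    """
    corpus = defaultdict(lambda: defaultdict(dict))
    # Look one level down: data/<genre>/<file>.gz
    for path in sorted(Path(data_dir).glob("*/*.gz")):
        for suffix in LAYER_SUFFIXES:
            if path.name.endswith(suffix):
                document = path.name[: -len(suffix)]
                corpus[path.parent.name][document][suffix] = path
    return corpus


if __name__ == "__main__":
    # "data" is the directory created by unzipping the archive.
    for genre, documents in group_documents("data").items():
        print(genre, len(documents))
```

Run from the directory that contains data, the script prints each genre subdirectory together with the number of documents found in it.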

The data are stored in the Prague Markup Language format (PML, Pajas and Štěpánek 2008), which is an XML-based format for linguistic annotations (especially treebanks). For the sake of completeness, the PML schemata of the files can be found in the directory resources (the schemata are XML files that describe the structure of the annotated files).
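Because PML is XML-based and the files are gzip-compressed, a generic XML parser is enough for a first look at a file outside TrEd. The sketch below relies only on the Python standard library and assumes nothing about PML beyond its being well-formed XML. The names read_pml_root and local_name are hypothetical helpers, and the file path is passed in by the user rather than taken from the corpus.

```python
import gzip
import sys
import xml.etree.ElementTree as ET


def read_pml_root(path):
    """Decompress a *.gz annotation file and return its XML root element."""
    with gzip.open(path, "rb") as handle:
        return ET.parse(handle).getroot()


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


if __name__ == "__main__":
    # Usage: python inspect_pml.py path/to/some_document.t.gz
    root = read_pml_root(sys.argv[1])
    print("root element:", local_name(root.tag))
    print("top-level children:", [local_name(child.tag) for child in root])
```

This only shows the top-level structure of a file. To interpret the elements correctly, consult the PML schemata in the directory resources.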

How to browse the data

The tree editor TrEd (Pajas and Štěpánek 2008) can be used to open, browse and modify the data. The editor can be downloaded for various platforms from its home page. Please follow the installation instructions given on that page for your operating system.

After the installation, a few extensions need to be installed:

  1. Start TrEd.
  2. In the top menu, select Setup -> Manage Extensions...; a dialog window with a list of installed extensions appears.
  3. Click on the button "Get New Extensions"; a dialog window with a list of available (not yet installed) extensions appears.
  4. Make sure that at least the extensions "Discourse Annotation (discourse)" and "Prague Dependency Treebank 3.0 (pdt30)" are checked for installation (if they are not in the list, they may have already been installed).
  5. Click on the button "Install Selected"; the selected extensions (and some dependencies) get installed.
  6. Close all TrEd windows including the main application window and start TrEd again.

Now, TrEd is able to open the data of PDiT-EDA 1.0. To see the annotation of a document on the tectogrammatical layer, open the respective file with extension .t.gz, and switch Mode: (top right corner) to PML_T_Discourse.

In case of trouble with installing TrEd or browsing the data, please contact the authors at (tred at ufal.mff.cuni.cz).

How to cite

If you use the corpus data or wish to refer to them for any other reason, please cite the publication of the data:

Šárka Zikánová, Pavlína Synková, Jiří Mírovský: Enriched Discourse Annotation of PDiT Subset 1.0 (PDiT-EDA 1.0). Data/software, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-2906, Dec 2018

More Information

For documentation and more information about PDiT-EDA 1.0, please go to the PDiT-EDA 1.0 home page.

References

Petr Pajas and Jan Štěpánek: Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, The Coling 2008 Organizing Committee, Manchester, UK, ISBN 978-1-905593-45-3, pp. 673-680, 2008.

Magdaléna Rysová, Pavlína Synková, Jiří Mírovský, Eva Hajičová, Anna Nedoluzhko, Radek Ocelák, Jiří Pergler, Lucie Poláková, Veronika Pavlíková, Jana Zdeňková, Šárka Zikánová: Prague Discourse Treebank 2.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, http://hdl.handle.net/11234/1-1905, Dec 2016

Šárka Zikánová, Pavlína Synková, Jiří Mírovský: Enriched Discourse Annotation of PDiT Subset 1.0 (PDiT-EDA 1.0). Data/software, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-2906, Dec 2018