PDiT: Prague Discourse Treebank

Introduction

The annotation of discourse relations is a project related to the Prague Dependency Treebank 2.5 (PDT; Bejček et al. 2011), a revised, updated and extended version of the Prague Dependency Treebank 2.0 (Hajič et al. 2006). The project adds a new manually annotated layer of language description above the existing layers of the PDT (morphology, surface syntax and underlying syntax), and it portrays linguistic phenomena from the perspective of discourse structure and coherence. The discourse layer of the treebank comprises two subprojects:

  1. a lexically grounded approach to identifying discourse connectives, the discourse units they link and the semantic relations between these units, and
  2. the annotation of extended textual coreference and bridging relations.

With its 49,431 manually annotated sentences from Czech newspapers, the project serves as a large-scale resource for linguistic research in the area of discourse analysis as well as for computational experiments concerning automatic text analysis, information extraction, text summarization and other branches of NLP research.

In contrast to the majority of corpus projects with similar aims, the discourse-related information has been annotated directly on the syntactic trees and is technically part of the underlying syntax layer of the PDT. This methodological approach allows us to include discourse-relevant syntactic phenomena annotated earlier (such as discourse relations expressed by dependent clauses) in the discourse representation, and to take advantage of the syntactic structure itself (resolution of elliptical structures, parentheses, appositions etc.). In addition, when the treebank is queried or visualized, all the different types of linguistic information are interlinked and available at once.
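
The idea of storing discourse links on the syntactic nodes themselves can be illustrated with a minimal sketch. It does not reproduce the actual PDT data format: the class names (Node, DiscourseArrow), the field names, the example lemmas and the labels used below are all hypothetical and serve only to show how syntactic and discourse information can be read from the same structure in one pass.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class DiscourseArrow:
    """Hypothetical record of one discourse relation that starts at a node."""
    target_id: str          # id of the node heading the other argument
    relation: str           # semantic type label (illustrative value only)
    connective: List[str]   # ids of the nodes that form the connective


@dataclass
class Node:
    """Hypothetical underlying-syntax node that also carries discourse links."""
    node_id: str
    lemma: str
    functor: str                                   # syntactic label (illustrative)
    children: List["Node"] = field(default_factory=list)
    discourse: List[DiscourseArrow] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def relations(roots: List[Node]) -> List[Tuple[str, str, str, List[str]]]:
    """Collect (source lemma, relation, target lemma, connective lemmas)."""
    index = {n.node_id: n for root in roots for n in walk(root)}
    found = []
    for node in index.values():
        for arrow in node.discourse:
            found.append((
                node.lemma,
                arrow.relation,
                index[arrow.target_id].lemma,
                [index[c].lemma for c in arrow.connective],
            ))
    return found


# Two toy sentence trees: "He was tired. But he kept working."
s1 = Node("s1-pred", "be_tired", "PRED")
but = Node("s2-but", "but", "PREC")
s2 = Node("s2-pred", "keep_working", "PRED", children=[but],
          discourse=[DiscourseArrow("s1-pred", "opposition", ["s2-but"])])

found = relations([s1, s2])
print(found)
```

Because the arrow is attached to a tree node, the same traversal that reads lemmas and syntactic labels also reaches the discourse relation and its connective, which is the practical benefit described above.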

Bejček, Eduard, Panevová, Jarmila, Popelka, Jan, Smejkalová, Lenka, Straňák, Pavel, Ševčíková, Magda, Štěpánek, Jan, Toman, Josef, Žabokrtský, Zdeněk, Hajič, Jan. 2011. Prague Dependency Treebank 2.5. Data/software, Charles University in Prague, MFF, ÚFAL, Praha, Czechia, Dec 2011 (http://ufal.mff.cuni.cz/pdt2.5/)
Hajič, Jan, Panevová, Jarmila, Hajičová, Eva, Sgall, Petr, Pajas, Petr, Štěpánek, Jan, Havelka, Jiří, Mikulová, Marie, Žabokrtský, Zdeněk, Ševčíková-Razímová, Magda. 2006. Prague Dependency Treebank 2.0. Software prototype, Linguistic Data Consortium, Philadelphia, PA, USA, ISBN 1-58563-370-4, www.ldc.upenn.edu, Jul 2006 (http://ufal.mff.cuni.cz/pdt2.0/)