Extended Textual Coreference and Bridging Relations in PDT 2.0

Introduction

Annotation of extended textual coreference and bridging relations is a project related to the Prague Dependency Treebank 2.0 (PDT). It adds a new layer of manual annotation on top of the existing PDT layers (morphology, surface syntax and underlying syntax), and it captures linguistic phenomena from the perspective of text structure. The annotation continues the annotation of grammatical and pronominal textual coreference that was completed for PDT 2.0 in 2003. The present project covers two phenomena:

  1. coreferential relations (elements refer to the same extra-linguistic entity)
  2. bridging relations (elements refer to different extra-linguistic entities, but they stand in some semantic, lexical or conceptual relation)

In accordance with the pronominal textual coreference annotation in PDT, the annotation has been performed directly on the syntactic trees.

Detailed information about the annotation can be found in the technical report.

Data

The data consist of 49,431 manually annotated sentences from Czech newspapers (3,165 documents); detailed information about the original PDT data can be found in the PDT Guide. 90% of the data have been annotated by one annotator only; 10% of the data have been annotated by two annotators in parallel, with discrepancies resolved by a third annotator (directory dtest). The data are divided into ten directories (train-1 ... train-8, dtest, etest). The annotation of each document is stored in four interlinked files, one per layer of annotation: word layer (files *.w.gz), morphological layer (*.m.gz), analytical layer (*.a.gz), and tectogrammatical layer (*.t.gz). The annotation of extended textual coreference and bridging relations is part of the *.t.gz files.
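The four-files-per-document layout described above can be checked with a short script. The following is a minimal sketch in Python, not part of the distribution: it relies only on the file name patterns listed above (*.w.gz, *.m.gz, *.a.gz, *.t.gz), and it assumes that the part of the file name before the layer suffix identifies the document and is unique across directories. The function names group_by_document and incomplete_documents are made up for this illustration.

```python
import re
from collections import defaultdict
from pathlib import Path

# Layer suffixes named in the text: word, morphological, analytical, tectogrammatical.
LAYERS = ("w", "m", "a", "t")
_LAYER_FILE = re.compile(r"^(?P<doc>.+)\.(?P<layer>[wmat])\.gz$")


def group_by_document(paths):
    """Map each document stem to a {layer: path} dict, using file names only.

    Assumption: the stem before ".<layer>.gz" identifies the document.
    """
    documents = defaultdict(dict)
    for path in map(Path, paths):
        match = _LAYER_FILE.match(path.name)
        if match:  # skip anything that is not a layer file
            documents[match["doc"]][match["layer"]] = path
    return dict(documents)


def incomplete_documents(documents):
    """Return {stem: [missing layers]} for documents lacking one of the four files."""
    return {
        doc: [layer for layer in LAYERS if layer not in files]
        for doc, files in documents.items()
        if any(layer not in files for layer in LAYERS)
    }
```

To check a whole copy of the data, one could call group_by_document(Path(data_root).rglob("*.gz")), where data_root is the directory that holds train-1 ... train-8, dtest and etest, and then pass the result to incomplete_documents.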

How to browse the data

The tree editor TrEd is used to open and browse the data. The editor can be downloaded for various platforms from its home page. Please follow the installation instructions given on that page.

After the installation, a few extensions need to be installed:

  1. Start TrEd.
  2. In the top menu, select Setup -> Manage Extensions...; a dialog window with a list of installed extensions appears.
  3. Click on the button "Get New Extensions"; a dialog window with a list of available (not yet installed) extensions appears.
  4. Make sure that at least the extensions "Prague Dependency Treebank 2.0 Annotation (pdt20)", "Bridging Anaphora and Textual Coreference Annotation", and "Non-Dependency Relations Annotation - common" are checked for installation (if they are not in the list, they may have already been installed).
  5. Click on the button "Install Selected"; the selected extensions get installed.
  6. Close all TrEd windows, including the main application window, and start TrEd again.

Now TrEd is able to open the data from the CD. To see the annotation of a document, open the respective *.t.gz file.
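
For a quick look at a file outside TrEd, the sketch below counts the element names that occur in one file. It is only an illustration under a stated assumption: that the *.t.gz files are gzip-compressed XML. It assumes nothing about specific element or attribute names, and the function name count_elements is made up for this example.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(path):
    """Count element names in a gzip-compressed XML file, ignoring namespaces."""
    counts = Counter()
    with gzip.open(path, "rb") as stream:
        # Stream through the file so large documents are not held in memory.
        for _event, element in ET.iterparse(stream, events=("end",)):
            # Tags look like "{namespace}name"; keep only the local name.
            counts[element.tag.rsplit("}", 1)[-1]] += 1
            element.clear()  # release the finished subtree
    return counts
```

Calling count_elements on the path of any *.t.gz file and printing the most_common() entries of the result gives a rough inventory of what the file contains; the actual annotation is best inspected in TrEd.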

Acknowledgment

The development of Extended Textual Coreference and Bridging Relations in PDT 2.0 was supported by the following organizations and projects:

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
© 2011 Institute of Formal and Applied Linguistics, Charles University in Prague.