Lexicon of Czech Discourse Connectives 0.5 (CzeDLex 0.5)

Introduction

CzeDLex 0.5 (Mírovský et al., 2017) is a pilot version of a new electronic Lexicon of Czech Discourse Connectives, developed at the Institute of Formal and Applied Linguistics in 2015–2017 within the COST-cz project TextLink-cz (LD15052) of the Ministry of Education, Youth and Sports of the Czech Republic. The connectives in the lexicon were extracted, partly automatically, from the Prague Discourse Treebank 2.0 (M. Rysová et al., 2016), a large corpus manually annotated with discourse relations. The most frequent lexicon entries have been manually checked and supplemented with additional information and English translations.

How to open/browse the data

CzeDLex 0.5 can be downloaded from the LINDAT-Clarin repository under the Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

In the distribution, the lexicon is available in two formats, PML and HTML:

PML

The lexicon data are stored in a single file, czedlex0.5.pml, in the Prague Markup Language (PML) format, an XML-based format for linguistic annotations. The file is located in the directory PML in the distribution. For completeness, the PML schema of the lexicon, czedlex_schema.xml, which describes the structure of the data format, is included in the same directory.
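Because PML is XML-based, the file can also be inspected with a generic XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree; it is not the authors' tooling. The embedded sample document is hypothetical and far simpler than the real czedlex0.5.pml: only the element names lemma and usage are taken from this documentation, while the root element, the namespace URI and the nesting are assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for czedlex0.5.pml.
# Only the element names "lemma" and "usage" come from the documentation;
# the root element, the namespace URI and the nesting are assumptions.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<lexicon xmlns="http://example.org/hypothetical-pml">
  <lemma>
    <usage/>
    <usage/>
  </lemma>
  <lemma>
    <usage/>
  </lemma>
</lexicon>
"""


def local_name(tag):
    """Strip an XML namespace of the form '{uri}name' from a tag."""
    return tag.rsplit("}", 1)[-1]


def count_usages_per_lemma(source):
    """Return, for each 'lemma' element, the number of 'usage' descendants.

    `source` may be a file name or a binary file-like object.
    """
    root = ET.parse(source).getroot()
    counts = []
    for element in root.iter():
        if local_name(element.tag) == "lemma":
            usages = [e for e in element.iter() if local_name(e.tag) == "usage"]
            counts.append(len(usages))
    return counts


usage_counts = count_usages_per_lemma(io.BytesIO(SAMPLE.encode("utf-8")))
print(usage_counts)
```

Matching on local names keeps the sketch independent of the namespace actually declared in the file. The same function could be pointed at the real file in the PML directory, provided the real nesting matches the assumption made here; the schema czedlex_schema.xml is the authoritative description.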

The tree editor TrEd (Pajas and Štěpánek, 2008) can be used to open and browse the lexicon. The editor can be downloaded for various platforms from its home page. Please follow the installation instructions given on that page for your operating system. After the installation, an extension needs to be installed:

TrEd is now able to open the CzeDLex data. If you have trouble installing TrEd or browsing the data, please contact the authors at (tred at ufal.mff.cuni.cz).

HTML/on-line

For the users' convenience, the lexicon data have been exported to HTML, which presents the most important properties of the lexicon entries in a graphical, user-friendly way, with no need to install any tools. The HTML version is available either as part of the distribution (open the file index.html from the directory HTML in a web browser) or on-line. It allows the list of lexicon entries to be filtered by three criteria: the basic filter distinguishes primary and secondary connectives, the second filter selects connectives according to the discourse types they are able to express, and the last filter selects connectives according to their part of speech.
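The three filters can be pictured as independent conditions that an entry must satisfy together. The sketch below illustrates that logic only; it is not the code behind the HTML pages. The Entry class, its field names, the part-of-speech labels and the placeholder entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical flat view of a lexicon entry (field names are assumptions)."""
    lemma: str
    kind: str                # "primary" or "secondary"
    discourse_types: tuple   # discourse types the connective can express
    pos: str                 # part of speech


def filter_entries(entries, kind=None, discourse_type=None, pos=None):
    """Keep the entries that match every criterion that is not None."""
    result = []
    for entry in entries:
        if kind is not None and entry.kind != kind:
            continue
        if discourse_type is not None and discourse_type not in entry.discourse_types:
            continue
        if pos is not None and entry.pos != pos:
            continue
        result.append(entry)
    return result


# Placeholder entries for illustration; they are not taken from the lexicon.
ENTRIES = [
    Entry("lemma_a", "primary", ("opposition", "concession"), "conjunction"),
    Entry("lemma_b", "primary", ("reason-result",), "conjunction"),
    Entry("lemma_c", "secondary", ("reason-result",), "other"),
]

selected = filter_entries(ENTRIES, kind="primary", discourse_type="reason-result")
print([entry.lemma for entry in selected])
```

Leaving a criterion as None switches that filter off, which mirrors using only some of the filters at a time.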

Documentation and publications

Lexicon Structure

Level-one entry

The level-one entry in the lexicon structure is represented by the lemma of the connective. It is encoded in the element lemma and contains the following information:

Level-two entry

For each level-one entry in the lexicon structure, its connective and non-connective usages are represented as level-two entries. In connective-usages, the entries are grouped by discourse type (see Table 1), while in non-connective-usages, they are grouped by the part of speech of the expression. The level-two entry of the lexicon is encoded in the element usage and contains the following information:

Table 1: List of possible discourse types (senses)
CONTRAST                EXPANSION                CONTINGENCY              TEMPORAL
confrontation           conjunction              reason–result            synchrony
opposition              conjunctive alternative  pragmatic reason–result  precedence–succession
restrictive opposition  disjunctive alternative  explication
pragmatic contrast      instantiation            condition
concession              specification            pragmatic condition
correction              equivalence              purpose
gradation               generalization
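For programmatic use, the taxonomy of Table 1 can be written down as a simple mapping from the four classes to their discourse types. This is a sketch for illustration: the names DISCOURSE_TYPES and class_of are hypothetical, and the type names are spelled with plain hyphens as in Table 2, which may differ from the exact strings used in the lexicon data.

```python
# Table 1 as a mapping from discourse class to discourse types.
# Type names use plain hyphens (as in Table 2); the exact strings
# stored in the lexicon data may differ.
DISCOURSE_TYPES = {
    "CONTRAST": [
        "confrontation", "opposition", "restrictive opposition",
        "pragmatic contrast", "concession", "correction", "gradation",
    ],
    "EXPANSION": [
        "conjunction", "conjunctive alternative", "disjunctive alternative",
        "instantiation", "specification", "equivalence", "generalization",
    ],
    "CONTINGENCY": [
        "reason-result", "pragmatic reason-result", "explication",
        "condition", "pragmatic condition", "purpose",
    ],
    "TEMPORAL": ["synchrony", "precedence-succession"],
}


def class_of(discourse_type):
    """Return the class (column of Table 1) a discourse type belongs to."""
    for discourse_class, types in DISCOURSE_TYPES.items():
        if discourse_type in types:
            return discourse_class
    raise KeyError(discourse_type)


print(class_of("concession"))
print(sum(len(types) for types in DISCOURSE_TYPES.values()))
```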

Table 2: Possible values of the argument semantics (attribute arg_semantics)
relation                 argument semantics
concession               concession:expectation
                         concession:contra-expectation
condition                condition:condition
                         condition:result of condition
correction               correction:claim
                         correction:correction
explication              explication:claim
                         explication:argument
generalization           generalization:more specific
                         generalization:less specific
gradation                gradation:lower degree
                         gradation:higher degree
instantiation            instantiation:general statement
                         instantiation:example
pragmatic condition      pragmatic condition:pragmatic condition
                         pragmatic condition:result of pragmatic condition
pragmatic reason-result  pragmatic reason-result:pragmatic reason
                         pragmatic reason-result:pragmatic result
precedence-succession    precedence-succession:precedence
                         precedence-succession:succession
purpose                  purpose:action
                         purpose:motivation
reason-result            reason-result:reason
                         reason-result:result
restrictive opposition   restrictive opposition:general statement
                         restrictive opposition:exception
specification            specification:less specific
                         specification:more specific
all other relations      symmetric
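The pattern in Table 2 is regular: each asymmetric relation has two values of the form relation:role, and every other relation takes the single value symmetric. A minimal sketch of that rule follows; the names ARG_SEMANTICS and arg_semantics_values are hypothetical, and the function does not check that its argument is a valid discourse type.

```python
# Roles of the two arguments for each asymmetric relation, as in Table 2.
ARG_SEMANTICS = {
    "concession": ("expectation", "contra-expectation"),
    "condition": ("condition", "result of condition"),
    "correction": ("claim", "correction"),
    "explication": ("claim", "argument"),
    "generalization": ("more specific", "less specific"),
    "gradation": ("lower degree", "higher degree"),
    "instantiation": ("general statement", "example"),
    "pragmatic condition": ("pragmatic condition", "result of pragmatic condition"),
    "pragmatic reason-result": ("pragmatic reason", "pragmatic result"),
    "precedence-succession": ("precedence", "succession"),
    "purpose": ("action", "motivation"),
    "reason-result": ("reason", "result"),
    "restrictive opposition": ("general statement", "exception"),
    "specification": ("less specific", "more specific"),
}


def arg_semantics_values(relation):
    """Return the possible arg_semantics values for a relation.

    Relations not listed in ARG_SEMANTICS are treated as symmetric;
    the relation name itself is not validated.
    """
    roles = ARG_SEMANTICS.get(relation)
    if roles is None:
        return ["symmetric"]
    return [f"{relation}:{role}" for role in roles]


print(arg_semantics_values("purpose"))
print(arg_semantics_values("conjunction"))
```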

Corpus frequencies

Occurrence counts from the Prague Discourse Treebank 2.0 (PDiT 2.0) were added to all individual variants, complex forms, modifications and realizations, as well as to connective and non-connective usages (level-two entries) and to whole lemmas (level-one entries). They are stored in two attributes: pdt_count gives the number of all occurrences of the respective item, and pdt_intra the number of its intra-sentential occurrences.
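Since pdt_count covers all occurrences and pdt_intra only the intra-sentential ones, the number of inter-sentential occurrences can presumably be derived as their difference. The sketch below shows this arithmetic under several assumptions: that every occurrence is either intra- or inter-sentential, that the two values are encoded as XML attributes, and that they sit on a usage element. The numbers in the sample are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. The attribute names pdt_count and pdt_intra come
# from the documentation; their XML encoding, their placement on "usage"
# and the numbers themselves are assumptions made for this example.
SAMPLE = '<usage pdt_count="120" pdt_intra="90"/>'


def occurrence_split(element):
    """Derive inter-sentential counts from pdt_count and pdt_intra."""
    total = int(element.get("pdt_count"))
    intra = int(element.get("pdt_intra"))
    inter = total - intra  # assumes occurrences are either intra or inter
    intra_share = intra / total if total else 0.0
    return {"all": total, "intra": intra, "inter": inter, "intra_share": intra_share}


stats = occurrence_split(ET.fromstring(SAMPLE))
print(stats)
```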

Translations

Apart from the English translations listed in the descriptions of level-one and level-two entries, all complex forms, modified forms, realizations, variants (when possible) and examples have been translated into English. The translations are stored in english elements at the respective places.

Updates, further information and publications

For updates and more information, please consult the on-line documentation of CzeDLex 0.5 and the following papers and articles about CzeDLex:

Mírovský, J., Synková, P., Rysová, M., and L. Poláková: CzeDLex – A Lexicon of Czech Discourse Connectives. In: The Prague Bulletin of Mathematical Linguistics, No. 109, Univerzita Karlova, Prague, Czech Republic, ISSN 0032-6585, pp. 61-91, Oct 2017.

Synková, P., Rysová, M., Poláková, L. and J. Mírovský: Extracting a Lexicon of Discourse Connectives in Czech from an Annotated Corpus. In: Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation, Computing Society of the Philippines, Cebu, Philippines, 2017.

Mírovský, J., Synková, P., Rysová, M., and L. Poláková: Designing CzeDLex – A Lexicon of Czech Discourse Connectives. In: Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation, Kyung Hee University, Seoul, Korea, ISBN 978-89-6817-428-5, pp. 449-457, 2016.

References

Mírovský, J., Synková, P., Rysová, M., and L. Poláková: CzeDLex 0.5. Data/software, ÚFAL, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-2538, Dec 2017.

Pajas, P. and J. Štěpánek: Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, The Coling 2008 Organizing Committee, Manchester, UK, ISBN 978-1-905593-45-3, pp. 673-680, 2008.

Rysová, M., Synková, P., Mírovský, J., Hajičová, E., Nedoluzhko, A., Ocelák, R., Pergler, J., Poláková, L., Scheller, V., Zdeňková, J. and Š. Zikánová: Prague Discourse Treebank 2.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, Lindat/Clarin: http://hdl.handle.net/11234/1-1905, Dec 2016.