
CzeDLex 0.6

Please use the following text to cite this item or export to a predefined format:
Synková, Pavlína; Poláková, Lucie; Mírovský, Jiří and Rysová, Magdaléna, 2019, CzeDLex 0.6, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-3074.
Date issued
2019-12-19
Size
204 entries
Language(s)
Description
CzeDLex 0.6 is the second development version of the lexicon of Czech discourse connectives. The lexicon contains connectives partially automatically extracted from the Prague Discourse Treebank 2.0 (PDiT 2.0), a large corpus manually annotated with discourse relations. The most frequent entries (76 of the 204 entries in total, covering more than 90% of the discourse relations annotated in PDiT 2.0) have been manually checked, translated into English and supplemented with additional linguistic information.
Acknowledgement
 Files in this item
Name
czedlex0.6.zip
Size
963.58 KB
Format
application/zip
Description
CzeDLex 0.6 distribution
MD5
68e1a8d8a09f6c65b5cb03bded68cc3e
Preview
  File Preview
  • czedlex0.6
    • index.html (25 kB)
    • HTML
      • list_all.html (26 kB)
      • lemmas
        • l-vyplyvat.html (6 kB)
        • l-vedle.html (4 kB)
        • l-spise-2.html (5 kB)
        • l-zaver.html (9 kB)
        • l-soubezne.html (4 kB)
        • l-za_to.html (6 kB)
        • l-predtim.html (7 kB)
        • l-zatim.html (3 kB)
        • l-sice.html (25 kB)
        • l-pristupovat.html (4 kB)
        • l-;.html (6 kB)
        • l-vice.html (7 kB)
        • l-prvni.html (5 kB)
        • l-nez.html (6 kB)
        • l-pad.html (5 kB)
        • l-ostatne.html (5 kB)
        • l-to.html (8 kB)
        • l-upresnit.html (13 kB)
        • l-jak.html (10 kB)
        • l-odhadnout.html (4 kB)
        • l-tak.html (31 kB)
        • l-konkretne.html (4 kB)
        • l-natoz.html (4 kB)
        • l-nejenom.html (6 kB)
        • l-v_podstate.html (1 kB)
        • l-argumentovat.html (9 kB)
        • l-prestoze.html (6 kB)
        • l-diky.html (5 kB)
        • l-priklad.html (9 kB)
        • l-zato.html (9 kB)
        • l-dokud.html (8 kB)
        • l-nato.html (4 kB)
        • l-zcasti_zcasti.html (1 kB)
        • l-napriklad.html (5 kB)
        • l-jeste.html (13 kB)
        • l-ani_+_pripad.html (2 kB)
        • l-souvislost.html (12 kB)
        • l-prece.html (14 kB)
        • l-proste.html (6 kB)
        • l-nasledne.html (6 kB)
        • l-na_zaklade.html (2 kB)
        • l-prece_jen.html (2 kB)
        • l-rozdil.html (6 kB)
        • l-nemluve_o.html (3 kB)
        • l-vlastne.html (7 kB)
        • l-Xneg.html (13 kB)
        • l-respektive.html (6 kB)
        • l-i_kdyz.html (11 kB)
        • l-kdyby.html (10 kB)
        • l-i_potom.html (2 kB)
        • l-pozdeji.html (5 kB)
        • l-takze.html (10 kB)
        • l-jestli.html (5 kB)
        • l-pres.html (2 kB)
        • l-vyjma.html (2 kB)
        • l-vest.html (4 kB)
        • l-zpusobit.html (4 kB)
        • l-rada.html (4 kB)
        • l-protoze.html (10 kB)
        • l-videt.html (5 kB)
        • l-prelozeno.html (2 kB)
        • l-misto.html (7 kB)
        • l-tedy.html (25 kB)
        • l-pricist.html (5 kB)
        • l-proto.html (11 kB)
        • l-nadto.html (2 kB)
        • l-az.html (12 kB)
        • l-zatimco.html (6 kB)
        • l-pokud.html (12 kB)
        • l-strana.html (25 kB)
        • l-zasluhou.html (2 kB)
        • l-tim.html (18 kB)
        • l-na_rozdil.html (3 kB)
        • l-ze.html (27 kB)
        • l-vinou.html (2 kB)
        • l-posledni.html (4 kB)
        • l-alespon.html (7 kB)
        • l-dovrseni.html (2 kB)
        • l-jednak.html (9 kB)
        • l-nybrz.html (7 kB)
        • l-ac.html (3 kB)
        • l-byt.html (3 kB)
        • l-vzdyt.html (10 kB)
        • l-jakmile.html (6 kB)
        • l-mezi.html (2 kB)
        • l-totiz.html (19 kB)
        • l-vzhledem_k.html (11 kB)
        • l-zduvodnit.html (5 kB)
        • l-i_tak.html (3 kB)
        • l-naopak.html (13 kB)
        • l-skutecnost.html (11 kB)
        • l-nebot.html (9 kB)
        • l-coz.html (14 kB)
        • l-jelikoz.html (3 kB)
        • l-krome.html (11 kB)
        • l-nejen.html (14 kB)
        • l-pravdepodobnejsi.html (2 kB)
        • l-presneji.html (4 kB)
        • l-leda.html (2 kB)
        • l-jenze.html (12 kB)
        • l-potom.html (15 kB)
        • l-pote.html (17 kB)
        • l-jmenovite.html (3 kB)
        • l--.html (28 kB)
        • l-sotva.html (4 kB)
        • l-dokonce.html (8 kB)
        • l-duvod.html (43 kB)
        • l-dale.html (9 kB)
        • l-treba.html (8 kB)
        • l-pouze.html (18 kB)
        • l-mimo_jine.html (5 kB)
        • l-zaroven.html (7 kB)
        • l-jenomze.html (3 kB)
        • l-lec.html (5 kB)
        • l-ponevadz.html (3 kB)
        • l-tudiz.html (5 kB)
        • l-smer.html (6 kB)
        • l-ani.html (9 kB)
        • l-nicmene.html (12 kB)
        • l-neboli.html (4 kB)
        • l-zase.html (14 kB)
        • l-nehlede_na.html (2 kB)
        • l-presto.html (13 kB)
        • l-dodat.html (16 kB)
        • l-mezitim.html (7 kB)
        • l-ale.html (54 kB)
        • l-pripad.html (36 kB)
        • l-vsak.html (40 kB)
        • l-koneckoncu.html (6 kB)
        • l-kdykoli.html (4 kB)
        • l-nasledovat.html (4 kB)
        • l-tretice.html (4 kB)
        • l-pricina.html (5 kB)
        • l-spise.html (9 kB)
        • l-podminka.html (11 kB)
        • l-ovsem.html (26 kB)
        • l-kvuli.html (3 kB)
        • l-jakkoli.html (4 kB)
        • l-rovnez.html (7 kB)
        • l-ohled.html (7 kB)
        • l-ne.html (10 kB)
        • l-podobne.html (4 kB)
        • l-nejprve.html (8 kB)
        • l-ucel.html (7 kB)
        • l-cili.html (9 kB)
        • l-zkratka.html (5 kB)
        • l-budto.html (4 kB)
        • l-slovo.html (5 kB)
        • l-kontrastovat.html (3 kB)
        • l-k_tomu.html (8 kB)
        • l-prispivat.html (4 kB)
        • l-jen.html (15 kB)
        • l-receno.html (10 kB)
        • l-doplnit.html (6 kB)
        • l-li.html (14 kB)
        • l-soucasne.html (14 kB)
        • l-naproti.html (6 kB)
        • l-vzapeti.html (9 kB)
        • l-oproti.html (3 kB)
        • l-take.html (10 kB)
        • l-nakonec.html (14 kB)
        • l-:.html (24 kB)
        • l-popripade.html (8 kB)
        • l-at.html (7 kB)
        • l-pritom.html (25 kB)
        • l-plynout.html (8 kB)
        • l-vyjimka.html (3 kB)
        • l-pripadne.html (9 kB)
        • l-jestlize.html (12 kB)
        • l-kdyz.html (27 kB)
        • l-trebaze.html (4 kB)
        • l-nejenomze.html (2 kB)
        • l-nejenze.html (4 kB)
        • l-pokracovat.html (15 kB)
        • l-kdy.html (17 kB)
        • l-pricemz.html (12 kB)
        • l-tez.html (9 kB)
        • l-posleze.html (8 kB)
        • l-aneb.html (3 kB)
        • l-tj.html (7 kB)
        • l-nebo.html (14 kB)
        • l-jezto.html (1 kB)
        • l-dusledek.html (5 kB)
        • l-pak.html (20 kB)
        • l-a.html (72 kB)
        • l-navzdory.html (5 kB)
        • l-mimoto.html (3 kB)
        • l-aby.html (19 kB)
        • l-stejne.html (14 kB)
        • l-nikoli.html (7 kB)
        • l-ba.html (7 kB)
        • l-dosti.html (2 kB)
        • l-vysledek.html (5 kB)
        • l-znamenat.html (21 kB)
        • l-navic.html (7 kB)
        • l-kdezto.html (3 kB)
        • l-zahy.html (5 kB)
        • l-i.html (22 kB)
        • l-aniz.html (12 kB)
        • l-okamzik.html (4 kB)
        • l-jinak.html (14 kB)
        • l-oduvodneni.html (4 kB)
        • l-anebo.html (6 kB)
        • l-stejny.html (4 kB)
        • l-ci.html (7 kB)
      • index.html (601 B)
      • czedlex.css (5 kB)
      • lemma_types
        • selection.html (571 B)
        • list_secondary.html (9 kB)
        • list_primary.html (17 kB)
      • poss
        • list_numeral.html (407 B)
        • list_noun.html (3 kB)
        • selection.html (1 kB)
        • list_verb.html (2 kB)
        • list_punctuation.html (668 B)
        • list_conjunction coordinating.html (4 kB)
        • list_adverb.html (8 kB)
        • list_adjective.html (564 B)
        • list_particle.html (2 kB)
        • list_pronoun.html (675 B)
        • list_conjunction subordinating.html (3 kB)
        • list_preposition.html (1 kB)
      • header.html (949 B)
      • senses
        • list_instantiation.html (1 kB)
        • list_precedence-succession.html (4 kB)
        • list_pragmatic reason-result.html (2 kB)
        • list_condition.html (4 kB)
        • list_specification.html (3 kB)
        • list_conjunction.html (8 kB)
        • list_concession.html (6 kB)
        • list_gradation.html (6 kB)
        • list_restrictive opposition.html (3 kB)
        • list_synchrony.html (2 kB)
        • list_correction.html (3 kB)
        • selection.html (2 kB)
        • list_confrontation.html (5 kB)
        • list_pragmatic condition.html (1 kB)
        • list_pragmatic contrast.html (1 kB)
        • list_equivalence.html (2 kB)
        • list_opposition.html (5 kB)
        • list_purpose.html (1 kB)
        • list_reason-result.html (9 kB)
        • list_disjunctive alternative.html (1 kB)
        • list_explication.html (3 kB)
        • list_conjunctive alternative.html (1 kB)
        • list_generalization.html (2 kB)
    • PML
      • czedlex0.6.pml (1 MB)
      • czedlex_schema.xml (13 kB)
Name
czedlex0.6_index_lindat.html
Size
25.65 KB
Format
text/html
Description
CzeDLex 0.6 README
MD5
9d4349fa6dc25ba08a0b586faa626827
Preview
  File Preview

    Lexicon of Czech Discourse Connectives 0.6 (CzeDLex 0.6)

    Introduction

    CzeDLex 0.6 (Synková et al., 2019) is an updated version of the electronic Lexicon of Czech Discourse Connectives, developed at the Institute of Formal and Applied Linguistics in 2015 – 2017 within the COST-cz project TextLink-cz (LD15052) of the Ministry of Education, Youth and Sports of the Czech Republic, and in 2019 within the project Shallow discourse parsing in Czech (GAČR GA19-03490S). CzeDLex 0.6 is an update of the previous version, CzeDLex 0.5, which was published in 2017. The lexicon contains connectives partially automatically extracted from the Prague Discourse Treebank 2.0 (M. Rysová et al., 2016), a large corpus annotated manually with discourse relations. The most frequent lexicon entries have been manually checked and supplemented with additional information and English translations.

    In total, there are 204 level-one entries in CzeDLex 0.6; 76 entries (covering more than 90% of the discourse relations annotated in the PDiT 2.0) have been fully manually checked and supplemented with additional information. See the documentation for details.

    How to open/browse the data

    CzeDLex 0.6 can be downloaded from the LINDAT-Clarin repository under the Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
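    After downloading, the archive can be checked against the MD5 value listed for czedlex0.6.zip on the repository record. The following is a minimal sketch using only the Python standard library; the function names md5_of_file and is_intact are illustrative, and the local file path is whatever the download was saved as.

```python
import hashlib

# MD5 listed for czedlex0.6.zip on the repository record.
EXPECTED_MD5 = "68e1a8d8a09f6c65b5cb03bded68cc3e"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

    Calling is_intact("czedlex0.6.zip") on the downloaded file should return True if the download is complete and unmodified.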

    In the distribution, the lexicon is available in two formats, PML and HTML:

    PML

    The lexicon data are stored in a single file, czedlex0.6.pml, in the Prague Markup Language (PML) format (an XML-based format for linguistic annotations), located in the directory PML in the distribution. For completeness, the PML schema of the lexicon, czedlex_schema.xml (describing the structure of the data format), is included in the same directory.

    The tree editor TrEd (Pajas and Štěpánek, 2008) can be used to open and browse the lexicon. The editor can be downloaded for various platforms from its home page. Please follow the installation instructions given on that page for your operating system. After TrEd is installed, an extension needs to be added:

    • Start TrEd.
    • In the top menu, select Setup -> Manage Extensions...; a dialog window with a list of installed extensions appears.
    • Click on the button "Get New Extensions"; a dialog window with a list of available (not yet installed) extensions appears.
    • Make sure that at least the extension "Lexicon of Czech Discourse Connectives (czedlex)" is checked to install (if it is not in the list, it may have already been installed).
    • Click on the button "Install Selected"; the selected extensions get installed.
    • Close all TrEd windows including the main application window and start TrEd again.

    TrEd is now able to open the CzeDLex data. If you have trouble installing TrEd or browsing the data, please contact the authors (tred at ufal.mff.cuni.cz).

    HTML/on-line

    For the users' convenience, the lexicon data have been exported to HTML, which presents the most important properties of the lexicon entries in a graphical, user-friendly way, without the need to install any tools. It is available either as part of the distribution (open the file index.html from the directory HTML in a web browser) or on-line. The HTML version of the lexicon allows the list of lexicon entries to be filtered by three criteria: the basic filter distinguishes primary and secondary connectives, the second filter distinguishes the connectives according to the discourse types they can express, and the last filter distinguishes the connectives according to their part of speech.
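    The three filters can be pictured as independent conditions over the entries. The sketch below is not how the HTML export is generated; it only illustrates the filtering logic. The Entry class, the filter_entries function and the three sample entries are assumptions made for illustration (the connective types and senses assigned to the sample lemmas are not taken from the lexicon data).

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    lemma: str
    conn_type: str                            # "primary" or "secondary"
    senses: set = field(default_factory=set)  # discourse types
    pos: set = field(default_factory=set)     # parts of speech


# Illustrative sample only; values are assumptions, not lexicon data.
SAMPLE = [
    Entry("tedy", "primary", {"reason-result"}, {"conjunction coordinating"}),
    Entry("když", "primary", {"precedence-succession"}, {"conjunction subordinating"}),
    Entry("důvod", "secondary", {"reason-result"}, {"noun"}),
]


def filter_entries(entries, conn_type=None, sense=None, pos=None):
    """Keep entries that match every criterion that is not None."""
    result = []
    for entry in entries:
        if conn_type is not None and entry.conn_type != conn_type:
            continue
        if sense is not None and sense not in entry.senses:
            continue
        if pos is not None and pos not in entry.pos:
            continue
        result.append(entry)
    return result


primary_causal = filter_entries(SAMPLE, conn_type="primary", sense="reason-result")
```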

    Documentation and publications

    Lexicon Structure

    Level-one entry

    The level-one entry in the lexicon structure is represented by the lemma of the connective. It is encoded in the element lemma and contains the following information:

    • element text: the lemma of the connective
    • element english: an approximate English translation for a basic orientation; more precise translations are given in connection with semantic discourse types at level-two entries
    • element type: the type of the connective: primary vs. secondary
    • element struct: the structure of the connective: it signals whether the connective is single such as proto [therefore] or complex such as jednak jednak [on the one hand on the other hand]. The complex connectives are further differentiated in the attribute type according to their placement in the argument(s): complex connectives with parts occurring in both arguments (e.g. jednak jednak [on the one hand on the other hand] or buď nebo [either or]) are labeled correlative, while complex connectives with all parts occurring in a single argument are labeled continuous if no word can be inserted between the parts of the connective (e.g. the connective i když [even if, although]), or discontinuous if other words can occur between the connective parts (e.g. a potom [and then]). Multiplied connectives in coordinations (e.g. protože ... a protože [because ... and because]) are labeled as multiple.
    • element variants: a list of variants of the connective: they are further specified in the attribute type as stylistic (cf. neutral tedy [so.neutral] vs. informal teda [so.informal]) or orthographic (e.g. mimoto vs. mimo to [both meaning: besides]), or inflection (e.g. the form čímž [by which] is the instrumental form of the connective with the nominative form což [which])
    • element conn-usages: a list of connective usages – level-two entries
    • element non-conn-usages: a list of non-connective usages – level-two entries
    • element note: important information not encoded in other attributes
    • attribute id: a lexicon-wide unique identifier of this level-one lexicon entry
    • element src: an identifier of an annotator editing this lexicon entry
    • element is_checked: is set to 1 for entries considered to be fully checked and annotated
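    As a rough illustration of reading such entries, the sketch below collects the simple child elements of every lemma element into a dictionary. The XML fragment is hypothetical: its element names follow the list above, but the actual nesting, namespaces, identifiers and values in czedlex0.6.pml may differ, so the schema file should be consulted before adapting it.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names follow the list above, but the real
# nesting, namespaces, ids and values in czedlex0.6.pml may differ.
SAMPLE = """
<lexicon>
  <lemma id="l-proto">
    <text>proto</text>
    <english>therefore</english>
    <type>primary</type>
    <struct>single</struct>
    <is_checked>1</is_checked>
  </lemma>
  <lemma id="l-duvod">
    <text>důvod</text>
    <english>reason</english>
    <type>secondary</type>
    <struct>single</struct>
    <is_checked>0</is_checked>
  </lemma>
</lexicon>
"""


def local_name(tag):
    """Strip an XML namespace such as '{uri}lemma' down to 'lemma'."""
    return tag.rsplit("}", 1)[-1]


def read_lemmas(xml_text):
    """Collect the leaf child elements of every <lemma> into a dict."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    for node in root.iter():
        if local_name(node.tag) != "lemma":
            continue
        entry = {"id": node.get("id")}
        for child in node:
            if len(child) == 0:  # keep leaf elements only
                entry[local_name(child.tag)] = (child.text or "").strip()
        entries.append(entry)
    return entries


entries = read_lemmas(SAMPLE)
# Lemmas whose is_checked element is set to 1.
checked = [e["text"] for e in entries if e.get("is_checked") == "1"]
```

    Matching on local names keeps the sketch independent of whichever namespace the real file declares.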

    Level-two entry

    For each level-one entry in the lexicon structure, its connective and non-connective usages are represented as level-two entries. In connective-usages, the discourse type (see Table 1) is used as the basis for nesting, while in non-connective-usages, the part of speech of the expression is used. A level-two entry of the lexicon is encoded in the element usage and contains the following information:

    • element sense: the discourse type (see Table 1)
    • element scheme: the dependency scheme (used for secondary connectives only)
    • element gloss: a Czech expression disambiguating the meaning of the connective (a synonym or an explanatory phrase)
    • element english: an English translation (the gloss in English)
    • element pos: the part-of-speech appurtenance of the connective (the lemma) in the given usage. Conjunctions are further distinguished in the attribute subpos as coordinating or subordinating.
    • element syntax: for secondary connectives, the part-of-speech characteristic of the core word is accompanied by a syntactic characteristic of the whole secondary connective represented by this usage (nominal phrase, adjectival phrase, pronominal phrase, clause, adverbial phrase, or prepositional phrase).
    • element arg_semantics: this characteristic specifies the semantics of the argument the connective occurs in (see Table 2). From the semantic perspective, there is a basic difference between symmetric and asymmetric discourse relations. While both arguments of a symmetric relation (e.g. conjunction or synchrony) share the same general semantic characteristics, asymmetric discourse relations (e.g. reason–result or gradation) hold between arguments that differ in their semantic nature (e.g. one argument expresses the reason, the other the result). A connective of an asymmetric relation is characterized by its placement in one specific part of the relation it signals. For example, the coordinating conjunction tedy [thus] signals the result, while totiž [because] signals the reason. Similarly, the subordinating conjunctions než [until] and když [when] can be used for signalling precedence–succession – the former occurs in the argument expressing the event happening later, while the latter occurs in the argument expressing the earlier event. For symmetric relations, the element arg_semantics has the value symmetric. For complex correlative connectives forming level-one entries, the value is given for the second part of the connective.
    • element ordering: signals the linear order of the argument the connective occurs in (relative to the other – external – argument). In the majority of cases, ordering is connected with the part-of-speech characteristics – coordinating conjunctions, adverbs and particles are placed in the second argument in the linear order, while subordinating conjunctions can be placed in either of the arguments. There are, however, exceptions – e.g. the particle nejenže [not only that], which always occurs in the first argument – that justify including this characteristic as a separate element in the lexicon. The element ordering has one of these five values: 1 for connectives occurring only in the first argument, 2 for connectives in the second argument, 1 or 2 for connectives in the first or second argument, 1 and 2 for complex correlative connectives and N/A for secondary connectives forming a separate syntactic unit (e.g. Důvod je jednoduchý. [The reason is simple.]) and therefore occurring entirely between the arguments.
    • element integration: captures the position of the connective within the argument. According to their origin and other possible functions in text, Czech connectives have different positions in the argument. Only subordinating conjunctions and prototypical coordinating conjunctions occupy the very beginning of the clause or sentence; the position of other connectives varies. Some of them are typically placed in the clitic, i.e. second, position (e.g. však [however]), some typically occur in either the first or the second position (e.g. potom [then] or proto [therefore]), and for the class of focusing particles (i.e. expressions like také [also] or jenom [only]), the position is given by the information structure. For secondary connectives represented by the whole clause, integration is again N/A. Other values of this element, as follows from the examples just mentioned, are first, second, first or second, and any. For complex correlative connectives forming level-one entries, the value is given for the second part of the connective only.
    • element realizations: a list of non-modified and non-complex secondary connectives from PDiT 2.0 represented by the given dependency scheme (applies only to secondary connectives)
    • element modifications: a list of the connective modifications: e.g. for the lemma potom [then] expressing precedence–succession, there is a modification teprve potom [only then]. Secondary connectives can be modified as well – cf. hlavní důvod proč [the main reason why]. Modifications are further distinguished in the attribute type as eval (evaluative), modal, and intense (intensifying).
    • element complex_forms: a list of complex connectives: e.g. for the lemma potom [then] expressing precedence–succession, there are for example complex forms a potom [and then] and nejdřív potom [first then]. Secondary connectives can have complex forms as well – cf. a z tohoto důvodu [and for this reason]. The criterion for a complex form to be placed in the level-two entry under a certain lemma is the ability of the basic connective (the given lemma) to express the same discourse type. This means that e.g. the complex connective přesto však [yet however] expressing the discourse type of concession is placed in the respective level-two entries under both lemmas přesto [yet] and však [however], because both these single connectives individually also express the discourse type of concession in PDiT 2.0. Further, according to its placement either in both arguments or in one argument, each complex form is labeled in the attribute type as correlative, continuous, discontinuous or multiple (see above among the level-one entry characteristics). Within each complex form, the element note may contain additional information.
    • element examples: a list of a few illustrative examples from PDiT 2.0 and their English translations. Both intra-sentential and inter-sentential examples are – if available in the corpus – given for the connective usages and marked as such in the attribute type (intra vs. inter).
    • element is_rare: signals a rare use of the connective with the given discourse type
    • element register: captures whether the connective is used in the neutral, formal or informal register
    • element note: important information not encoded in other attributes
    • attribute id: a unique identifier of this level-two entry
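    The closed value sets described above for ordering and integration lend themselves to a simple consistency check. The following is a minimal sketch: the value sets are taken from the descriptions above, while the check_usage function and the dictionary representation of a usage are assumptions made for illustration.

```python
# Value sets as listed in the descriptions of the two elements above.
ORDERING_VALUES = {"1", "2", "1 or 2", "1 and 2", "N/A"}
INTEGRATION_VALUES = {"first", "second", "first or second", "any", "N/A"}


def check_usage(usage):
    """Return a list of problems found in a usage given as a plain dict."""
    problems = []
    if usage.get("ordering") not in ORDERING_VALUES:
        problems.append(f"unexpected ordering: {usage.get('ordering')!r}")
    if usage.get("integration") not in INTEGRATION_VALUES:
        problems.append(f"unexpected integration: {usage.get('integration')!r}")
    return problems


# Hypothetical usages, not taken from the lexicon data.
ok = check_usage({"ordering": "1 or 2", "integration": "first"})
bad = check_usage({"ordering": "3", "integration": "second"})
```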
    Table 1: List of possible discourse types (senses)
    CONTRAST                 EXPANSION                 CONTINGENCY                 TEMPORAL
    confrontation            conjunction               reason–result               synchrony
    opposition               conjunctive alternative   pragmatic reason–result     precedence–succession
    restrictive opposition   disjunctive alternative   explication
    pragmatic contrast       instantiation             condition
    concession               specification             pragmatic condition
    correction               equivalence               purpose
    gradation                generalization

    Table 2: Possible values of the argument semantics (attribute arg_semantics)
    relation                   argument semantics
    concession                 concession:expectation
                               concession:contra-expectation
    condition                  condition:condition
                               condition:result of condition
    correction                 correction:claim
                               correction:correction
    explication                explication:claim
                               explication:argument
    generalization             generalization:more specific
                               generalization:less specific
    gradation                  gradation:lower degree
                               gradation:higher degree
    instantiation              instantiation:general statement
                               instantiation:example
    pragmatic condition        pragmatic condition:pragmatic condition
                               pragmatic condition:result of pragmatic condition
    pragmatic reason-result    pragmatic reason-result:pragmatic reason
                               pragmatic reason-result:pragmatic result
    precedence-succession      precedence-succession:precedence
                               precedence-succession:succession
    purpose                    purpose:action
                               purpose:motivation
    reason-result              reason-result:reason
                               reason-result:result
    restrictive opposition     restrictive opposition:general statement
                               restrictive opposition:exception
    specification              specification:less specific
                               specification:more specific
    all other relations        symmetric
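    Table 2 can be written down as a lookup from a relation to its two argument roles, with every relation not listed falling back to symmetric. This is a sketch of one possible representation; the names ARG_SEMANTICS and allowed_values are illustrative and not part of the lexicon.

```python
# Asymmetric relations and their two argument roles, as listed in Table 2.
ARG_SEMANTICS = {
    "concession": ("expectation", "contra-expectation"),
    "condition": ("condition", "result of condition"),
    "correction": ("claim", "correction"),
    "explication": ("claim", "argument"),
    "generalization": ("more specific", "less specific"),
    "gradation": ("lower degree", "higher degree"),
    "instantiation": ("general statement", "example"),
    "pragmatic condition": ("pragmatic condition", "result of pragmatic condition"),
    "pragmatic reason-result": ("pragmatic reason", "pragmatic result"),
    "precedence-succession": ("precedence", "succession"),
    "purpose": ("action", "motivation"),
    "reason-result": ("reason", "result"),
    "restrictive opposition": ("general statement", "exception"),
    "specification": ("less specific", "more specific"),
}


def allowed_values(relation):
    """Return the arg_semantics values Table 2 allows for a relation."""
    roles = ARG_SEMANTICS.get(relation)
    if roles is None:
        # "all other relations" in Table 2
        return ["symmetric"]
    return [f"{relation}:{role}" for role in roles]
```

    For example, allowed_values("reason-result") yields the two values reason-result:reason and reason-result:result, while allowed_values("conjunction") yields only symmetric.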

    Corpus frequencies

    Numbers of occurrences in the PDiT 2.0 were added to all individual variants, complex forms, modifications and realizations, as well as to connective and non-connective usages (level-two entries) and the whole lemmas (level-one entries), in two attributes: pdt_count and pdt_intra, capturing numbers of all vs. intra-sentential occurrences of the respective items.
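    Since pdt_count covers all occurrences and pdt_intra only the intra-sentential ones, the number of inter-sentential occurrences can be derived as their difference. The sketch below assumes that every counted occurrence is either intra- or inter-sentential; the function name and the numbers in the example are hypothetical.

```python
def inter_sentential(pdt_count, pdt_intra):
    """Derive the inter-sentential count from the two stored attributes.

    Assumes every occurrence is either intra- or inter-sentential.
    """
    if not 0 <= pdt_intra <= pdt_count:
        raise ValueError("pdt_intra must be between 0 and pdt_count")
    return pdt_count - pdt_intra


# Hypothetical numbers, not taken from the lexicon.
example = inter_sentential(pdt_count=120, pdt_intra=45)
```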

    Translations

    Apart from English translations listed in the descriptions of level-one and level-two entries, all complex forms, modified forms, realizations, variants (when possible) and (so far only some) examples have been translated to English (the translations are captured in elements english at the respective places).

    Updates, further information and publications

    For updates, see the web pages of the current development version of CzeDLex. For more information about CzeDLex 0.6, please consult the on-line documentation to CzeDLex 0.6 and the following papers/articles written about CzeDLex:

    Mírovský, J., Synková, P., Rysová, M., and L. Poláková: CzeDLex – A Lexicon of Czech Discourse Connectives. In: The Prague Bulletin of Mathematical Linguistics, No. 109, Univerzita Karlova, Prague, Czech Republic, ISSN 0032-6585, pp. 61-91, Oct 2017.

    Synková, P., Rysová, M., Poláková, L. and J. Mírovský: Extracting a Lexicon of Discourse Connectives in Czech from an Annotated Corpus. In: Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation, Computing Society of the Philippines, Cebu, Philippines, 2017.

    Mírovský, J., Synková, P., Rysová, M., and L. Poláková: Designing CzeDLex – A Lexicon of Czech Discourse Connectives. In: Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation, Kyung Hee University, Seoul, Korea, ISBN 978-89-6817-428-5, pp. 449-457, 2016.

    References

    Synková, P., Poláková, L., Mírovský, J., and M. Rysová: CzeDLex 0.6. Data/software, ÚFAL, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-3074, Dec 2019.

    Pajas, P. and J. Štěpánek: Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, The Coling 2008 Organizing Committee, Manchester, UK, ISBN 978-1-905593-45-3, pp. 673-680, 2008.

    Rysová, M., Synková, P., Mírovský, J., Hajičová, E., Nedoluzhko, A., Ocelák, R., Pergler, J., Poláková, L., Scheller, V., Zdeňková, J. and Š. Zikánová: Prague Discourse Treebank 2.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, Lindat/Clarin: http://hdl.handle.net/11234/1-1905, Dec 2016.