Preamble 1.0

Please use the following text to cite this item or export to a predefined format:
Hladká, Barbora and Mírovský, Jiří, 2022, Preamble 1.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-4912.
Date issued
2022-10-14
Size
10173 words, 522 items
Description
Preamble 1.0 is a multilingual annotated corpus of the preamble of the EU REGULATION 2020/2092 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL. The corpus consists of four language versions of the preamble (Czech, English, French, Polish), each of them annotated with sentence subjects. The data were annotated in the Brat tool (https://brat.nlplab.org/) and are distributed in the Brat native format, i.e. each annotated preamble is represented by the original plain text and a stand-off annotation file.
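As an illustration of the stand-off layout described above, the following is a minimal sketch of reading text-bound annotations from a Brat `.ann` file. It assumes the general Brat convention of lines shaped like `ID<TAB>LABEL START END<TAB>TEXT`. The names `TextBound` and `parse_ann`, the sample sentence offsets, and the label `Subject` are hypothetical and not taken from the corpus; the label actually used in the distributed files may differ.

```python
from typing import List, NamedTuple


class TextBound(NamedTuple):
    """One text-bound (T) annotation from a Brat .ann file."""
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann(ann_content: str) -> List[TextBound]:
    """Parse text-bound annotations that cover a single contiguous span."""
    result = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, and blank lines
        ann_id, header, text = line.split("\t", 2)
        label, offsets = header.split(" ", 1)
        if ";" in offsets:
            continue  # discontinuous span; not expected for single-word subjects
        start, end = (int(value) for value in offsets.split())
        result.append(TextBound(ann_id, label, start, end, text))
    return result


# Hypothetical sample; the label "Subject" is an assumption, not read from the corpus.
sample_txt = "European leaders said that the budget matters."
sample_ann = "T1\tSubject 9 16\tleaders\n"

annotations = parse_ann(sample_ann)
for annotation in annotations:
    # Offsets index into the plain text, so the slice should equal the stored text.
    assert sample_txt[annotation.start:annotation.end] == annotation.text
```

Because the offsets refer to character positions in the accompanying `.txt` file, the plain text should be read without any normalization before slicing.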
 Files in this item
Name
README.TXT
Size
2.92 KB
Format
text/plain
Description
Text
MD5
bdf82b43c155d3bf117d509e8080d449
Preview
    ============
    Preamble 1.0
    ============
    
    
    Authors
    =======
    
    Barbora Hladká (hladka@ufal.mff.cuni.cz)
    Jiří Mírovský (mirovsky@ufal.mff.cuni.cz)
    
    Introduction
    ============
    
    Preamble 1.0 is a multilingual annotated corpus of the preamble of the
    EU REGULATION 2020/2092 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL
    of 16 December 2020 on a general regime of conditionality for the protection
    of the Union budget. The corpus consists of four language versions of the
    preamble (source texts downloaded from the following web pages):
    
    Czech (https://eur-lex.europa.eu/legal-content/CS/TXT/PDF/?uri=CELEX:32020R2092)
    English (https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32020R2092)
    French (https://eur-lex.europa.eu/legal-content/FR/TXT/PDF/?uri=CELEX:32020R2092)
    Polish (https://eur-lex.europa.eu/legal-content/PL/TXT/PDF/?uri=CELEX:32020R2092)
    
    The language selection is based on languages used in the course NPFL134 (Data
    Analytics for Students of Social Studies and Humanities) at the Institute of
    Formal and Applied Linguistics in the summer semester of 2022
    (https://ufal.mff.cuni.cz/courses/npfl134).
    
    The annotation covers sentence subjects only. An annotated subject is always a
    single word, i.e., in the sentence "European leaders said that...", only "leaders"
    is annotated as a subject. In accordance with this rule, which follows the
    Universal Dependencies framework, each member of a coordinated subject is
    annotated separately, and articles are not included.
    
    The annotations were produced by students of the NPFL134 course; each preamble
    was annotated independently by two students, and their automatically unified
    results were subsequently curated by an arbiter.
    
    The four language versions contain the following numbers of words and annotated subjects:
    
    Czech: 2283 words, 143 annotated subjects
    English: 2687 words, 139 annotated subjects
    French: 2941 words, 133 annotated subjects
    Polish: 2262 words, 107 annotated subjects
    
    Data Format
    ======== . . .
Name
Preamble1.0.zip
Size
34.14 KB
Format
application/zip
Description
Zip
MD5
6d9825ad77f165a80011e331be0a5e61
Preview
  • Preamble1.0
    • README.TXT (2 kB)
    • data
      • fr
        • preamble_fr.ann (4 kB)
        • preamble_fr.txt (20 kB)
      • en
        • preamble_en.ann (4 kB)
        • preamble_en.txt (17 kB)
      • pl
        • preamble_pl.ann (3 kB)
        • preamble_pl.txt (18 kB)
      • cs
        • preamble_cs.ann (4 kB)
        • preamble_cs.txt (17 kB)
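
The listing above suggests one `.txt`/`.ann` pair per language under `Preamble1.0/data/<lang>/`. A minimal sketch of reading those pairs directly from the archive might look as follows. The helper names `read_pair` and `make_demo_archive` are hypothetical, UTF-8 encoding is an assumption, and the demo archive is a made-up stand-in so that the example runs without the real download.

```python
import io
import zipfile
from typing import Tuple

LANGS = ("cs", "en", "fr", "pl")


def read_pair(archive: zipfile.ZipFile, lang: str) -> Tuple[str, str]:
    """Return (plain text, stand-off annotations) for one language version."""
    base = f"Preamble1.0/data/{lang}/preamble_{lang}"
    # UTF-8 is an assumption; the listing does not state the encoding.
    text = archive.read(base + ".txt").decode("utf-8")
    ann = archive.read(base + ".ann").decode("utf-8")
    return text, ann


def make_demo_archive() -> bytes:
    """Build a tiny in-memory stand-in with the same layout (contents are made up)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for lang in LANGS:
            base = f"Preamble1.0/data/{lang}/preamble_{lang}"
            zf.writestr(base + ".txt", f"placeholder text ({lang})")
            zf.writestr(base + ".ann", "")
    return buffer.getvalue()


# With the real download, pass the path to Preamble1.0.zip instead of the BytesIO object.
with zipfile.ZipFile(io.BytesIO(make_demo_archive())) as archive:
    pairs = {lang: read_pair(archive, lang) for lang in LANGS}
```

Reading from the archive in place avoids unpacking it and keeps each text together with its annotation file.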