============
Preamble 1.0
============

Authors
=======

Barbora Hladká (hladka@ufal.mff.cuni.cz)
Jiří Mírovský (mirovsky@ufal.mff.cuni.cz)

Introduction
============

Preamble 1.0 is a multilingual annotated corpus of the preamble of the EU REGULATION 2020/2092 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 16 December 2020 on a general regime of conditionality for the protection of the Union budget.

The corpus consists of four language versions of the preamble (source texts downloaded from the following web pages):

Czech (https://eur-lex.europa.eu/legal-content/CS/TXT/PDF/?uri=CELEX:32020R2092)
English (https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32020R2092)
French (https://eur-lex.europa.eu/legal-content/FR/TXT/PDF/?uri=CELEX:32020R2092)
Polish (https://eur-lex.europa.eu/legal-content/PL/TXT/PDF/?uri=CELEX:32020R2092)

The language selection is based on the languages used in the course NPFL134 (Data Analytics for Students of Social Studies and Humanities) at the Institute of Formal and Applied Linguistics in the summer semester of 2022 (https://ufal.mff.cuni.cz/courses/npfl134).

The corpus is annotated for subjects. An annotated subject is always a single word; for example, in the sentence "European leaders said that...", only "leaders" is annotated as a subject. In accordance with this rule, which follows the Universal Dependencies framework, each member of a coordinated subject is annotated separately and articles are not included. The annotations were produced by students of the NPFL134 course: each preamble was annotated independently by two students, and their automatically unified results were subsequently curated by an arbiter.
The four language versions contain the following numbers of words and annotated subjects:

Czech: 2283 words, 143 annotated subjects
English: 2687 words, 139 annotated subjects
French: 2941 words, 133 annotated subjects
Polish: 2262 words, 107 annotated subjects

Data Format
===========

The data were annotated in the Brat tool (https://brat.nlplab.org/) and are distributed in the Brat native format, i.e., each annotated preamble is represented by the original plain text and a stand-off annotation file. The annotation file carries a single type of information: the annotation of subjects (text spans marked with the tag "SUBJECT").

Licence
=======

The corpus Preamble 1.0 is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) licence.

For more information and updates, see https://ufal.mff.cuni.cz/courses/npfl134/subjann

Acknowledgement
===============

The work on the corpus was financed by the 4EU+ Alliance under grant agreement No 2021_F3_10. The manual annotation of the subjects was organized as an annotation task for students attending the course NPFL134 (Data Analytics for Students of Social Studies and Humanities) at the Institute of Formal and Applied Linguistics in the summer semester of 2022.
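
Reading the Annotations
=======================

The Brat stand-off format mentioned under Data Format stores each text-bound annotation on one line: an ID starting with "T", then a tab, the tag with its character offsets, another tab, and the covered text. The following is a minimal sketch of how the "SUBJECT" spans might be read in Python. The function name parse_subjects, the Subject tuple, and the sample sentence with its offsets are made up for illustration and are not taken from the corpus; the file names used in the distribution are not assumed here, so the sketch works on in-memory strings.

```python
from typing import List, NamedTuple


class Subject(NamedTuple):
    """One text-bound annotation (hypothetical helper type)."""
    ann_id: str
    start: int
    end: int
    text: str


def parse_subjects(ann_content: str, tag: str = "SUBJECT") -> List[Subject]:
    """Collect text-bound annotations carrying `tag` from Brat .ann content."""
    subjects = []
    for line in ann_content.splitlines():
        # Text-bound annotations have IDs that start with "T".
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        ann_id, type_and_offsets, text = parts[0], parts[1], parts[2]
        label, _, offsets = type_and_offsets.partition(" ")
        if label != tag:
            continue
        # Brat writes discontinuous spans as "start end;start end".
        # Single-word subjects should be contiguous, but to be safe we
        # keep the first start and the last end.
        fragments = [fragment.split() for fragment in offsets.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        subjects.append(Subject(ann_id, start, end, text))
    return subjects


# Illustrative data only (not from the corpus).
sample_text = "European leaders said that the rules apply."
sample_ann = "T1\tSUBJECT 9 16\tleaders\nT2\tSUBJECT 31 36\trules\n"

subjects = parse_subjects(sample_ann)
for subject in subjects:
    # The offsets index into the plain text, so the slice should match.
    print(subject.ann_id, subject.text, sample_text[subject.start:subject.end])
```

With real data, the contents of an annotation file and its accompanying plain-text file would be read with open(..., encoding="utf-8") and passed in place of the sample strings.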