ParCzech 4.0 is a corpus of stenographic protocols that record the meetings of the Czech Chamber of Deputies (PS) held from 25 November 2013 to 26 July 2023. A blue border on the menu (on the left) marks the latest version of a ParCzech corpus, while a gray border marks previous versions, e.g., ParCzech PS7 1.0. In the future, we will use one more color, red, to distinguish live corpora from stable ones.
The following concepts of parliamentary procedure are relevant for the corpus: during a term (volební období), there are meetings (schůze), each of which is a group of sittings (projednávání) and typically takes place over more than one day. Each meeting has its own agenda, and an agenda item (bod schůze) is discussed in speeches (promluvy) that can be made at more than one sitting.
The protocols are automatically enriched with morphological, syntactic and named-entity annotations using UDPipe 2 and NameTag 2. They are provided in their original HTML format and in the Parla-CLARIN TEI format. The audio recordings are available as well and are aligned with the texts in the annotated TEI files (with the interfix ana). ParCzech 4.0, including both the text and audio files, is downloadable from the LINDAT/CLARIAH-CZ repository, see Download.
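For readers who download the annotated TEI files, the following minimal Python sketch shows one way to list tokens with their annotations. It assumes that tokens are encoded as TEI `<w>` elements carrying the standard TEI `lemma` and `pos` attributes; the exact attributes used in the ParCzech files are not described here and may differ. The `SAMPLE` snippet and the function name `read_tokens` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# TEI namespace; Parla-CLARIN is a TEI customization.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample, NOT taken from the corpus. The attribute names
# (lemma, pos) are assumptions based on the generic TEI <w> element.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><u>
    <s>
      <w lemma="vláda" pos="NOUN">Vláda</w>
      <w lemma="jednat" pos="VERB">jedná</w>
    </s>
  </u></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return (word form, lemma, pos) for every <w> element in the text."""
    root = ET.fromstring(xml_text)
    return [
        (w.text, w.get("lemma"), w.get("pos"))
        for w in root.iterfind(".//tei:w", TEI_NS)
    ]


tokens = read_tokens(SAMPLE)
for form, lemma, pos in tokens:
    print(form, lemma, pos)
```

For a real file, `ET.parse(path).getroot()` would replace `ET.fromstring`, and the attribute names should be checked against the downloaded data first.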
To make the corpus accessible in a more user-friendly way than the protocols published by the Parliament, we use the web-based platform TEITOK, which lets users (1) browse the corpus (see Browse in the menu) and (2) search the corpus using CQL (see CQL Search).
Browse
The documents (= protocols) are named in a way that reflects the hierarchy of terms, meetings, sittings, agenda item ids within a given sitting and agenda item ids within a given meeting, e.g., ps2013(term)-019(meeting)-01(sitting)-002(agenda item of the current sitting)-019(agenda item of the current meeting). For illustration, eight different agenda items were discussed at the sitting on 21 October 2014, in the following order: 000, 060, 019, 016, 048, 128, 004, 005. Meetings are numbered from 001 onwards for each term, sittings from 01 onwards for each meeting, agenda item ids from 001 onwards for each sitting and agenda items from 001 onwards for each meeting.
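The naming scheme can be unpacked programmatically. The sketch below is a minimal Python example, assuming that a document name consists of exactly the five hyphen-separated fields shown in the example above (without the explanatory labels and without a file extension). The names `ProtocolName` and `parse_protocol_name` are hypothetical and not part of TEITOK or ParCzech.

```python
import re
from typing import NamedTuple


class ProtocolName(NamedTuple):
    term: str             # e.g. "ps2013"
    meeting: int          # meeting number within the term
    sitting: int          # sitting number within the meeting
    item_in_sitting: int  # agenda item id within the sitting
    item_in_meeting: int  # agenda item id within the meeting


# Field widths follow the example name; they are an assumption.
_NAME_RE = re.compile(r"^(ps\d{4})-(\d{3})-(\d{2})-(\d{3})-(\d{3})$")


def parse_protocol_name(name: str) -> ProtocolName:
    """Split a document name into its five hierarchical fields."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected document name: {name!r}")
    term, meeting, sitting, item_s, item_m = match.groups()
    return ProtocolName(term, int(meeting), int(sitting), int(item_s), int(item_m))


parsed = parse_protocol_name("ps2013-019-01-002-019")
print(parsed.term, parsed.meeting, parsed.sitting)
```

Converting the numeric fields to integers drops the leading zeros, which is convenient for sorting but means the original name has to be rebuilt with zero padding if needed.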
For single documents, the options at the bottom of each page let you
- download them as XML files (original TEI files and files for TEITOK) and text files
- view their morphological annotation (point the mouse at any word in the document and a pop-up window with its morphological annotation appears)
- view their syntactic annotations (see Dependencies)
- run WaveSurfer for their audio visualization (see WaveSurfer)
- view their named entity annotations (see Named Entities)
Search
- CQL Search: TEITOK uses the Corpus Query Processor to query corpora in the CQP query language
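As an illustration of what a CQP query looks like, the following Python sketch assembles simple token-sequence queries as strings that could be pasted into the CQL Search box. The bracket syntax `[attribute="value"]` is standard CQP; the attribute names `lemma` and `upos` are assumptions, and the attributes actually indexed for this corpus should be checked in the search interface. The helper names `token` and `sequence` are hypothetical.

```python
def token(**attrs: str) -> str:
    """Build one CQP token expression, e.g. [lemma="vláda"].

    Values are inserted as-is, so they are treated as regular
    expressions by CQP; quotes inside values are not escaped here.
    """
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def sequence(*tokens: str) -> str:
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens)


# Assumed attribute names: lemma, upos.
query = sequence(token(lemma="poslanec"), token(upos="VERB"))
print(query)
```

Calling `token()` with no arguments yields `[]`, which in CQP matches any single token and can serve as a wildcard position in a sequence.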
Acknowledgement
This work has been using language resources and tools developed and/or stored and/or distributed by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (projects LM2018101 and LM2023062).