LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
http://lindat.mff.cuni.cz:80/repository/xmlui
The LINDAT/CLARIAH-CZ digital repository system captures, stores, indexes, preserves, and distributes digital research material.
2024-03-08T07:44:15Z
De Latinae Linguae Reparatione treebank
http://hdl.handle.net/11234/1-5438
Gamba, Federica; Cecchini, Flavio Massimiliano
This corpus contains the text of De Latinae Linguae Reparatione authored by Marcus Antonius Sabellicus (1436–1506), annotated with respect to lemmas, part-of-speech tags, morphological features and syntactic dependencies according to the typological formalism of Universal Dependencies (UD).
2024-01-17T00:00:00Z
GrandStaff-LMX: Linearized MusicXML Encoding of the GrandStaff Dataset
http://hdl.handle.net/11234/1-5423
Mayer, Jiří; Straka, Milan; Hajič jr., Jan; Pecina, Pavel
The GrandStaff-LMX dataset is based on the GrandStaff dataset described in the "End-to-end optical music recognition for pianoform sheet music" paper by Antonio Ríos-Vila et al., 2023, https://doi.org/10.1007/s10032-023-00432-z .
The GrandStaff-LMX dataset contains MusicXML and Linearized MusicXML encodings of all systems from the original dataset, suitable for evaluation with the TEDn metric. It also contains the official GrandStaff train/dev/test split.
2024-02-12T00:00:00Z
OLiMPiC 1.0: OpenScore Lieder Linearized MusicXML Piano Corpus
http://hdl.handle.net/11234/1-5419
Mayer, Jiří; Straka, Milan; Hajič jr., Jan; Pecina, Pavel
OLiMPiC: OpenScore Lieder Linearized MusicXML Piano Corpus is a dataset containing synthetic and scanned images of pianoform music scores. The scores and the scanned images originate from the OpenScore Lieder Corpus https://github.com/OpenScore/Lieder .
OLiMPiC contains the scores in MusicXML and Linearized MusicXML encoding, suitable for evaluation with the TEDn metric. The official train/dev/test split is also provided.
2024-02-12T00:00:00Z
AlbNews Albanian Topic Modeling
http://hdl.handle.net/11234/1-5411
Çano, Erion
AlbNews is a topic modeling corpus of news headlines in Albanian, consisting of 600 labeled samples and 2600 unlabeled samples. Each labeled sample includes a headline text retrieved from Albanian online news portals and one of four labels: 'pol' for politics, 'cul' for culture, 'eco' for economy, and 'spo' for sport. Each unlabeled sample contains a headline text only.
The AlbTopic corpus is released under the CC-BY 4.0 license (https://creativecommons.org/licenses/by/4.0/). If using the data, please cite the following paper:
Çano Erion, Lamaj Dario. AlbNews: A Corpus of Headlines for Topic Modeling in Albanian. CoRR, abs/2402.04028, 2024. URL: https://arxiv.org/abs/2402.04028.
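The labeling scheme described above could be represented as follows. This is only a minimal sketch: the `Headline` class, the `LABEL_NAMES` mapping, and the sample text are hypothetical illustrations, and the actual file layout of the AlbNews release is not described here.

```python
from dataclasses import dataclass
from typing import Optional

# The four label codes and their meanings, as listed in the description.
LABEL_NAMES = {
    "pol": "politics",
    "cul": "culture",
    "eco": "economy",
    "spo": "sport",
}

# Sample counts stated in the description.
LABELED_COUNT = 600
UNLABELED_COUNT = 2600


@dataclass
class Headline:
    """One sample: a headline text plus an optional label code (hypothetical structure)."""
    text: str
    label: Optional[str] = None  # None for unlabeled samples

    def topic(self) -> Optional[str]:
        """Return the full topic name, or None if the sample is unlabeled."""
        if self.label is None:
            return None
        return LABEL_NAMES[self.label]


# Placeholder text, not taken from the corpus.
labeled = Headline("<headline text>", "spo")
unlabeled = Headline("<headline text>")
total_samples = LABELED_COUNT + UNLABELED_COUNT
```

An unknown label code raises a `KeyError` in `topic()`, which makes malformed rows easy to spot when loading the labeled part.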
2024-02-07T00:00:00Z
ESIC 1.1 -- Europarl Simultaneous Interpreting Corpus (2024-02-05)
http://hdl.handle.net/11234/1-5415
Macháček, Dominik; Žilinec, Matúš; Bojar, Ondřej
ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations.
The corpus contains the source English video and audio. The interpreters' voices are not published within the corpus, but there is a tool that downloads them from the European Parliament website, where they are publicly available.
The transcripts are equipped with metadata (disfluencies, mixing voices and languages, read or spontaneous speech, etc.), punctuated, and with word-level timestamps.
The speeches in the corpus come from the European Parliament plenary sessions held in 2008-11. Most of the speakers are MEPs, both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speech (date, topic, read or spontaneous).
ESIC has validation and evaluation parts.
The current version is ESIC v1.1, which extends v1.0 with manual sentence alignment of the tri-parallel texts and with bi-parallel sentence alignment of the English original transcripts and the German interpreting.
2024-02-05T00:00:00Z
Diakorp v6: diachronic corpus of Czech
http://hdl.handle.net/11234/1-5413
Kučera, Karel; Řehořková, Anna; Stluka, Martin
A diachronic corpus of Czech comprising 3.45 million words (i.e. 4.1 million tokens). It contains 116 texts from the 14th to the 20th century. The texts are transcribed, not transliterated. Diakorp v6 is provided in a CoNLL-U-like vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to registered users of CNC at http://www.korpus.cz
2015-12-18T00:00:00Z
ParCzech 4.0
http://hdl.handle.net/11234/1-5360
Kopp, Matyáš
The ParCzech 4.0 corpus consists of stenographic protocols that record the Chamber of Deputies' meetings in the 7th term (2013-2017), the 8th term (2017-2021) and the current 9th term (2021-Jul 2023). The protocols are provided in their original HTML format and in Parla-CLARIN TEI format. The corpus is automatically enriched with morphological, syntactic, and named-entity annotations using UDPipe 2 and NameTag 2. The audio files are aligned with the texts in the annotated TEI files.
The audio files in this corpus are available in the AudioPSP 24.01 corpus (http://hdl.handle.net/11234/1-5404).
This corpus covers the same period as the ParlaMint-CZ corpus v4.0 (http://hdl.handle.net/11356/1860). The ParCzech corpus follows and extends the ParlaMint schema. Both the annotated and non-annotated versions include hypertext references to voting and parliamentary prints. Beyond ParlaMint's recommendation, the annotated version contains source audio alignment, PDT xtag, and more detailed CNEC2.0 named entity categorization.
2024-01-31T00:00:00Z
AudioPSP 24.01: Audio recordings of proceedings of the Chamber of Deputies of the Parliament of the Czech Republic
http://hdl.handle.net/11234/1-5404
Kopp, Matyáš
This record contains audio recordings of proceedings of the Chamber of Deputies of the Parliament of the Czech Republic. The recordings have been provided by the official websites of the Chamber of Deputies, and the set contains them in their original format with no further processing.
Recordings cover all available audio files from 2013-11-25 to 2023-07-26. Audio files are packed by year (2013-2023) and quarter (Q1-Q4) in tar archives audioPSP-YYYY-QN.tar.
Furthermore, there are two TSV files: audioPSP-meta.quarterArchive.tsv contains metadata about archives, and audioPSP-meta.audioFile.tsv contains metadata about individual audio files.
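The archive naming scheme above can be illustrated with a short sketch that lists the archive names for the quarters spanned by the stated date range. It assumes that one archive exists for every quarter between the first and last recording date, which the description does not confirm; `quarter_of` and `archive_names` are hypothetical helper names.

```python
from datetime import date


def quarter_of(d: date) -> int:
    """Return the calendar quarter (1-4) that the date falls in."""
    return (d.month - 1) // 3 + 1


def archive_names(start: date, end: date) -> list[str]:
    """List names following the audioPSP-YYYY-QN.tar pattern for every
    quarter from the one containing `start` to the one containing `end`.

    Assumption: one archive per quarter in the range.
    """
    names = []
    year, quarter = start.year, quarter_of(start)
    last = (end.year, quarter_of(end))
    while (year, quarter) <= last:
        names.append(f"audioPSP-{year}-Q{quarter}.tar")
        # Advance to the next quarter, rolling over into the next year after Q4.
        if quarter < 4:
            quarter += 1
        else:
            year, quarter = year + 1, 1
    return names


# Date range taken from the record description.
names = archive_names(date(2013, 11, 25), date(2023, 7, 26))
print(names[0], names[-1], len(names))
```

Under that assumption the list runs from audioPSP-2013-Q4.tar to audioPSP-2023-Q3.tar; audioPSP-meta.quarterArchive.tsv remains the authoritative listing of the archives actually present.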
2024-01-01T00:00:00Z
KUK 0.0
http://hdl.handle.net/11234/1-5363
Hladká, Barbora; Cinková, Silvie; Kuk, Michal; Mírovský, Jiří; Novotná, Tereza; Zahálková, Kristýna Nguyen
KUK 0.0 is a pilot version of a corpus of Czech legal and administrative texts designated as data for manual and automatic assessment of accessibility (comprehensibility or clarity) of Czech legal texts.
2023-12-31T00:00:00Z
HWC2023 – Hamburg.de Website Corpus 2023
http://hdl.handle.net/11372/LRT-5288
Rüdiger, Jan Oliver
A petition for a referendum (titled "Schluss mit Gendersprache in Verwaltung und Bildung", in English "abolition of gender language in administration and education") was formed in Hamburg in February 2023. The project "Empirical Gender Linguistics" at the "Leibniz Institute for the German Language" took this as an opportunity to scrape the entire "https://www.hamburg.de" website (except the list of ships in the Port of Hamburg and the yellow pages). The Hamburg.de website is the central digital contact point for citizens. The scraped texts were cleaned, processed and annotated using http://www.CorpusExplorer.de (TreeTagger - POS/Lemma information).
We use the corpus to analyze the use of words with gender signs.
2023-03-06T00:00:00Z