Annotate is a web and desktop application that should simplify the process of transforming photos of manuscripts to a browsable collection. It also allows users to annotate parts of the displayed images.
In NLP Centre, dividing text into sentences is currently done with
a tool which uses rule-based system. In order to make enough training
data for machine learning, annotators manually split the corpus of contemporary text
CBB.blog (1 million tokens) into sentences.
Each file contains one hundredth of the whole corpus and all data were
processed in parallel by two annotators.
The corpus was created from ten contemporary blogs:
hintzu.otaku.cz
modnipeklo.cz
bloc.cz
aleneprokopova.blogspot.com
blog.aktualne.cz
fuchsova.blog.onaidnes.cz
havlik.blog.idnes.cz
blog.aktualne.centrum.cz
klusak.blogspot.cz
myego.cz/welldone
We present the Czech Court Decisions Dataset (CCDD) -- a dataset of 300 manually annotated court decisions published by The Supreme Court of the Czech Republic and the Constitutional Court of the Czech Republic.
The package contains Czech recordings of the Visual History Archive which consists of the interviews with the Holocaust survivors. The archive consists of audio recordings, four types of automatic transcripts, manual annotations of selected topics and interviews' metadata. The archive totally contains 353 recordings and 592 hours of interviews.
This editor was developed especially for the needs of the KAMOKO project (https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-3261). The editor allows the quick entry of example sentences and sentence variants as well as the corresponding speaker ratings.
The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT). It contains over 11000 valency frames for more than 7000 verbs which occurred in the PDT or PCEDT. It is available in electronically processable format (XML) together with the aforementioned treebanks (to be viewed and edited by TrEd, the PDT/PCEDT main annotation tool), and also in more human readable form including corpus examples (see the WEBSITE link below). The main feature of the lexicon is its linking to the annotated corpora - each occurrence of each verb is linked to the appropriate valency frame with additional (generalized) information about its usage and surface morphosyntactic form alternatives.
The valency lexicon PDT-Vallex 4.0 has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT, the spoken language corpus (PDTSC) and corpus of user-generated texts in the project Faust). It contains over 14500 valency frames for almost 8500 verbs which occurred in the PDT, PCEDT, PDTSC and Faust corpora. In addition, there are nouns, adjectives and adverbs, linked from the PDT part only, increasing the total to over 17000 valency frames for 13000 words. All the corpora have been published in 2020 as the PDT-C 1.0 corpus with the PDT-Vallex 4.0 dictionary included; this is a copy of the dictionary published as a separate item for those not interested in the corpora themselves. It is available in electronically processable format (XML), and also in more human readable form including corpus examples (see the WEBSITE link below, and the links to its main publications elsewhere in this metadata). The main feature of the lexicon is its linking to the annotated corpora - each occurrence of each verb is linked to the appropriate valency frame with additional (generalized) information about its usage and surface morphosyntactic form alternatives. It replaces the previously published unversioned edition of PDT-Vallex from 2014.
Annotation of discourse relations is a project related to the Prague Dependency Treebank 2.5. It represents a new manually annotated layer of language description, above the existing layers of the PDT, and it portrays linguistic phenomena from the perspective of discourse structure and coherence. and GACR P406/12/0658, GACR P406/2010/0875, GACR 405/09/0729, Ministry of Education ME10018, Ministry of Education LM2010013
Input data, individual experimental annotations, and a complete and detailed overview of the measured results related to the experiment described in the referenced paper.
SiR 1.0 is a corpus of Czech articles published on iRozhlas, a news server of a Czech public radio (https://www.irozhlas.cz/). It is a collection of 1 718 articles (42 890 sentences, 614 995 words) with manually annotated attribution of citation phrases and sources. The sources are classified into several classes of named and unnamed sources.
The corpus consists of three parts, depending on the quality of the annotations:
(i) triple-annotated articles: 46 articles (933 sentences, 13 242 words) annotated independently by three annotators and subsequently curated by an arbiter,
(ii) double-annotated articles: 543 articles (12 347 sentences, 180 622 words) annotated independently by two annotators and automatically unified,
and (iii) single-annotated articles: 1 129 articles (29 610 sentences, 421 131 words) annotated each only by a single annotator.
The data were annotated in the Brat tool (https://brat.nlplab.org/) and are distributed in the Brat native format, i.e. each article is represented by the original plain text and a stand-off annotation file.
Please cite the following paper when using the corpus for your research: Hladká Barbora, Jiří Mírovský, Matyáš Kopp, Václav Moravec. Annotating Attribution in Czech News Server Articles. In: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pages 1817–1823, Marseille, France 20-25 June 2022.