
The SiR corpus

The SiR corpus is a collection of articles published on the iRozhlas server, manually annotated for citations. For example, in the sentence Jak už vědci uvedli při prvním kole vykopávek, jde pro ně o záhadu. [As the scientists already stated during the first round of excavations, it is a mystery to them.], the citation phrase uvedli [stated] refers to the citation source vědci [scientists], who provided the information. The manual annotation was organized as an annotation task for students of FSV UK. The students marked citation phrases, linked them to their citation sources, and assigned a type to each source. In total, 290 students annotated 1 718 articles (published as SiR 1.0 in the Lindat/CLARIAH-CZ repository). The articles that were double- or triple-annotated (589 out of 1 718) are available here in Teitok for searching. Details on the annotation task can be found here.

The citation sources are classified into the following categories:

The articles were also processed automatically with UDPipe for morphology and surface syntax, and with NameTag for named entity recognition.

Browsing the corpus

The annotated articles (documents) can be browsed using the following filters:

The automatically added linguistic information can be accessed via the links at the end of each document (Dependencies and Named Entities). This information can also be used when searching the articles, for example to identify masculine animate citation sources.
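As a rough illustration of what "masculine animate" means in terms of the automatic morphology, here is a minimal Python sketch. It assumes the UDPipe analysis is available as CoNLL-U text with Universal Dependencies features (`Gender=Masc`, `Animacy=Anim`). The function names and the sample rows are hypothetical and hand-written for this example, not taken from the corpus. The sketch only finds masculine animate nouns; it does not model the link to the manual citation-source annotation, which the corpus search itself would provide.

```python
# Sketch only: filters CoNLL-U token rows by morphological features.
# The helper names and the sample below are illustrative assumptions.

def parse_feats(feats: str) -> dict[str, str]:
    """Turn a FEATS string such as 'Case=Nom|Gender=Masc' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def masculine_animate_nouns(conllu: str) -> list[str]:
    """Return word forms of nouns tagged Gender=Masc and Animacy=Anim."""
    forms = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip malformed rows, multiword ranges, empty nodes
        form, upos = cols[1], cols[3]
        feats = parse_feats(cols[5])
        # UD allows multi-valued features such as Gender=Fem,Masc.
        genders = feats.get("Gender", "").split(",")
        # Restrict to nouns: Czech past participles (e.g. "uvedli")
        # also carry Gender and Animacy through agreement.
        if (upos in {"NOUN", "PROPN"}
                and "Masc" in genders
                and feats.get("Animacy") == "Anim"):
            forms.append(form)
    return forms


# Hand-written rows for three words of the example sentence above.
SAMPLE = "\n".join([
    "# text = vědci uvedli záhadu",
    "1\tvědci\tvědec\tNOUN\t_\t"
    "Animacy=Anim|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos\t2\tnsubj\t_\t_",
    "2\tuvedli\tuvést\tVERB\t_\t"
    "Animacy=Anim|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part\t0\troot\t_\t_",
    "3\tzáhadu\tzáhada\tNOUN\t_\t"
    "Case=Acc|Gender=Fem|Number=Sing|Polarity=Pos\t2\tobj\t_\t_",
])

print(masculine_animate_nouns(SAMPLE))
```

The part-of-speech restriction matters because a filter on gender and animacy alone would also match the citation phrase uvedli, which agrees with its subject.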

Searching in the corpus

The corpus is searched using the CQL query language. Several illustrative example queries, with descriptions, can be found in the Search section.

GitHub

You can contact the authors with any questions or comments at zdroj@ufal.mff.cuni.cz, or you can open an issue on GitHub.

Publications

Acknowledgement

The work on the corpus was funded by the project Signál a šum v éře Žurnalistiky 5.0 - komparativní perspektiva novinářských žánrů automatizovaných obsahů. The manual annotation of the citations was organized as an annotation task for students attending the courses "Digital Communication and Sources" and "Ethics for Journalists" at the Faculty of Social Sciences, Charles University, so that they could practice selected theoretical journalistic concepts.