GitHub repository

The SiR corpus

The SiR corpus is a collection of articles published on the iRozhlas news server with manual annotation of citations. For example, in the sentence "Jak už vědci uvedli při prvním kole vykopávek, jde pro ně o záhadu." [As the scientists already stated during the first round of excavations, it is a mystery to them.], the citation phrase uvedli [stated] refers to the citation source vědci [scientists], who provided the information. The manual annotation was organized as an annotation task for students of FSV UK (the Faculty of Social Sciences, Charles University). The students marked citation phrases, linked them to their citation sources, and assigned a type to each source. In total, 290 students annotated 1 718 articles (published as SiR 1.0 in the Lindat/CLARIAH-CZ repository). Articles that were double- or triple-annotated (589 out of 1 718) are available here in Teitok for searching. Details on the annotation task can be found here.

The citation sources are classified into the following categories:

unnamed
- anonymous (marked as anonymous in the articles)
- partially anonymous (anonymous-partial)
named
- official - institutional affiliation
  - political (official-political)
  - non-political (official-non-political)
- unofficial (unofficial)
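
For scripts that process the annotations, the hierarchy above can be written down as a small lookup table. The following is only a sketch: the label strings are the ones given in parentheses in the list, while the tuple layout and the helper names `is_named` and `is_official` are illustrative assumptions, not part of the corpus distribution.

```python
# Source-type labels from the list above, mapped to their branch of the hierarchy.
# The (branch, subgroup) tuple layout is an illustrative assumption.
SOURCE_TYPES = {
    "anonymous": ("unnamed", None),
    "anonymous-partial": ("unnamed", None),
    "official-political": ("named", "official"),
    "official-non-political": ("named", "official"),
    "unofficial": ("named", None),
}


def is_named(label: str) -> bool:
    """Return True if the label belongs to the 'named' branch."""
    return SOURCE_TYPES[label][0] == "named"


def is_official(label: str) -> bool:
    """Return True if the label denotes a source with an institutional affiliation."""
    return SOURCE_TYPES[label][1] == "official"
```

An unknown label raises a `KeyError`, which makes typos in label strings visible early.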

In addition, the articles were automatically processed with UDPipe for morphology and surface syntax, and with NameTag for named entity recognition.

Browsing the corpus

The annotated articles (documents) can be browsed using the following filters:

Annotation quality
Some of the articles were annotated by several students in order to measure inter-annotator agreement. Version 1.0 of the corpus consists of articles annotated two or three times. Annotations in the triple-annotated articles were subsequently checked and corrected by a fourth annotator, see 46 articles (Browse > Annotation quality > xxx). Annotations in the double-annotated articles were not manually checked; the corpus therefore contains only those annotations on which both annotators agreed, see 543 articles (Browse > Annotation quality > xx).
Author
Section
Tag
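
The rule for double-annotated articles described under Annotation quality (keep only what both annotators agreed on) amounts to a set intersection. The sketch below illustrates the idea under stated assumptions: the `Citation` record, its fields, and exact-match comparison on all fields are hypothetical, since the precise matching criteria used for the corpus are not described here.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Citation(NamedTuple):
    """Hypothetical record for one annotated citation."""
    phrase_span: Tuple[int, int]   # token offsets of the citation phrase
    source_span: Tuple[int, int]   # token offsets of the citation source
    source_type: str               # e.g. "official-political"


def agreed_annotations(first: Iterable[Citation],
                       second: Iterable[Citation]) -> Set[Citation]:
    """Keep only citations that both annotators marked identically."""
    return set(first) & set(second)
```

With this definition, two annotations that share the same spans but differ in source type count as a disagreement and are dropped.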

The automatically added linguistic information can be accessed via the links at the end of each document (Dependencies and Named Entities). This information can be used when searching the articles, for example to identify masculine animate citation sources.
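
As an offline illustration of the masculine animate check, the sketch below tests a morphological feature string in the Universal Dependencies format that UDPipe produces (`Gender=Masc`, `Animacy=Anim`). It assumes the features are available as a `|`-separated FEATS string, as in CoNLL-U output; the function names are hypothetical, and the attribute names used inside the corpus search interface may differ.

```python
def parse_feats(feats: str) -> dict:
    """Parse a UD FEATS string such as 'Case=Nom|Gender=Masc' into a dict."""
    if feats in ("", "_"):  # '_' marks an empty feature set in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def is_masculine_animate(feats: str) -> bool:
    """Return True if the token is tagged as masculine and animate."""
    parsed = parse_feats(feats)
    return parsed.get("Gender") == "Masc" and parsed.get("Animacy") == "Anim"
```

For instance, a feature string like `Animacy=Anim|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos`, which is the kind of analysis expected for vědci [scientists], passes the check, while a feminine noun does not.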

Searching in the corpus

The CQL query language is used for searching in the corpus. Several illustrative examples of queries with descriptions can be found in the Search section.

GitHub

You can contact the authors with any questions or comments at the e-mail address zdroj@ufal.mff.cuni.cz, or you can open an issue on GitHub.

Publications

Barbora Hladká, Jiří Mírovský, Matyáš Kopp, Václav Moravec. Annotating Attribution in Czech News Server Articles. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pages 1817–1823, Marseille, France, 20-25 June 2022. pdf

Acknowledgement

The work on the corpus has been funded by the project Signál a šum v éře Žurnalistiky 5.0 - komparativní perspektiva novinářských žánrů automatizovaných obsahů [Signal and Noise in the Era of Journalism 5.0 - a comparative perspective of journalistic genres of automated content]. The manual annotation of the citations was organized as an annotation task for students attending the courses "Digital Communication and Sources" and "Ethics for Journalists" at the Faculty of Social Sciences, Charles University, so that they could practice selected theoretical journalistic concepts.