The SiR corpus
The SiR corpus is a collection of articles published on the iRozhlas server with manual annotation of citations. For example, in the sentence "Jak už vědci uvedli při prvním kole vykopávek, jde pro ně o záhadu." [As the scientists already stated during the first round of excavations, it is a mystery to them.], the citation phrase uvedli [stated] refers to the citation source vědci [scientists], who provided the information. The manual annotation of the citations was organized as an annotation task for students of FSV UK (Faculty of Social Sciences, Charles University). The students marked citation phrases, linked them to their citation sources, and assigned a type to each source. In total, 290 students annotated 1 718 articles (published as SiR 1.0 in the Lindat/CLARIAH-CZ repository). Articles that were double- or triple-annotated (589 out of 1 718) are available here in Teitok for searching. Details on the annotation task can be found here.
The citation sources are classified into the following categories:
- unnamed
  - anonymous (marked in the articles as anonymous)
  - anonymous partially (anonymous-partial)
- named
  - official - institutional affiliation
    - political (official-political)
    - non-political (official-non-political)
  - unofficial (unofficial)
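For readers who process the annotations programmatically, the hierarchy above can be written down as a small lookup table. The following is only a sketch: the tag strings come from the list above, while the tuple layout and the helper names `is_named` and `is_official` are illustrative assumptions, not part of the corpus tooling.

```python
# Tag strings are taken from the category list above.
# The path tuples and helper functions are hypothetical, for illustration only.
SOURCE_TYPES = {
    "anonymous": ("unnamed", "anonymous"),
    "anonymous-partial": ("unnamed", "anonymous partially"),
    "official-political": ("named", "official", "political"),
    "official-non-political": ("named", "official", "non-political"),
    "unofficial": ("named", "unofficial"),
}


def is_named(tag: str) -> bool:
    """Return True if the tag belongs to the 'named' branch of the hierarchy."""
    return SOURCE_TYPES[tag][0] == "named"


def is_official(tag: str) -> bool:
    """Return True if the tag denotes a source with an institutional affiliation."""
    path = SOURCE_TYPES[tag]
    return len(path) > 1 and path[1] == "official"
```

With this table, `is_named("official-political")` is true and `is_official("unofficial")` is false.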
In addition, the articles were automatically processed with UDPipe for morphology and surface syntax, and with NameTag for named entity recognition.
Browsing the corpus
The annotated articles (documents) can be browsed using the following filters:
- Annotation quality
  Some of the articles were annotated by several students to measure inter-annotator agreement. Version 1.0 of the corpus consists of articles annotated twice or three times. Annotations in the triple-annotated articles were subsequently checked and corrected by a fourth annotator; see 46 articles (Browse > Annotation quality > xxx). Annotations in the double-annotated articles were not manually checked, so the corpus contains only those annotations that both annotators agreed upon; see 543 articles (Browse > Annotation quality > xx).
- Author
- Section
- Tag
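The rule for double-annotated articles (keep only what both annotators agreed upon) can be sketched as a set intersection. This is a minimal illustration under stated assumptions: the `Citation` record, its token-span fields, and the exact-match notion of agreement are hypothetical; the corpus documentation does not specify how agreement was actually computed.

```python
from typing import List, NamedTuple, Tuple


class Citation(NamedTuple):
    """Hypothetical record of one annotated citation (not the corpus format)."""
    phrase: Tuple[int, int]   # token span of the citation phrase
    source: Tuple[int, int]   # token span of the citation source
    source_type: str          # e.g. "official-political"


def agreed_annotations(a: List[Citation], b: List[Citation]) -> List[Citation]:
    """Keep annotations from annotator A that annotator B made identically.

    Assumption: two annotations agree only if phrase span, source span,
    and source type all match exactly.
    """
    b_set = set(b)
    return [c for c in a if c in b_set]
```

Under this strict definition, two annotations that link the same phrase and source but assign different source types are both dropped.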
The automatically added linguistic information can be accessed via links at the end of each document (Dependencies and Named Entities).
This information can be used when searching the articles, for example to identify masculine animate citation sources.
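Outside the search interface, the same kind of filter can be sketched over UDPipe output. The example below assumes the morphology is available as a Universal Dependencies FEATS string (as in CoNLL-U), where Czech masculine animate nouns carry `Gender=Masc` and `Animacy=Anim`. The function names and the sample feature strings are illustrative assumptions; how citation sources are marked in an export is not covered here.

```python
def parse_feats(feats: str) -> dict:
    """Parse a CoNLL-U FEATS string such as 'Case=Nom|Gender=Masc' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def is_masculine_animate(feats: str) -> bool:
    """Return True if the features mark the token as masculine animate."""
    parsed = parse_feats(feats)
    return parsed.get("Gender") == "Masc" and parsed.get("Animacy") == "Anim"


# Illustrative feature strings for two words from the example sentence.
vedci = "Animacy=Anim|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos"
zahadu = "Case=Acc|Gender=Fem|Number=Sing|Polarity=Pos"
print(is_masculine_animate(vedci), is_masculine_animate(zahadu))
```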
Searching in the corpus
The corpus is searched using the CQL query language. Several illustrative example queries with descriptions can be found in the Search section.
GitHub
You can contact the authors with any questions or comments at zdroj@ufal.mff.cuni.cz, or you can open an issue on GitHub.
Publications
- Barbora Hladká, Jiří Mírovský, Matyáš Kopp, Václav Moravec. Annotating Attribution in Czech News Server Articles. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pages 1817–1823, Marseille, France, 20-25 June 2022.
Acknowledgement
The work on the corpus has been funded by the project Signál a šum v éře Žurnalistiky 5.0 - komparativní perspektiva novinářských žánrů automatizovaných obsahů. The manual annotation of the citations was organized as an annotation task for students attending the courses "Digital Communication and Sources" and "Ethics for Journalists" at the Faculty of Social Sciences, Charles University, so that they could practice selected theoretical journalistic concepts.