Multilingual Newspaper Corpus

LINDAT Newspaper Articles in Many Languages since 2016

MuNeCo is a corpus that presents newspaper articles in a range of different languages, treated as a linguistic corpus. Although no languages are excluded, it is specifically aimed at those languages for which few or no corpus data are available.

New languages are added over time. A list of all the newspapers currently treated can be found in the newspapers section.

All texts in this corpus are harvested from publicly available online newspapers, and each text is accompanied by the URL it was originally harvested from. The copyright of all texts lies with the newspaper, which is why MuNeCo does not show the context of search results but instead links to the original newspaper article (where still available).

 

TEITOK

MuNeCo is built on the TEITOK framework. TEITOK is a framework for the development and distribution of corpora in which each article is stored as a separate XML file, ideally in TEI/XML. Corpus documents in TEITOK consist not only of words but also include the original typesetting of the text.
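
As an illustration of the one-file-per-article idea, here is a minimal Python sketch that reads a single article and returns its source URL and its tokens. The sample document is invented for this example: the `tok` element name, the `id` values, the `ref` element holding the URL and the `hi rend="bold"` typesetting markup are assumptions, not the actual MuNeCo file layout.

```python
import xml.etree.ElementTree as ET

# Invented sample article; element and attribute names are assumptions.
SAMPLE = """<TEI>
  <teiHeader>
    <fileDesc>
      <sourceDesc>
        <bibl><ref target="https://example.org/news/article-1">source</ref></bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p><hi rend="bold"><tok id="w-1">Breaking</tok> <tok id="w-2">news</tok></hi><tok id="w-3">:</tok> <tok id="w-4">rain</tok><tok id="w-5">.</tok></p>
    </body>
  </text>
</TEI>"""


def read_article(xml_text):
    """Return (source_url, [(token_id, token_text), ...]) for one article."""
    root = ET.fromstring(xml_text)
    # The first <ref> is assumed to carry the URL the text was harvested from.
    ref = root.find(".//ref")
    url = ref.get("target") if ref is not None else None
    # Tokens are collected in document order, regardless of typesetting markup.
    tokens = [(tok.get("id"), "".join(tok.itertext())) for tok in root.iter("tok")]
    return url, tokens


url, tokens = read_article(SAMPLE)
for token_id, text in tokens:
    print(token_id, text)
```

Because the tokens are inline elements, markup such as the bold span can wrap them without affecting the token sequence, which is how words and typesetting can coexist in one file.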

The TEITOK system builds a searchable corpus out of all the individual XML files using the Corpus Workbench. To make searches faster, each language is kept as a separate corpus.
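
Since each language is a separate corpus, a query always targets one language at a time. The sketch below builds a CQP command (CQP is the query processor of the Corpus Workbench) for a given language. The corpus naming scheme `MUNECO_<LANG>` and the attribute names `lemma` and `upos` are hypothetical; the real corpus identifiers and attributes may differ.

```python
import re


def corpus_id(lang_code):
    """Hypothetical naming scheme: one CWB corpus per language code."""
    return "MUNECO_" + lang_code.upper()


def cqp_token(**constraints):
    """Build a single-token CQP pattern such as [lemma="go" & upos="VERB"]."""
    parts = []
    for attr, value in constraints.items():
        # CQP values are regular expressions, so escape the value to match it
        # literally (assumes Python 3.7+ behaviour of re.escape), then escape
        # double quotes so they do not end the CQP string early.
        literal = re.escape(value).replace('"', r'\"')
        parts.append(f'{attr}="{literal}"')
    return "[" + " & ".join(parts) + "]"


def cqp_command(lang_code, **constraints):
    """Activate the corpus for one language, then run a one-token query."""
    return f"{corpus_id(lang_code)}; {cqp_token(**constraints)};"


print(cqp_command("cs", lemma="noviny"))
print(cqp_command("en", lemma="go", upos="VERB"))
```

Searching several languages under this scheme means issuing one such command per corpus and merging the results afterwards.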

Linguistic Annotation

Where possible, all documents in MuNeCo have been automatically enriched with linguistic annotations in the Universal Dependencies format. The description page of each language specifies whether and how the documents in that language were annotated. The primary source of annotation is UDPipe, which provides tagging and parsing for a large number of languages for which a UD treebank exists. For languages with no UDPipe model, other NLP tools were used when available. The NLP tools used for each language are listed on the language pages.
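
To show what Universal Dependencies annotation contains, the following Python sketch parses one sentence in CoNLL-U, the tab-separated exchange format used by UD, with ten columns per token. The sample sentence and its analysis are invented for illustration, and the sketch makes no claim about how MuNeCo stores these annotations inside its XML files.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Invented sample rows; joined with tabs below to form valid CoNLL-U lines.
ROWS = [
    ["1", "The", "the", "DET", "_", "Definite=Def|PronType=Art", "2", "det", "_", "_"],
    ["2", "dog", "dog", "NOUN", "_", "Number=Sing", "3", "nsubj", "_", "_"],
    ["3", "barks", "bark", "VERB", "_", "Number=Sing|Person=3|Tense=Pres", "0", "root", "_", "SpaceAfter=No"],
    ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
]
SAMPLE = "# text = The dog barks.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n"


def parse_conllu(text):
    """Return the word-level tokens of a CoNLL-U string as a list of dicts."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError("expected 10 tab-separated columns: " + line)
        token = dict(zip(FIELDS, columns))
        if "-" in token["id"] or "." in token["id"]:
            continue  # skip multiword-token ranges and empty nodes
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])
        tokens.append(token)
    return tokens


for tok in parse_conllu(SAMPLE):
    print(tok["id"], tok["form"], tok["lemma"], tok["upos"], tok["head"], tok["deprel"])
```

Each token carries its lemma, its universal part-of-speech tag, and a link to its syntactic head; a head of 0 marks the root of the sentence.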