Objectives and Content of MuNeCo

MuNeCo provides comparable corpus data of substantial size for a wide range of languages, with a target of at least 1 million words per language. To this end, it harvests newspaper articles in as many languages as possible. Newspapers are among the most widely available and reliable sources of data for many smaller languages, because creating an online newspaper in a local language is often part of a language revival or reinforcement effort. Furthermore, newspaper articles tend to use general language across a number of different domains, which makes them relatively representative of a language. Apart from providing online resources, MuNeCo also functions as permanent storage for newspaper articles, since online newspapers in local languages are unfortunately often short-lived.

Apart from the language of the article, the corpus also keeps track of the country of origin of the newspaper, in order to indicate potential local dialects. For languages that are spoken across multiple countries, the corpus attempts to include newspapers from several of those countries where available. In principle, newspapers were selected from countries where the language is an official language, with a preference for newspapers that are least likely to publish translations of articles written in other languages. Only in cases where local newspapers were hard to come by did we resort to international publications such as the Voice of America.

Due to copyright concerns, the corpus data in MuNeCo can be searched, but the context of a search result is not shown in MuNeCo. Instead, a link to the original newspaper article is provided for the full context. In cases where the original is no longer available, because the newspaper removed the article, changed its URL, or disappeared completely, a limited context is shown inside MuNeCo.
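The display rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual MuNeCo implementation: the `Hit` record, its field names, the `source_online` flag, and the context window of five words are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One search result (hypothetical structure, for illustration only)."""
    match: str           # the matched word or phrase
    left: str            # stored text to the left of the match
    right: str           # stored text to the right of the match
    source_url: str      # URL of the original newspaper article
    source_online: bool  # assumed flag: is the original still reachable?


def display(hit: Hit, window: int = 5) -> dict:
    """Return what a results page would show for one hit."""
    if hit.source_online:
        # Original still available: show no context, only a link to it.
        return {"match": hit.match, "context": None, "link": hit.source_url}
    # Original gone: fall back to a limited context of `window` words per side.
    left = " ".join(hit.left.split()[-window:])
    right = " ".join(hit.right.split()[:window])
    return {"match": hit.match, "context": f"{left} [{hit.match}] {right}", "link": None}
```

Keeping the fallback window small reflects the copyright motivation: only a short snippet is exposed, and only when the original can no longer be consulted.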

Wherever possible, the corpus data in MuNeCo are linguistically annotated with part-of-speech tags, lemmas, and dependency relations, all preferably in the Universal Dependencies framework. The parser or tagger used for each language is listed in the language description. We will gradually provide linguistic annotation for additional languages as annotation tools become available.
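To give an impression of what such annotation looks like, the sketch below reads a sentence in CoNLL-U, the standard ten-column exchange format of Universal Dependencies. It is an assumption that annotated MuNeCo data would be available in this format; the sample sentence and the function name `read_conllu` are invented for illustration.

```python
# A made-up sample sentence in CoNLL-U (ten tab-separated columns per token).
SAMPLE = """\
# text = The dog barks.
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tbarks\tbark\tVERB\t_\t_\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        # Keep plain word lines only; skip multiword ranges ("1-2")
        # and empty nodes ("1.1"), whose IDs are not plain integers.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        current.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences
```

Each token thus carries the three annotation layers mentioned above: a part-of-speech tag (`upos`), a lemma, and a dependency relation (`head` plus `deprel`).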

The articles in this corpus were automatically harvested and processed. The language of each article is identified on the basis of the language of the newspaper as a whole or of a language-specific section. Languages are identified using the official ISO 639-3 codes and names. It is possible that there are incomplete articles, articles that are not actually in the same language as the rest of the newspaper, or newspapers whose language was not correctly identified, especially in cases where the newspaper is written in what is linguistically considered a macrolanguage.

If there are annotation tools we missed, or newspapers in languages we missed, we would be grateful for any contributions. Equally, if the language of an article is incorrectly identified, or there are other problems with the source, we would appreciate the feedback.