This RESTful service allows to define a sub-corpus from different annotated corpora. The service includes a POS tag harmonisation process where original tags are converted to EAGLES/Parole format. The eventual sub-corpus is indexed using the IMS CWB tool. The user receives an ID which can be used by the CQP service to exploit the sub-corpus.
This RESTful service accesses part of the Hemeroteca Digital de l’Arxiu Municipal de Girona (digital press archive from the Girona city council), specifically Catalan press from 2003. The service uses the SRU protocol.
This corpus was originally created for performance testing (server infrastructure CorpusExplorer - see: diskurslinguistik.net / diskursmonitor.de). It includes the filtered database (German texts only) of CommonCrawl (as of March 2018). First, the URLs were filtered according to their top-level domain (de, at, ch). Then the texts were classified using NTextCat and only uniquely German texts were included in the corpus. The texts were then annotated using TreeTagger (token, lemma, part-of-speech). 2.58 million documents - 232.87 million sentences - 3.021 billion tokens. You can use CorpusExplorer (http://hdl.handle.net/11234/1-2634) to convert this data into various other corpus formats (XML, JSON, Weblicht, TXM and many more).
Search engine for the neologisms database of the NEOROM network. The network collects neologisms used in the press written in Romance languages from 2005 onwards.