This SQL dump contains linguistically annotated data from the online forum PC Games (https://forum.pcgames.de). All posts (approx. 2.4 million) were scraped in April 2019 (for details see Kissling 2019 and the GitHub URL below), resulting in 120 million tokens by almost 70,000 authors.
The database holds tokenized, part-of-speech-tagged and partly lemmatized information on the posts, plus metadata such as the authors and the location of each post in the forum structure. In addition, the table infinitives holds the results of the API requests made to the Oxford Dictionary of English.
The order of the words in a post cannot be reconstructed with this database. Usernames were replaced with author_ids.

Additional information:
As this corpus was analyzed in terms of productivity and language contact between German and English (Kissling 2020), it contains additional information about German base forms found in present-day English, mainly focussing on the formula "German_verb_stem + -en = English verb infinitive". The API of the Oxford Dictionary of English was used for this purpose, and the results of the API requests are stored in the table infinitives. The corpus can also be used without this information.
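
One possible reading of the formula above is that a German infinitive, once its final -en is removed, yields a candidate English verb infinitive that can then be looked up in a dictionary. The following Python sketch illustrates only that reading. The function name english_candidate is hypothetical, the example lemmas are invented, and nothing here is confirmed to be how the column od in the table infinitives was actually derived.

```python
from typing import Optional


def english_candidate(german_infinitive: str) -> Optional[str]:
    """Strip a final '-en' from a German infinitive (assumed reading of the formula).

    Returns None when the lemma does not end in '-en', e.g. verbs in '-eln'.
    """
    lemma = german_infinitive.strip().lower()
    if lemma.endswith("en") and len(lemma) > 2:
        return lemma[:-2]
    return None


# Invented example lemmas, not taken from the corpus.
candidates = {
    lemma: english_candidate(lemma)
    for lemma in ["downloaden", "chatten", "googeln"]
}
```

Such a naive rule has obvious limits: "chatten" yields "chatt" rather than "chat", and "googeln" yields no candidate at all, so any real lookup would need further normalization.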

The scraper is available on GitHub: https://github.com/vizzerdrix55/web-scraping-vBulletin-forum

Calculations were performed at the sciCORE (http://scicore.unibas.ch/) scientific computing core facility at the University of Basel on 2019-09-10. This database contains all of the primary corpus.

All tables of the database are described in the following lines:
- authors: Contains a list of all author_ids.
- forum_has_subforums: The two columns in this table assign subforums to forums. Join it, e.g., with the tables forums and subforums.
- forums: Contains the names of the forums.
- infinitives: Contains the ID (inf_id), the German infinitive (lemma), the lemma requested from the Oxford Dictionary of English (od), and the date of the API request (od_date).
- post_has_words: The two columns in this table assign words to posts. Join it, e.g., with the tables words and posts. Be aware that the word order of a post cannot be reconstructed in most cases.
- posts: Contains information on every post: the ID of the post (post_id), the thread it belongs to (thread_id), the author (author_id) and the title, if any (post_title).
- subforum_has_threads: The two columns in this table assign threads to subforums. Join it, e.g., with the tables threads and subforums.
- subforums: Contains the names of the subforums.
- threads: Contains the information about the thread_id and title of every thread's page. Normally a thread has more than one page, so the first page will have thread name 'xxx-1' and thread_page 1. The second page has name 'xxx-2' and thread_page 2 and so on.
- words: Contains information on every word: its ID (word_id), the word itself (column 'word'; tokenizer: SoMaJo as described in Proisl & Uhrig, 2016) and the part-of-speech tag (tagger: SoMeWeTa as described in Proisl, 2018; tagset: STTS_IBK as described in Beißwenger et al., 2015). If the word is a verb, the column lemma contains its infinitive.
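
The joins suggested in the table descriptions can be sketched as follows. This is a minimal Python example that uses an in-memory SQLite database as a stand-in for the imported dump. The column names of post_has_words (post_id, word_id) and the column name pos in words are assumptions, since the description does not name them, and the inserted rows are invented toy data. The helper verb_lemma_counts is hypothetical.

```python
import sqlite3

# In-memory stand-in for the imported dump. Assumed names (not confirmed by
# the description): post_has_words.post_id, post_has_words.word_id, words.pos.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (post_id INTEGER PRIMARY KEY, thread_id INTEGER,
                    author_id INTEGER, post_title TEXT);
CREATE TABLE words (word_id INTEGER PRIMARY KEY, word TEXT,
                    pos TEXT, lemma TEXT);
CREATE TABLE post_has_words (post_id INTEGER, word_id INTEGER);
INSERT INTO posts VALUES (1, 10, 100, 'Hallo'), (2, 10, 101, NULL);
INSERT INTO words VALUES (1, 'downloade', 'VVFIN', 'downloaden'),
                         (2, 'Spiel', 'NN', NULL),
                         (3, 'gedownloadet', 'VVPP', 'downloaden');
INSERT INTO post_has_words VALUES (1, 1), (1, 2), (2, 3), (2, 2);
""")


def verb_lemma_counts(connection):
    """Count linked rows per author and verb lemma via the linking table."""
    query = """
        SELECT p.author_id, w.lemma, COUNT(*) AS n
        FROM posts AS p
        JOIN post_has_words AS phw ON phw.post_id = p.post_id
        JOIN words AS w ON w.word_id = phw.word_id
        WHERE w.lemma IS NOT NULL  -- only verbs carry a lemma
        GROUP BY p.author_id, w.lemma
        ORDER BY p.author_id, w.lemma
    """
    return connection.execute(query).fetchall()


counts = verb_lemma_counts(conn)
```

The same pattern should carry over to the other linking tables (forum_has_subforums, subforum_has_threads), again assuming their two columns are the IDs of the tables they connect.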

Sources:
Beißwenger, M., Bartz, T., Storrer, A., & Westpfahl, S. (2015). EmpiriST 2015: Tagset and guidelines for the PoS tagging of language data from genres of computer-mediated communication / social media. Empirikom-Network.

Kissling, Jürg (2019, April 15). Computerunterstütztes Verfahren zur Erhebung eigener Textkorpus-Daten. Methodenentwicklung und Anwendung auf 2.4 Mio. Posts des Forums PC Games.de (Zertifizierungsarbeit). Universität Basel, Basel.

Kissling, J. (2020). Produktivität englischer Verben im Deutschen [master's thesis]. Universität Basel.

Proisl, T. (2018). SoMeWeTa: A part-of-speech tagger for German social media and web texts. In European Language Resources Association (ELRA) (Ed.), Proceedings of the 11th Language Resources and Evaluation Conference (pp. 665–670). European Language Resources Association. https://www.aclweb.org/anthology/L18-1106

Proisl, T., & Uhrig, P. (2016). SoMaJo: State-of-the-art tokenization for German web and social media texts. Proceedings of the 10th Web as Corpus Workshop, 57–62. https://doi.org/10.18653/v1/W16-2607