Posts of German PC Games Online Forum
Please use the following text to cite this item or export to a predefined format:
Kissling, Jürg, 2020,
Posts of German PC Games Online Forum, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-3293.
Authors
Item identifier
Demo URL
Date issued
2020-02
Size
126104641 tokens
Language(s)
Description
Contains linguistically annotated data from the online forum PC Games (https://forum.pcgames.de), a forum devoted to gaming. All posts (approx. 2.4 million) were scraped in April 2019 (for details see Kissling 2019), resulting in about 120 million tokens from almost 70,000 authors. The data is stored in an SQL database and can be accessed using e.g. pg_restore. The database itself and its tables contain detailed self-descriptions.
In this database you will find tokenized, part-of-speech-tagged and partly lemmatized information for every token in the forum, together with its metadata (author and location in the forum structure, e.g. which post(s), thread and subforum it belongs to). The order of the words in a post cannot be reconstructed with this corpus. Usernames were replaced with author_ids to protect the personal rights of the post authors.
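The pg_restore access mentioned above could be scripted roughly as follows. This is a minimal sketch, not the author's documented procedure: the database name pcgames_corpus is hypothetical, the target database is assumed to exist already, and the dump is assumed to be in an archive format that pg_restore accepts. If the file turns out to be a plain-text SQL script, it would have to be loaded with psql instead.

```python
import subprocess


def build_restore_command(dump_path: str, dbname: str) -> list:
    """Return the pg_restore call as an argument list (no shell involved)."""
    return [
        "pg_restore",
        "--no-owner",            # do not try to recreate the original object owners
        f"--dbname={dbname}",    # restore directly into this existing database
        dump_path,
    ]


def restore(dump_path: str, dbname: str) -> None:
    """Run pg_restore and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_restore_command(dump_path, dbname), check=True)


if __name__ == "__main__":
    # "pcgames_corpus" is a hypothetical database name chosen for illustration.
    restore("posts_German_PC_Games_online_forum.sql", "pcgames_corpus")
```

Building the command as a list keeps file names with spaces safe and makes the call easy to inspect before running it.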
Additional information:
As this corpus was analyzed in terms of productivity and language contact between German and English (Kissling 2020), it includes additional information about German base forms found in present-day English, mainly focusing on the formula "German_verb_stem + -en = English verb infinitive". The API of the Oxford Dictionary of English was used for this purpose, and the results of the API requests are stored in the table infinitives. The corpus can also be used without this information.
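One possible reading of the formula above is that an English lookup candidate is obtained by removing the ending -en from a German infinitive. The sketch below illustrates only that reading. The function name and the example words are assumptions made for illustration, the actual procedure in Kissling (2020) may differ, and no dictionary API is called here.

```python
from typing import Optional


def english_candidate(german_infinitive: str) -> Optional[str]:
    """Strip a final "-en" to get a candidate English base form.

    Returns None when the word does not end in "-en" or nothing would
    remain after stripping, so such forms are simply skipped.
    """
    word = german_infinitive.strip().lower()
    if len(word) > 2 and word.endswith("en"):
        return word[:-2]
    return None


if __name__ == "__main__":
    # Illustrative inputs only; they are not taken from the corpus tables.
    for lemma in ["downloaden", "chillen", "googeln"]:
        print(lemma, "->", english_candidate(lemma))
```

Infinitives ending in -n only (such as "googeln") are not covered by this simple rule and return None.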
Calculations were performed at sciCORE (http://scicore.unibas.ch/), the scientific computing core facility of the University of Basel, on 2019-09-10. This database contains the entire primary corpus of Kissling (2020).
Sources:
Kissling, J. (2019). Computerunterstütztes Verfahren zur Erhebung eigener Textkorpus-Daten. Methodenentwicklung und Anwendung auf 2.4 Mio. Posts des Forums PC Games.de [certification thesis]. Universität Basel.
Kissling, J. (2020). Produktivität englischer Verben im Deutschen [master's thesis]. Universität Basel.
The scraper used is available on GitHub: https://github.com/vizzerdrix55/web-scraping-vBulletin-forum
Publisher
Subject(s)
Collections
This item is Publicly Available
and licensed under:
Files in this item
- Name
- 0_README.txt
- Size
- 4.34 KB
- Format
- text/plain
- Description
- Description of the SQL file
- MD5
- 0077e7229f1f119f44406e95ffb203ec

This SQL dump contains linguistically annotated data from the online forum PC Games (https://forum.pcgames.de). All posts (approx. 2.4 million) were scraped in April 2019 (for details see Kissling 2019 and the GitHub URL below), resulting in about 120 million tokens from almost 70,000 authors. In this database you will find tokenized, part-of-speech-tagged and partly lemmatized information on the posts, plus metadata such as the authors and the location of each post in the forum structure. Lastly, the table infinitives holds the results of the API requests made to the Oxford Dictionary of English. The order of the words in a post cannot be reconstructed with this database. Usernames were replaced with author_ids.

Additional information: As this corpus was analyzed in terms of productivity and language contact between German and English (Kissling 2020), it includes additional information about German base forms found in present-day English, mainly focusing on the formula "German_verb_stem + -en = English verb infinitive". The API of the Oxford Dictionary of English was used for this purpose, and the results of the API requests are stored in the table infinitives. The corpus can also be used without this information.

The scraper is available on GitHub: https://github.com/vizzerdrix55/web-scraping-vBulletin-forum

Calculations were performed at sciCORE (http://scicore.unibas.ch/), the scientific computing core facility of the University of Basel, on 2019-09-10. This database contains the entire primary corpus.

All tables of the database are described in the following lines:
- authors: Contains a list of all author_ids.
- forum_has_subforums: The two columns in this table can be used to assign each subforum to its forum. Use it (via a join) e.g. in combination with the tables forums and subforums.
- forums: Contains the names of the forums.
- infinitives: Contains the ID (inf_id), the German infinitives (lemma), the lemma to be requested in the English Oxford Dictionary ( . . .
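The join suggested for forum_has_subforums can be illustrated with a small stand-alone sketch. Everything except the three table names is an assumption: the column names (forum_id, forum_name, subforum_id, subforum_name) and the toy rows are hypothetical, and an in-memory SQLite database stands in for the restored PostgreSQL database so that the example runs on its own. The real column names should be taken from the self-descriptions inside the database.

```python
import sqlite3

# Hypothetical schema and toy rows; only the table names come from the README.
SETUP = """
CREATE TABLE forums (forum_id INTEGER PRIMARY KEY, forum_name TEXT);
CREATE TABLE subforums (subforum_id INTEGER PRIMARY KEY, subforum_name TEXT);
CREATE TABLE forum_has_subforums (forum_id INTEGER, subforum_id INTEGER);
INSERT INTO forums VALUES (1, 'Hardware'), (2, 'Games');
INSERT INTO subforums VALUES (10, 'Graphics cards'), (11, 'Action'), (12, 'Strategy');
INSERT INTO forum_has_subforums VALUES (1, 10), (2, 11), (2, 12);
"""

# The link table sits between forums and subforums and is joined to both.
JOIN_QUERY = """
SELECT f.forum_name, s.subforum_name
FROM forum_has_subforums AS fhs
JOIN forums AS f ON f.forum_id = fhs.forum_id
JOIN subforums AS s ON s.subforum_id = fhs.subforum_id
ORDER BY f.forum_name, s.subforum_name
"""


def demo_join() -> list:
    """Build the toy database and return (forum, subforum) pairs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SETUP)
        return conn.execute(JOIN_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for forum, subforum in demo_join():
        print(forum, "/", subforum)
```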
- Name
- posts_German_PC_Games_online_forum.sql
- Size
- 437.35 MB
- Format
- application/octet-stream
- Description
- SQL dump of the corpus
- MD5
- 92d4d397befe0d9e552ef421c5eb6018
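The MD5 values listed for the two files can be used to check a download for corruption. The following is a minimal sketch that assumes both files were saved under their listed names in the current directory. The helper names are chosen for illustration, and the file is read in chunks so that the 437 MB dump does not have to fit into memory.

```python
import hashlib
from pathlib import Path

# Checksums as given in the file listing of this item.
EXPECTED_MD5 = {
    "0_README.txt": "0077e7229f1f119f44406e95ffb203ec",
    "posts_German_PC_Games_online_forum.sql": "92d4d397befe0d9e552ef421c5eb6018",
}


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path) -> bool:
    """Compare a downloaded file against the checksum listed for its name."""
    return md5_of_file(path) == EXPECTED_MD5[Path(path).name]


if __name__ == "__main__":
    for name in EXPECTED_MD5:
        if Path(name).exists():
            print(name, "OK" if matches_listing(name) else "MISMATCH")
        else:
            print(name, "not found")
```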


