Corpus List
TEITOK is a versatile corpus platform that can handle many different types of corpora, potentially containing much more than just plain text. To make it easier to find relevant corpus data, the corpora in the TEITOK@LINDAT corpus list are listed with an indication of their (primary) characteristics, concerning both their status and the type of data they contain. Below is an overview of the distinctions made in the list.
Corpus Status
- Live Corpora - Live corpora are corpora that are still being modified. This includes monitor corpora that are periodically expanded with new files; searchable versions of corpora under development, typically generated automatically from their Git repository; and corpora that are maintained in TEITOK and get corrected, expanded, and modified over time. Live corpora always provide you with the most up-to-date version of the corpus, but as a result, search results in live corpora are not reproducible: the same query may return different results at a later date.
- Stable Corpora - Stable corpora are the more traditional, immutable type of corpora. This includes TEITOK corpora that were generated from external sources, mostly from items in the LINDAT repository, and stable snapshots of live corpora.
Corpus Type
- LRL Corpora (Less Resourced Languages) - LRL corpora are corpora whose objective is to provide corpus data for languages for which little or no other (annotated) corpus data exist. This includes corpora focussing on a single LRL, as well as multilingual corpora that include various LRLs (potentially alongside major languages).
- Parallel Corpora - Parallel corpora are corpora with multiple translations or versions of the same text that have been aligned at the level of their paragraphs or sentences. Parallel corpora are intended to allow you to compare the different versions.
- Treebanks - Treebanks are corpora that provide (typically) manually corrected syntactic analyses of sentences. Most treebanks in TEITOK are dependency treebanks, which specify syntactic relations between the words in a sentence, as opposed to constituency treebanks, which provide traditional syntactic trees.
- HTR Corpora (Handwritten Text Recognition) - HTR corpora are corpora that were generated from handwritten texts and are aligned with their original facsimile images.
- OCR Corpora (Optical Character Recognition) - OCR corpora are like HTR corpora, but are generated from printed rather than handwritten material.
- Learner Corpora - Learner corpora are corpora consisting of texts written by language learners, typically with an indication of the errors made by the learners as well as the corrected alternatives.
- Spoken Corpora - Spoken corpora are corpora created from spoken data, typically time-aligned with the original sound files.
- Parliamentary Corpora - Parliamentary corpora are corpora containing data from parliamentary texts, typically transcripts of parliamentary sessions.
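
The two dimensions above (status and type) can be pictured as labels attached to each entry in the list. The following is only an illustrative sketch, not TEITOK's own data model or API: the names `Status`, `CorpusType`, `CorpusEntry`, and `select`, as well as the example corpus names, are hypothetical, and it assumes that a corpus has exactly one status but may carry more than one type.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Corpus status as described above (hypothetical names)."""
    LIVE = "live"      # still being modified; searches are not reproducible
    STABLE = "stable"  # immutable; includes snapshots of live corpora


class CorpusType(Enum):
    """Corpus types as described above (hypothetical names)."""
    LRL = "lrl"
    PARALLEL = "parallel"
    TREEBANK = "treebank"
    HTR = "htr"
    OCR = "ocr"
    LEARNER = "learner"
    SPOKEN = "spoken"
    PARLIAMENTARY = "parliamentary"


@dataclass(frozen=True)
class CorpusEntry:
    """One row of a corpus list. Assumption: one status, any number of types."""
    name: str
    status: Status
    types: frozenset = frozenset()


def select(entries, status=None, corpus_type=None):
    """Return the entries matching the given status and/or type (None = any)."""
    return [
        entry
        for entry in entries
        if (status is None or entry.status is status)
        and (corpus_type is None or corpus_type in entry.types)
    ]


# Made-up entries, purely for illustration.
corpus_list = [
    CorpusEntry("example-monitor", Status.LIVE, frozenset({CorpusType.PARLIAMENTARY})),
    CorpusEntry("example-treebank", Status.STABLE,
                frozenset({CorpusType.TREEBANK, CorpusType.LRL})),
    CorpusEntry("example-letters", Status.STABLE, frozenset({CorpusType.HTR})),
]

# Only stable corpora give reproducible search results.
stable_names = [entry.name for entry in select(corpus_list, status=Status.STABLE)]
lrl_names = [entry.name for entry in select(corpus_list, corpus_type=CorpusType.LRL)]
```

In this sketch, restricting a selection to `Status.STABLE` mirrors the point made under Corpus Status: a query whose results need to be cited or repeated is better run against a stable corpus or a stable snapshot than against a live one.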