Automatically Annotated Corpora with Stanza and UDPipe for Czech, English, and Greek

Please use the following text to cite this item:
Diamantopoulos, Konstantinos, 2026, Automatically Annotated Corpora with Stanza and UDPipe for Czech, English, and Greek, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-6120.
Date issued
2026-03-04
Size
30,000,000 sentences
Description
This resource contains six automatically annotated corpora derived from the Leipzig Corpora Collection, covering three languages: Czech, English, and Greek. For each language, two corpora are provided: one annotated with Stanza and one annotated with UDPipe.
Acknowledgement
This item is Publicly Available
and licensed under:
 Files in this item
Name
corpora.tar.gz
Size
6.32 GB
Format
application/x-gzip
Description
MD5
4588f035baf7271cf4afb18f38ad9ecb
File Preview
    • english_corpus_stanza_complete.conllu (5 GB)
    • greek_corpus_stanza_complete.conllu (8 GB)
    • czech_corpus_udpipe_complete.conllu (7 GB)
    • english_corpus_udpipe_complete.conllu (6 GB)
    • greek_corpus_udpipe_complete.conllu (8 GB)
    • ._greek_corpus_stanza_complete.conllu (176 B)
    • czech_corpus_stanza_complete.conllu (7 GB)
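Because the archive is several gigabytes, it may be worth checking the download against the MD5 value listed above before extracting it. The sketch below is one way to do that with Python's standard hashlib module; the function name md5_of_file and the chunk size are illustrative choices, and the local path to the archive is an assumption about where the file was saved.

```python
import hashlib

# MD5 value as listed for corpora.tar.gz on the item page.
EXPECTED_MD5 = "4588f035baf7271cf4afb18f38ad9ecb"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Assuming the archive was saved as corpora.tar.gz in the current directory, md5_of_file("corpora.tar.gz") == EXPECTED_MD5 should evaluate to True for an intact download.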