Corpus from the Aozora Bunko Library
Please use the following text to cite this item:
Rohacek, Jakub, 2024,
Corpus from the Aozora Bunko Library, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-5682.
Authors
Rohacek, Jakub
Item identifier
http://hdl.handle.net/11234/1-5682
Project URL
Date issued
2024
Size
85292491 words
Language(s)
Description
This corpus contains a subset of the texts available from the Aozora Bunko public library project, a collection of mostly older literature in Japanese. It was compiled from the project's official GitHub directory with a custom Python script written to meet specific requirements. The script excluded any text that is not currently in the public domain and organized the output into text files of approximately equal size. The files also carry an XML structure whose tags mark individual documents (books) and give basic bibliographic information: author, year, and title.
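The packing and tagging described above could look roughly like the following sketch. It is not the script used to build the corpus: the `<doc>` element, its `author`, `year`, and `title` attributes, and the function names `wrap_document` and `pack_documents` are all hypothetical, and the actual tag names should be checked against the files in the archive.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_document(text, author, year, title):
    """Wrap one work in a HYPOTHETICAL <doc> element with bibliographic attributes.

    The real corpus may use different tag and attribute names.
    """
    return (
        f"<doc author={quoteattr(author)} year={quoteattr(str(year))} "
        f"title={quoteattr(title)}>\n{escape(text)}\n</doc>\n"
    )


def pack_documents(docs, target_chars):
    """Greedily group wrapped documents into chunks of roughly target_chars characters.

    A document is never split, so a chunk can exceed the target when a
    single document is longer than target_chars.
    """
    chunks, current, size = [], [], 0
    for doc in docs:
        # Start a new chunk once adding this document would pass the target.
        if current and size + len(doc) > target_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(doc)
        size += len(doc)
    if current:
        chunks.append("".join(current))
    return chunks
```

Each chunk returned by `pack_documents` would then be written to its own text file. Greedy packing keeps every book intact, which is why the resulting files are only approximately the same size.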
Publisher
Subject(s)
Collections
This item is Publicly Available
and licensed under:
Files in this item
- Name: aozora_bunko.tar.gz
- Size: 162.34 MB
- Format: application/x-gzip
- Description: gzip Archive
- MD5: dafcf5de5472ea21b760d13b47c4a24d
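
The MD5 value listed above can be used to check that a download of the archive is intact. A minimal sketch, assuming the archive has been saved locally under its listed name; the function names `md5_of_file` and `verify_download` are illustrative, not part of any LINDAT tooling:

```python
import hashlib

# MD5 checksum as published on the item page.
EXPECTED_MD5 = "dafcf5de5472ea21b760d13b47c4a24d"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("aozora_bunko.tar.gz")` should return `True` for an uncorrupted copy. MD5 guards against accidental corruption in transit; it is not a safeguard against deliberate tampering.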


