dc.contributor.author | Rohacek, Jakub |
dc.date.accessioned | 2025-02-03T10:50:20Z |
dc.date.available | 2025-02-03T10:50:20Z |
dc.date.issued | 2024 |
dc.identifier.uri | http://hdl.handle.net/11234/1-5682 |
dc.description | This corpus contains a subset of the texts available from the Aozora Bunko public library project, which hosts works of mostly older Japanese-language literature. A custom Python script compiled the corpus from the project's official GitHub repository to fit specific requirements: it excluded any text that is not currently freely available in the public domain and organized the output into text files of approximately equal size. The files use an XML structure in which <doc> tags mark individual documents (books) and provide basic bibliographic information about each one: author, year, and title. |
dc.language.iso | jpn |
dc.publisher | Masaryk University, NLP Centre |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ |
dc.source.uri | https://nlp.fi.muni.cz/projekty/aozora |
dc.subject | Aozora |
dc.subject | Bunko |
dc.subject | Corpus |
dc.subject | Japanese |
dc.subject | Literature |
dc.title | Corpus from the Aozora Bunko Library |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Jakub Rohacek 514220@mail.muni.cz Masaryk University, NLP Centre |
size.info | 85292491 words |
files.size | 170227941 |
files.count | 1 |
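
The description above says each book is wrapped in a <doc> tag that carries its bibliographic information. The following Python sketch shows one way such files might be read. It rests on assumptions that this record does not confirm: that each opening <doc ...> tag and each closing </doc> tag sits on its own line, that the metadata is stored as double-quoted attributes, and that the attribute names are `author`, `year`, and `title`. The function name `iter_docs` is hypothetical. A line-based scan is used instead of a full XML parser because the record does not state whether each file has a single root element.

```python
import re

# Assumed layout: <doc author="..." year="..." title="..."> on its own line,
# followed by the text of the book, followed by </doc> on its own line.
DOC_OPEN = re.compile(r"<doc\b([^>]*)>")
ATTRIBUTE = re.compile(r'([\w.-]+)="([^"]*)"')


def iter_docs(lines):
    """Yield (attributes, text) for every <doc ...> ... </doc> block in lines."""
    attributes, body = None, []
    for line in lines:
        stripped = line.strip()
        opening = DOC_OPEN.match(stripped)
        if opening:
            # Start of a new document: collect its attributes into a dict.
            attributes = dict(ATTRIBUTE.findall(opening.group(1)))
            body = []
        elif stripped == "</doc>":
            # End of the current document: emit it and reset the state.
            if attributes is not None:
                yield attributes, "\n".join(body)
            attributes = None
        elif attributes is not None:
            # Ordinary text line inside a document.
            body.append(line.rstrip("\n"))


if __name__ == "__main__":
    # Made-up sample, not taken from the corpus.
    sample = [
        '<doc author="Example Author" year="1900" title="Example Title">\n',
        "first line\n",
        "second line\n",
        "</doc>\n",
    ]
    for attrs, text in iter_docs(sample):
        print(attrs.get("title"), len(text))
```

Because `iter_docs` accepts any iterable of lines, it could also be given an open file handle for one of the extracted text files, which avoids loading a whole file into memory. The file encoding is not stated in this record and would need to be checked first.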
Files in this item
- Name: aozora_bunko.tar.gz
- Size: 162.34 MB
- Format: application/x-gzip
- Description: Aozora Bunko Corpus
- MD5: dafcf5de5472ea21b760d13b47c4a24d
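
The MD5 value listed for the archive can be used to check that a download completed without corruption. The sketch below is a minimal illustration that uses only Python's standard `hashlib` module. The helper names `md5_of_file` and `verify` are hypothetical, and the local path in the usage comment assumes the archive was saved under its listed name in the current directory.

```python
import hashlib

# Checksum copied from the file listing in this record.
EXPECTED_MD5 = "dafcf5de5472ea21b760d13b47c4a24d"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


# Usage (assumed local file name):
#     verify("aozora_bunko.tar.gz")
```

An MD5 match only guards against accidental corruption during transfer; it is not a strong guarantee against deliberate tampering.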