Please use the following text to cite this item:
Rohacek, Jakub, 2024, Corpus from the Aozora Bunko Library, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-5682.
dc.contributor.author: Rohacek, Jakub
dc.date.accessioned: 2025-02-03T10:50:20Z
dc.date.available: 2025-02-03T10:50:20Z
dc.date.issued: 2024
dc.description: This corpus contains a subset of the texts available from the Aozora Bunko public library project, a collection of mostly older literature in Japanese. A custom Python script compiled the corpus from the project's official GitHub repository to fit specific requirements: it excluded any text that is not currently freely available in the public domain and split the output into text files of approximately equal size. The files use XML tags to mark individual documents (books) and to record basic bibliographic information about each one: author, year, and title.
dc.identifier.uri: http://hdl.handle.net/11234/1-5682
dc.language.iso: jpn
dc.publisher: Masaryk University, NLP Centre
dc.rights: Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.label: PUB
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.source.uri: https://nlp.fi.muni.cz/projekty/aozora
dc.subject: Aozora
dc.subject: Bunko
dc.subject: Corpus
dc.subject: Japanese
dc.subject: Literature
dc.title: Corpus from the Aozora Bunko Library
dc.type: corpus
local.branding: LINDAT / CLARIAH-CZ
local.contact.person: Jakub Rohacek 514220@mail.muni.cz Masaryk University, NLP Centre
local.files.count: 1
local.files.size: 170227941
local.has.files: yes
local.language.name: Japanese
local.size.info: 85292491 words
metashare.ResourceInfo#ContentInfo.mediaType: text
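The description says that XML tags mark each document and carry its author, year, and title, but the record does not name the tags. The following is a minimal sketch of how such headers might be read, assuming a hypothetical opening tag of the form `<doc author="..." title="..." year="...">`; the tag name, the attribute names, and the placeholder sample lines are all assumptions and should be checked against the actual files.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Iterator

# ASSUMPTION: documents open with <doc author="..." title="..." year="...">.
# The record does not state the real tag or attribute names.
DOC_OPEN = re.compile(r"<doc\b([^>]*)>")
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def iter_doc_headers(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield the attributes of every (assumed) opening document tag."""
    for line in lines:
        match = DOC_OPEN.search(line)
        if match:
            yield dict(ATTRIBUTE.findall(match.group(1)))


def count_by_author(lines: Iterable[str]) -> Counter:
    """Count documents per author; a missing attribute counts under ''."""
    return Counter(h.get("author", "") for h in iter_doc_headers(lines))


# Placeholder input in the assumed layout (not taken from the corpus).
sample = [
    '<doc author="Author A" title="Title 1" year="1900">',
    "text of the first document",
    "</doc>",
    '<doc author="Author B" title="Title 2" year="1910">',
    "text of the second document",
    "</doc>",
    '<doc author="Author A" title="Title 3" year="1920">',
    "text of the third document",
    "</doc>",
]

counts = count_by_author(sample)
```

Because the functions consume any iterable of lines, an open file handle can be passed in directly, so a file of around 100 MB is scanned line by line rather than loaded into memory. The file encoding is not stated in the record and would need to be confirmed before opening the files.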
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
 Files in this item
Name: aozora_bunko.tar.gz
Size: 162.34 MB
Format: application/x-gzip
Description: Aozora Bunko Corpus
MD5: dafcf5de5472ea21b760d13b47c4a24d
Preview
  File Preview
    • part3.txt (101 MB)
    • part5.txt (54 MB)
    • part2.txt (114 MB)
    • part4.txt (110 MB)
    • part1.txt (102 MB)
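
The record lists an MD5 checksum and a file size for aozora_bunko.tar.gz, so a download can be checked before it is unpacked. The sketch below shows one way to do this with Python's standard library. It assumes that the `local.files.size` value (170227941) is the archive size in bytes, which is consistent with the listed 162.34 MB when a megabyte is taken as 2^20 bytes; the function names are illustrative only.

```python
import hashlib
import os

# Values copied from the item record.
EXPECTED_MD5 = "dafcf5de5472ea21b760d13b47c4a24d"
# ASSUMPTION: local.files.size is the size of the single archive in bytes.
EXPECTED_SIZE = 170227941


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(
    path: str,
    expected_md5: str = EXPECTED_MD5,
    expected_size: int = EXPECTED_SIZE,
) -> bool:
    """Check the file size first (cheap), then the MD5 digest."""
    if os.path.getsize(path) != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()
```

Calling `verify_download("aozora_bunko.tar.gz")` on a local copy returns `True` only when both the size and the digest match the record. MD5 is adequate here for detecting a corrupted or truncated download, though it is not a safeguard against deliberate tampering.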