dc.contributor.author Rohacek, Jakub
dc.date.accessioned 2025-02-03T10:50:20Z
dc.date.available 2025-02-03T10:50:20Z
dc.date.issued 2024
dc.identifier.uri http://hdl.handle.net/11234/1-5682
dc.description This corpus contains a subset of the texts available from the Aozora Bunko public library project, a collection of mostly older Japanese-language literature. A custom Python script compiled the corpus from the project's official GitHub repository to meet specific requirements: it excluded every text that is not currently in the public domain and split the output into text files of approximately equal size. The files use an XML structure in which <doc> tags delimit individual documents (books) and provide basic bibliographic information: author, year, and title.
dc.language.iso jpn
dc.publisher Masaryk University, NLP Centre
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri http://creativecommons.org/licenses/by/4.0/
dc.source.uri https://nlp.fi.muni.cz/projekty/aozora
dc.subject Aozora
dc.subject Bunko
dc.subject Corpus
dc.subject Japanese
dc.subject Literature
dc.title Corpus from the Aozora Bunko Library
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Jakub Rohacek 514220@mail.muni.cz Masaryk University, NLP Centre
size.info 85292491 words
files.size 170227941
files.count 1
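
The dc.description field says that each document is wrapped in a <doc> tag carrying author, year, and title information. A minimal Python sketch of reading such a file follows. It assumes, without confirmation from the record, that the metadata is stored as attributes named author, year, and title, and that each <doc> and </doc> tag sits on its own line. The helper name iter_docs and the sample lines are illustrative only, not taken from the corpus.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Assumed layout: '<doc author="..." year="..." title="...">' on its own line,
# followed by the text, followed by '</doc>' on its own line.
DOC_OPEN = re.compile(r"<doc\b([^>]*)>")
ATTR = re.compile(r'(\w+)="([^"]*)"')


def iter_docs(lines: Iterable[str]) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (metadata, text) for every <doc>...</doc> block in the lines."""
    meta = None
    body = []
    for line in lines:
        stripped = line.strip()
        opening = DOC_OPEN.match(stripped)
        if opening:
            # Start a new document and collect its attributes.
            meta = dict(ATTR.findall(opening.group(1)))
            body = []
        elif stripped == "</doc>":
            if meta is not None:
                yield meta, "\n".join(body)
            meta = None
        elif meta is not None:
            body.append(line.rstrip("\n"))


# Illustrative input only; the real attribute names may differ.
sample = [
    '<doc author="夏目漱石" year="1914" title="こころ">\n',
    "私はその人を常に先生と呼んでいた。\n",
    "</doc>\n",
]

for meta, text in iter_docs(sample):
    print(meta.get("author"), meta.get("year"), meta.get("title"), len(text))
```

A line-based scan is used instead of a full XML parser because corpus files of this kind are not guaranteed to be well-formed XML as a whole. If they are, a streaming XML parser would be a reasonable alternative.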


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name aozora_bunko.tar.gz
Size 162.34 MB
Format application/x-gzip
Description Aozora Bunko Corpus
MD5 dafcf5de5472ea21b760d13b47c4a24d
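
The MD5 value listed for aozora_bunko.tar.gz can be used to check that a download is intact. A minimal Python sketch using the standard hashlib module follows. The function name md5_of_file and the local path in the usage note are assumptions, not part of the record.

```python
import hashlib

# MD5 checksum as listed in the item record for aozora_bunko.tar.gz.
EXPECTED_MD5 = "dafcf5de5472ea21b760d13b47c4a24d"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large archive is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With the archive saved in the working directory (an assumed location), `md5_of_file("aozora_bunko.tar.gz") == EXPECTED_MD5` should evaluate to True for an uncorrupted download.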
 File Preview  
    • part3.txt 101 MB
    • part5.txt 54 MB
    • part2.txt 114 MB
    • part4.txt 110 MB
    • part1.txt 102 MB
