At the NLP Centre, text is currently divided into sentences with
a rule-based tool. To produce enough training data for machine
learning, annotators manually split the contemporary-text corpus
CBB.blog (1 million tokens) into sentences.
Each file contains one hundredth of the whole corpus, and all data were
processed in parallel by two annotators.
The corpus was created from ten contemporary blogs:
hintzu.otaku.cz
modnipeklo.cz
bloc.cz
aleneprokopova.blogspot.com
blog.aktualne.cz
fuchsova.blog.onaidnes.cz
havlik.blog.idnes.cz
blog.aktualne.centrum.cz
klusak.blogspot.cz
myego.cz/welldone
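
Because every file was split independently by two annotators, the two segmentations can be compared position by position. The source does not say how the annotations were reconciled, so the following is only a minimal sketch under assumptions: each annotation is taken to be a list of sentences over the same token sequence, and the function names (`boundaries`, `boundary_f1`, `disagreements`) and the toy data are hypothetical.

```python
from typing import List, Set

Segmentation = List[List[str]]  # assumed format: a list of sentences, each a list of tokens


def boundaries(sentences: Segmentation) -> Set[int]:
    """Return the token offsets at which each sentence ends."""
    ends: Set[int] = set()
    offset = 0
    for sentence in sentences:
        offset += len(sentence)
        ends.add(offset)
    return ends


def boundary_f1(a: Segmentation, b: Segmentation) -> float:
    """F1 overlap between the sentence-end positions of two annotators."""
    ends_a, ends_b = boundaries(a), boundaries(b)
    if not ends_a and not ends_b:
        return 1.0  # two empty annotations agree trivially
    shared = len(ends_a & ends_b)
    return 2 * shared / (len(ends_a) + len(ends_b))


def disagreements(a: Segmentation, b: Segmentation) -> List[int]:
    """Token offsets marked as a sentence end by exactly one annotator."""
    return sorted(boundaries(a) ^ boundaries(b))


# Toy data (hypothetical, not taken from CBB.blog).
annotator_a = [["Ahoj", "."], ["Jak", "se", "máš", "?"], ["Dobře", "."]]
annotator_b = [["Ahoj", ".", "Jak", "se", "máš", "?"], ["Dobře", "."]]

print(boundary_f1(annotator_a, annotator_b))
print(disagreements(annotator_a, annotator_b))
```

Note that the end of the text always counts as a shared boundary, so the score is slightly optimistic on very short files; the list of disagreeing offsets is usually the more useful output when a third person adjudicates.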
This article deals with the introduction of web resources into subject literary bibliographies. The issue is first analyzed at a general level, as the methodological challenge of systematically introducing online sources into the management of a current bibliography; case studies then follow, targeting specific problems in the bibliographic treatment of internet material. First, the article discusses issues related to the bibliographic processing of online documents (web pages, online journals, etc.), which stand in complex and variable relations to those available in print. Attention is paid to methodological issues, in particular to providing criteria for the selection of web resources. The need to archive bibliographically processed materials, which we assess as the crucial element of any systematic bibliographic processing of internet materials, is also highlighted. In the following part, the study presents a preliminary classification of the new, specific genres of internet content: blogs and literary forums. First, the Polish literary blogosphere is analyzed and a preliminary typology of this document type is introduced. Then the phenomenon of literary forums is considered. Using the poetry forum Nieszuflada.pl as an example, a more detailed quantitative analysis of this resource type is given, and issues of authorship attribution in the digital environment are discussed.
Creative Commons is a copyright movement that supports the building of a public domain by providing an alternative to the automatic "all rights reserved" copyright: "some rights reserved". There are four major Creative Commons conditions: Attribution (BY), requiring attribution to the original author; Share Alike (SA), allowing derivative works under the same or a similar license (a later or jurisdiction version); Non-Commercial (NC), requiring that the work not be used for commercial purposes; and No Derivative Works (ND), allowing only the original work to be used, without derivatives.