dc.contributor.author Grác, Marek
dc.date.accessioned 2013-02-26T13:40:06Z
dc.date.available 2013-02-26T13:40:06Z
dc.date.issued 2011
dc.identifier.uri http://hdl.handle.net/11858/00-097C-0000-000E-011B-8
dc.description At the NLP Centre, text is currently divided into sentences by a tool that uses a rule-based system. To produce enough training data for machine learning, annotators manually split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators. The corpus was created from ten contemporary blogs: hintzu.otaku.cz, modnipeklo.cz, bloc.cz, aleneprokopova.blogspot.com, blog.aktualne.cz, fuchsova.blog.onaidnes.cz, havlik.blog.idnes.cz, blog.aktualne.centrum.cz, klusak.blogspot.cz, myego.cz/welldone
dc.language.iso ces
dc.publisher Masaryk University, NLP Centre
dc.rights Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/3.0/
dc.source.uri http://nlp.fi.muni.cz/projekty/cocb/
dc.subject corpus
dc.subject blogs
dc.subject annotation
dc.subject annotators
dc.subject sentences
dc.subject machine learning
dc.title Corpus of contemporary blogs
dc.type corpus
metashare.ResourceInfo#ContactInfo#PersonInfo.surname Grác
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName Marek
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName Masaryk University, NLP Centre
metashare.ResourceInfo#DistributionInfo.availability restrictedUse
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse academic-nonCommercialUse
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse attribution
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse noDerivatives
metashare.ResourceInfo#DistributionInfo#LicenseInfo.distributionAccessMedium downloadable
metashare.ResourceInfo#ValidationInfo.validated True
metashare.ResourceInfo#ContentInfo.mediaType text
metashare.ResourceInfo#TextInfo#SizeInfo.size 10
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit mb
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo#CommunicationInfo.email grac@fi.muni.cz
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
size.info 10 mb
files.size 4174388
files.count 1


 Files in this item

This item is publicly available and licensed under Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0).
Name CoCB.zip
Size 3.98 MB
Format application/zip
Description At the NLP Centre, text is currently divided into sentences by a tool that uses a rule-based system. To produce enough training data for machine learning, we had annotators split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators.
MD5 1abb8d3e784da994fe265c31d95d2fcb
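
The MD5 value listed for CoCB.zip can be used to check that a download arrived intact. The following is a minimal sketch of such a check using Python's standard `hashlib` module. The local path `CoCB.zip` and the helper names `md5_of_file` and `matches_record` are assumptions made for illustration and are not part of the record.

```python
import hashlib
from pathlib import Path

# MD5 value copied from the item record.
EXPECTED_MD5 = "1abb8d3e784da994fe265c31d95d2fcb"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so the archive need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 equals `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("CoCB.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if matches_record(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

MD5 is adequate here for detecting accidental corruption; it should not be treated as protection against deliberate tampering.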
 File Preview  
  • corpus_of_contemporary_blogs
    • 090.txt.anot01 52 kB
    • 095.txt.anot03 50 kB
    • 095.txt.anot02 50 kB
    • 074.txt.anot03 50 kB
    • 074.txt.anot01 50 kB
    • 079.txt.anot03 49 kB
    • 053.txt.anot03 53 kB
    • 079.txt.anot01 49 kB
    • 053.txt.anot01 53 kB
    • 044.txt.anot02 54 kB
    • 049.txt.anot03 51 kB
    • 023.txt.anot03 52 kB
    • 049.txt.anot01 51 kB
    • 023.txt.anot02 52 kB
    • 028.txt.anot03 51 kB
    • 028.txt.anot02 51 kB
    • 002.txt.anot02 52 kB
    • 002.txt.anot01 52 kB
    • 007.txt.anot03 52 kB
    • 007.txt.anot02 52 kB
    • 081.txt.anot03 48 kB
    • 081.txt.anot02 48 kB
    • 086.txt.anot03 52 kB
    • 060.txt.anot04 49 kB
    • 060.txt.anot03 50 kB
    • 086.txt.anot01 52 kB
    • 065.txt.anot02 50 kB
    • 065.txt.anot01 50 kB
    • 056.txt.anot03 51 kB
    • 030.txt.anot03 53 kB
    • 056.txt.anot01 51 kB
    • 030.txt.anot01 52 kB
    • 035.txt.anot03 57 kB
    • 035.txt.anot02 57 kB
    • 014.txt.anot03 51 kB
    • 014.txt.anot01 51 kB
    • 019.txt.anot03 51 kB
    • 019.txt.anot02 51 kB
    • 093.txt.anot03 51 kB
    • 093.txt.anot01 51 kB
    • 098.txt.anot02 39 kB
    • 098.txt.anot01 39 kB
    • 072.txt.anot02 52 kB
    • 072.txt.anot01 52 kB
    • 077.txt.anot03 47 kB
    • 077.txt.anot01 47 kB
    • 051.txt.anot02 51 kB
    • 051.txt.anot01 51 kB
    • 068.txt.anot03 54 kB
    • 068.txt.anot01 54 kB
    • 047.txt.anot03 43 kB
    • 047.txt.anot01 43 kB
    • 021.txt.anot02 51 kB
    • 021.txt.anot01 51 kB
    • 026.txt.anot03 50 kB
    • 026.txt.anot01 50 kB
    • 005.txt.anot03 54 kB
    • 005.txt.anot01 54 kB
    • 084.txt.anot03 49 kB
    • 084.txt.anot02 49 kB
    • 089.txt.anot03 51 kB
    • 089.txt.anot02 50 kB
    • 063.txt.anot03 46 kB
    • 063.txt.anot02 46 kB
    • 042.txt.anot03 50 kB
    • 042.txt.anot01 50 kB
    • 059.txt.anot03 51 kB
    • 059.txt.anot02 50 kB
    • 033.txt.anot03 51 kB
    • 033.txt.anot02 51 kB
    • 038.txt.anot03 54 kB
    • 038.txt.anot01 54 kB
    • 012.txt.anot02 50 kB
    • 012.txt.anot01 50 kB
    • 017.txt.anot02 51 kB
    • 017.txt.anot01 51 kB
    • 091.txt.anot03 52 kB
    • 096.txt.anot05 49 kB
    • 091.txt.anot01 52 kB
    • 096.txt.anot02 49 kB
    • 070.txt.anot02 51 kB
    • 070.txt.anot01 51 kB
    • 075.txt.anot03 52 kB
    • 075.txt.anot02 52 kB
    • 054.txt.anot03 54 kB
    • 054.txt.anot01 49 kB
    • 045.txt.anot03 51 kB
    • 045.txt.anot02 51 kB
    • 024.txt.anot03 54 kB
    • 024.txt.anot01 48 kB
    • 003.txt.anot03 53 kB
    • 029.txt.anot02 51 kB
    • 029.txt.anot01 50 kB
    • 003.txt.anot01 53 kB
    • 008.txt.anot02 52 kB
    • 008.txt.anot01 52 kB
    • 082.txt.anot03 52 kB
    • 082.txt.anot01 52 kB
    • 087.txt.anot02 51 kB
    • 061.txt.anot03 46 kB
    • 087.txt.anot01 51 kB
    • 061.txt.anot01 46 kB
    • 066.txt.anot03 53 kB
    • 040.txt.anot03 52 kB
    • 040.txt.anot02 52 kB
    • 066.txt.anot01 53 kB
    • 057.txt.anot02 52 kB
    • 031.txt.anot03 51 kB
    • 057.txt.anot01 52 kB
    • 031.txt.anot01 51 kB
    • 036.txt.anot03 49 kB
    • 010.txt.anot03 51 kB
    • 036.txt.anot01 49 kB
    • 010.txt.anot01 51 kB
    • 015.txt.anot03 49 kB
    • 015.txt.anot01 49 kB
    • 094.txt.anot03 52 kB
    • 099.txt.anot06 49 kB
    • 094.txt.anot02 52 kB
    • 073.txt.anot03 48 kB
    • 099.txt.anot01 49 kB
    • 073.txt.anot02 48 kB
    • 078.txt.anot03 49 kB
    • 078.txt.anot01 49 kB
    • 052.txt.anot02 51 kB
    • 052.txt.anot01 51 kB
    • 069.txt.anot03 51 kB
    • 069.txt.anot02 51 kB
    • 043.txt.anot03 52 kB
    • 043.txt.anot01 52 kB
    • 048.txt.anot03 50 kB
    • 022.txt.anot03 50 kB
    • 048.txt.anot01 50 kB
    • 022.txt.anot01 50 kB
    • 027.txt.anot03 51 kB
    • 027.txt.anot01 51 kB
    • 001.txt.anot02 52 kB
    • 001.txt.anot01 52 kB
    • 100.txt.anot03 23 kB
    • 100.txt.anot02 23 kB
    • 006.txt.anot02 53 kB
    • 006.txt.anot01 53 kB
    • 080.txt.anot03 52 kB
    • 080.txt.anot02 52 kB
    • 085.txt.anot03 51 kB
    • 085.txt.anot02 51 kB
    • 064.txt.anot03 46 kB
    • 064.txt.anot02 46 kB
    • 055.txt.anot03 51 kB
    • 055.txt.anot01 51 kB
    • 034.txt.anot03 53 kB
    • 034.txt.anot02 53 kB
    • 039.txt.anot03 52 kB
    • 039.txt.anot02 52 kB
    • 013.txt.anot03 50 kB
    • 013.txt.anot02 50 kB
    • 018.txt.anot02 52 kB
    • 018.txt.anot01 52 kB
    • 092.txt.anot02 52 kB
    • 097.txt.anot05 47 kB
    • 092.txt.anot01 52 kB
    • 097.txt.anot03 47 kB
    • 071.txt.anot02 51 kB
    • 071.txt.anot01 51 kB
    • 076.txt.anot02 51 kB
    • 076.txt.anot01 51 kB
    • 050.txt.anot02 50 kB
    • 050.txt.anot01 50 kB
    • 067.txt.anot03 51 kB
    • 067.txt.anot01 51 kB
    • 046.txt.anot03 57 kB
    • 020.txt.anot03 52 kB
    • 046.txt.anot02 57 kB
    • 020.txt.anot01 52 kB
    • 025.txt.anot03 52 kB
    • 025.txt.anot01 52 kB
    • 004.txt.anot03 52 kB
    • 004.txt.anot01 52 kB
    • 009.txt.anot03 53 kB
    • 009.txt.anot01 53 kB
    • 083.txt.anot03 49 kB
    • 083.txt.anot01 49 kB
    • 088.txt.anot03 52 kB
    • 062.txt.anot03 46 kB
    • 088.txt.anot01 52 kB
    • 062.txt.anot01 46 kB
    • 041.txt.anot02 52 kB
    • 041.txt.anot01 52 kB
    • 032.txt.anot03 50 kB
    • 058.txt.anot02 51 kB
    • 058.txt.anot01 51 kB
    • 032.txt.anot01 50 kB
    • 037.txt.anot03 52 kB
    • 011.txt.anot03 52 kB
    • 037.txt.anot01 52 kB
    • 011.txt.anot01 52 kB
    • 016.txt.anot03 51 kB
    • 016.txt.anot01 51 kB
    • 090.txt.anot02 52 kB
    • README 3 kB
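
Because each part of the corpus was processed by two annotators, a natural first step when working with the archive is to pair up the files that belong to the same part. The sketch below groups file names by part number. It assumes the names follow the pattern `NNN.txt.anotMM` (a three-digit part number and a two-digit annotator identifier), which is inferred from the listing and not stated in the record; `group_by_part` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumed pattern: three-digit part number, ".txt.anot", two-digit annotator id.
NAME_PATTERN = re.compile(r"^(\d{3})\.txt\.anot(\d{2})$")


def group_by_part(file_names):
    """Map each part number to the sorted annotator ids found for it."""
    groups = defaultdict(list)
    for name in file_names:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # skip README and any name that does not fit the pattern
        part, annotator = match.groups()
        groups[part].append(annotator)
    # Return a plain dict ordered by part number.
    return {part: sorted(ids) for part, ids in sorted(groups.items())}


if __name__ == "__main__":
    # A few names taken from the file listing.
    sample = [
        "090.txt.anot01",
        "090.txt.anot02",
        "060.txt.anot04",
        "060.txt.anot03",
        "README",
    ]
    for part, annotators in group_by_part(sample).items():
        print(part, annotators)
```

Parts that end up with fewer than two annotator ids can then be flagged before computing any agreement statistics.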
