Files in this item

This item is Publicly Available and licensed under: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
Name: CoCB.zip
Size: 3.98 MB
Format: application/zip
Description: At the NLP Centre, text is currently divided into sentences by a rule-based tool. To create enough training data for machine learning, we had annotators split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators.
MD5: 1abb8d3e784da994fe265c31d95d2fcb
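Since the record lists an MD5 digest for the archive, a downloaded copy can be checked against it. The following is a minimal sketch, assuming the archive was saved locally as `CoCB.zip` in the working directory; the function names `md5_of_file` and `verify_download` are illustrative and not part of the dataset.

```python
import hashlib
from pathlib import Path

# Digest as listed in the MD5 field of this record.
EXPECTED_MD5 = "1abb8d3e784da994fe265c31d95d2fcb"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected digest."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("CoCB.zip")  # assumed local file name
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

MD5 is used here only because it is the digest the record provides; it detects corrupted downloads but is not suitable as a security guarantee.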
 File Preview  
  • corpus_of_contemporary_blogs
    • 090.txt.anot01 (52 kB)
    • 095.txt.anot03 (50 kB)
    • 095.txt.anot02 (50 kB)
    • 074.txt.anot03 (50 kB)
    • 074.txt.anot01 (50 kB)
    • 079.txt.anot03 (49 kB)
    • 053.txt.anot03 (53 kB)
    • 079.txt.anot01 (49 kB)
    • 053.txt.anot01 (53 kB)
    • 044.txt.anot02 (54 kB)
    • 049.txt.anot03 (51 kB)
    • 023.txt.anot03 (52 kB)
    • 049.txt.anot01 (51 kB)
    • 023.txt.anot02 (52 kB)
    • 028.txt.anot03 (51 kB)
    • 028.txt.anot02 (51 kB)
    • 002.txt.anot02 (52 kB)
    • 002.txt.anot01 (52 kB)
    • 007.txt.anot03 (52 kB)
    • 007.txt.anot02 (52 kB)
    • 081.txt.anot03 (48 kB)
    • 081.txt.anot02 (48 kB)
    • 086.txt.anot03 (52 kB)
    • 060.txt.anot04 (49 kB)
    • 060.txt.anot03 (50 kB)
    • 086.txt.anot01 (52 kB)
    • 065.txt.anot02 (50 kB)
    • 065.txt.anot01 (50 kB)
    • 056.txt.anot03 (51 kB)
    • 030.txt.anot03 (53 kB)
    • 056.txt.anot01 (51 kB)
    • 030.txt.anot01 (52 kB)
    • 035.txt.anot03 (57 kB)
    • 035.txt.anot02 (57 kB)
    • 014.txt.anot03 (51 kB)
    • 014.txt.anot01 (51 kB)
    • 019.txt.anot03 (51 kB)
    • 019.txt.anot02 (51 kB)
    • 093.txt.anot03 (51 kB)
    • 093.txt.anot01 (51 kB)
    • 098.txt.anot02 (39 kB)
    • 098.txt.anot01 (39 kB)
    • 072.txt.anot02 (52 kB)
    • 072.txt.anot01 (52 kB)
    • 077.txt.anot03 (47 kB)
    • 077.txt.anot01 (47 kB)
    • 051.txt.anot02 (51 kB)
    • 051.txt.anot01 (51 kB)
    • 068.txt.anot03 (54 kB)
    • 068.txt.anot01 (54 kB)
    • 047.txt.anot03 (43 kB)
    • 047.txt.anot01 (43 kB)
    • 021.txt.anot02 (51 kB)
    • 021.txt.anot01 (51 kB)
    • 026.txt.anot03 (50 kB)
    • 026.txt.anot01 (50 kB)
    • 005.txt.anot03 (54 kB)
    • 005.txt.anot01 (54 kB)
    • 084.txt.anot03 (49 kB)
    • 084.txt.anot02 (49 kB)
    • 089.txt.anot03 (51 kB)
    • 089.txt.anot02 (50 kB)
    • 063.txt.anot03 (46 kB)
    • 063.txt.anot02 (46 kB)
    • 042.txt.anot03 (50 kB)
    • 042.txt.anot01 (50 kB)
    • 059.txt.anot03 (51 kB)
    • 059.txt.anot02 (50 kB)
    • 033.txt.anot03 (51 kB)
    • 033.txt.anot02 (51 kB)
    • 038.txt.anot03 (54 kB)
    • 038.txt.anot01 (54 kB)
    • 012.txt.anot02 (50 kB)
    • 012.txt.anot01 (50 kB)
    • 017.txt.anot02 (51 kB)
    • 017.txt.anot01 (51 kB)
    • 091.txt.anot03 (52 kB)
    • 096.txt.anot05 (49 kB)
    • 091.txt.anot01 (52 kB)
    • 096.txt.anot02 (49 kB)
    • 070.txt.anot02 (51 kB)
    • 070.txt.anot01 (51 kB)
    • 075.txt.anot03 (52 kB)
    • 075.txt.anot02 (52 kB)
    • 054.txt.anot03 (54 kB)
    • 054.txt.anot01 (49 kB)
    • 045.txt.anot03 (51 kB)
    • 045.txt.anot02 (51 kB)
    • 024.txt.anot03 (54 kB)
    • 024.txt.anot01 (48 kB)
    • 003.txt.anot03 (53 kB)
    • 029.txt.anot02 (51 kB)
    • 029.txt.anot01 (50 kB)
    • 003.txt.anot01 (53 kB)
    • 008.txt.anot02 (52 kB)
    • 008.txt.anot01 (52 kB)
    • 082.txt.anot03 (52 kB)
    • 082.txt.anot01 (52 kB)
    • 087.txt.anot02 (51 kB)
    • 061.txt.anot03 (46 kB)
    • 087.txt.anot01 (51 kB)
    • 061.txt.anot01 (46 kB)
    • 066.txt.anot03 (53 kB)
    • 040.txt.anot03 (52 kB)
    • 040.txt.anot02 (52 kB)
    • 066.txt.anot01 (53 kB)
    • 057.txt.anot02 (52 kB)
    • 031.txt.anot03 (51 kB)
    • 057.txt.anot01 (52 kB)
    • 031.txt.anot01 (51 kB)
    • 036.txt.anot03 (49 kB)
    • 010.txt.anot03 (51 kB)
    • 036.txt.anot01 (49 kB)
    • 010.txt.anot01 (51 kB)
    • 015.txt.anot03 (49 kB)
    • 015.txt.anot01 (49 kB)
    • 094.txt.anot03 (52 kB)
    • 099.txt.anot06 (49 kB)
    • 094.txt.anot02 (52 kB)
    • 073.txt.anot03 (48 kB)
    • 099.txt.anot01 (49 kB)
    • 073.txt.anot02 (48 kB)
    • 078.txt.anot03 (49 kB)
    • 078.txt.anot01 (49 kB)
    • 052.txt.anot02 (51 kB)
    • 052.txt.anot01 (51 kB)
    • 069.txt.anot03 (51 kB)
    • 069.txt.anot02 (51 kB)
    • 043.txt.anot03 (52 kB)
    • 043.txt.anot01 (52 kB)
    • 048.txt.anot03 (50 kB)
    • 022.txt.anot03 (50 kB)
    • 048.txt.anot01 (50 kB)
    • 022.txt.anot01 (50 kB)
    • 027.txt.anot03 (51 kB)
    • 027.txt.anot01 (51 kB)
    • 001.txt.anot02 (52 kB)
    • 001.txt.anot01 (52 kB)
    • 100.txt.anot03 (23 kB)
    • 100.txt.anot02 (23 kB)
    • 006.txt.anot02 (53 kB)
    • 006.txt.anot01 (53 kB)
    • 080.txt.anot03 (52 kB)
    • 080.txt.anot02 (52 kB)
    • 085.txt.anot03 (51 kB)
    • 085.txt.anot02 (51 kB)
    • 064.txt.anot03 (46 kB)
    • 064.txt.anot02 (46 kB)
    • 055.txt.anot03 (51 kB)
    • 055.txt.anot01 (51 kB)
    • 034.txt.anot03 (53 kB)
    • 034.txt.anot02 (53 kB)
    • 039.txt.anot03 (52 kB)
    • 039.txt.anot02 (52 kB)
    • 013.txt.anot03 (50 kB)
    • 013.txt.anot02 (50 kB)
    • 018.txt.anot02 (52 kB)
    • 018.txt.anot01 (52 kB)
    • 092.txt.anot02 (52 kB)
    • 097.txt.anot05 (47 kB)
    • 092.txt.anot01 (52 kB)
    • 097.txt.anot03 (47 kB)
    • 071.txt.anot02 (51 kB)
    • 071.txt.anot01 (51 kB)
    • 076.txt.anot02 (51 kB)
    • 076.txt.anot01 (51 kB)
    • 050.txt.anot02 (50 kB)
    • 050.txt.anot01 (50 kB)
    • 067.txt.anot03 (51 kB)
    • 067.txt.anot01 (51 kB)
    • 046.txt.anot03 (57 kB)
    • 020.txt.anot03 (52 kB)
    • 046.txt.anot02 (57 kB)
    • 020.txt.anot01 (52 kB)
    • 025.txt.anot03 (52 kB)
    • 025.txt.anot01 (52 kB)
    • 004.txt.anot03 (52 kB)
    • 004.txt.anot01 (52 kB)
    • 009.txt.anot03 (53 kB)
    • 009.txt.anot01 (53 kB)
    • 083.txt.anot03 (49 kB)
    • 083.txt.anot01 (49 kB)
    • 088.txt.anot03 (52 kB)
    • 062.txt.anot03 (46 kB)
    • 088.txt.anot01 (52 kB)
    • 062.txt.anot01 (46 kB)
    • 041.txt.anot02 (52 kB)
    • 041.txt.anot01 (52 kB)
    • 032.txt.anot03 (50 kB)
    • 058.txt.anot02 (51 kB)
    • 058.txt.anot01 (51 kB)
    • 032.txt.anot01 (50 kB)
    • 037.txt.anot03 (52 kB)
    • 011.txt.anot03 (52 kB)
    • 037.txt.anot01 (52 kB)
    • 011.txt.anot01 (52 kB)
    • 016.txt.anot03 (51 kB)
    • 016.txt.anot01 (51 kB)
    • 090.txt.anot02 (52 kB)
    • README (3 kB)
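
Because each part of the corpus was processed by two annotators, a natural first step is to pair up the files that belong to the same part. The sketch below assumes the naming pattern visible in the listing above (a three-digit part number, `.txt.`, then an annotator label such as `anot01`); the README in the archive may describe the layout more precisely. The helper names `group_by_chunk` and `chunks_without_two_annotators` are hypothetical.

```python
import posixpath
import re
import zipfile
from collections import defaultdict

# Assumed pattern, inferred from the file listing: e.g. "090.txt.anot01".
NAME_PATTERN = re.compile(r"^(\d{3})\.txt\.(anot\d{2})$")


def group_by_chunk(filenames):
    """Map each part number to the sorted annotator labels found for it."""
    chunks = defaultdict(list)
    for name in filenames:
        match = NAME_PATTERN.match(posixpath.basename(name))
        if match is None:
            continue  # skip entries such as README or directory names
        chunk_id, annotator = match.groups()
        chunks[chunk_id].append(annotator)
    return {chunk: sorted(labels) for chunk, labels in chunks.items()}


def chunks_without_two_annotators(groups):
    """Return part numbers that do not have exactly two annotator files."""
    return sorted(chunk for chunk, labels in groups.items() if len(labels) != 2)


if __name__ == "__main__":
    # Assumes the archive was saved locally as CoCB.zip.
    with zipfile.ZipFile("CoCB.zip") as archive:
        groups = group_by_chunk(archive.namelist())
    print(len(groups), "parts found")
    print("parts to inspect:", chunks_without_two_annotators(groups))
```

Grouping by part number first makes it straightforward to compare the two annotations of the same text, for example to measure agreement on sentence boundaries.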