C4Corpus (CC BY-NC part)

Please use the following text to cite this item:
Gurevych, Iryna; Habernal, Ivan and Zayed, Omnia, 2016, C4Corpus (CC BY-NC part), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11372/LRT-2204.
Date issued
2016-04-14
Size
10,000,000,000 tokens
Description
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family. It was extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Files in this item
Name
Lic_by-nc_Lang_es_NoBoilerplate_true_MinHtml_true-r-00022.seg-00000.warc.gz
Size
252.03 MB
Format
application/x-gzip
Description
gzip Archive
MD5
be91b4768d9ef7a0d0a57c6d056f110a
Name
Lic_by-nc_Lang_en_NoBoilerplate_true_MinHtml_true-r-00017.seg-00001.warc.gz
Size
49.66 MB
Format
application/x-gzip
Description
gzip Archive
MD5
ec9c70eaa8925085dcd15ad667d34b9e
Name
Lic_by-nc_Lang_et_NoBoilerplate_true_MinHtml_true-r-00023.seg-00000.warc.gz
Size
248.87 KB
Format
application/x-gzip
Description
gzip Archive
MD5
06891f86e24b53ff3cd30be1ad32a86b
Name
Lic_by-nc_Lang_fa_NoBoilerplate_true_MinHtml_true-r-00004.seg-00000.warc.gz
Size
350.26 KB
Format
application/x-gzip
Description
gzip Archive
MD5
e8f875c273fa368de5cd835a4713183e
Name
Lic_by-nc_Lang_fi_NoBoilerplate_true_MinHtml_true-r-00012.seg-00000.warc.gz
Size
478.4 KB
Format
application/x-gzip
Description
gzip Archive
MD5
93f6c41913ecef22169d807632028fa6
Name
Lic_by-nc_Lang_fr_NoBoilerplate_true_MinHtml_true-r-00021.seg-00000.warc.gz
Size
17.08 MB
Format
application/x-gzip
Description
gzip Archive
MD5
b66ccdaf21aa2adaa82f115061296b09
Name
Lic_by-nc_Lang_he_NoBoilerplate_true_MinHtml_true-r-00008.seg-00000.warc.gz
Size
601.71 KB
Format
application/x-gzip
Description
gzip Archive
MD5
5182a10e9cb434104e949e3e7f8eaf3e
Name
Lic_by-nc_Lang_hi_NoBoilerplate_true_MinHtml_true-r-00012.seg-00000.warc.gz
Size
129.82 KB
Format
application/x-gzip
Description
gzip Archive
MD5
5b9bebc69ccc925485259f095d05bf65
Name
Lic_by-nc_Lang_hr_NoBoilerplate_true_MinHtml_true-r-00021.seg-00000.warc.gz
Size
1.51 MB
Format
application/x-gzip
Description
gzip Archive
MD5
8162986d4fe4f977e30db78962e6c1ec
Name
Lic_by-nc_Lang_hu_NoBoilerplate_true_MinHtml_true-r-00024.seg-00000.warc.gz
Size
698.36 KB
Format
application/x-gzip
Description
gzip Archive
MD5
fc6cd4ffd704506cfa6e6148ffb31208
Name
Lic_by-nc_Lang_id_NoBoilerplate_true_MinHtml_true-r-00007.seg-00000.warc.gz
Size
5.96 MB
Format
application/x-gzip
Description
gzip Archive
MD5
18c3dbff2214a9fcf52972bdf3efcf0f
Name
Lic_by-nc_Lang_it_NoBoilerplate_true_MinHtml_true-r-00023.seg-00000.warc.gz
Size
20.68 MB
Format
application/x-gzip
Description
gzip Archive
MD5
c73a0895c251e1536355d8155c07073c
Name
Lic_by-nc_Lang_ja_NoBoilerplate_true_MinHtml_true-r-00004.seg-00000.warc.gz
Size
521.07 KB
Format
application/x-gzip
Description
gzip Archive
MD5
2d467dcf00fc6b73519c7ecbcb506aa7
Name
Lic_by-nc_Lang_kn_NoBoilerplate_true_MinHtml_true-r-00017.seg-00000.warc.gz
Size
44.81 KB
Format
application/x-gzip
Description
gzip Archive
MD5
6e7d0fde85b75381107fa14d2cfbe696
Name
Lic_by-nc_Lang_ko_NoBoilerplate_true_MinHtml_true-r-00018.seg-00000.warc.gz
Size
2.03 MB
Format
application/x-gzip
Description
gzip Archive
MD5
96425614df668737e3aeb0e5e50ab6c2
Name
Lic_by-nc_Lang_lt_NoBoilerplate_true_MinHtml_true-r-00023.seg-00000.warc.gz
Size
229.15 KB
Format
application/x-gzip
Description
gzip Archive
MD5
e7f35c1e01244b55dfe78c3c4e3e3f8c
Name
Lic_by-nc_Lang_lv_NoBoilerplate_true_MinHtml_true-r-00025.seg-00000.warc.gz
Size
46.39 KB
Format
application/x-gzip
Description
gzip Archive
MD5
189559ee67c7bb83e0f065ec069aadc9
Name
Lic_by-nc_Lang_mk_NoBoilerplate_true_MinHtml_true-r-00014.seg-00000.warc.gz
Size
75.68 KB
Format
application/x-gzip
Description
gzip Archive
MD5
c75393224dbd8ddd4a238d4be5fb2a20
Name
Lic_by-nc_Lang_ml_NoBoilerplate_true_MinHtml_true-r-00015.seg-00000.warc.gz
Size
15.98 KB
Format
application/x-gzip
Description
gzip Archive
MD5
891c4ce630fb3f602f7cc8429205b516
Name
Lic_by-nc_Lang_ne_NoBoilerplate_true_MinHtml_true-r-00008.seg-00000.warc.gz
Size
70.01 KB
Format
application/x-gzip
Description
gzip Archive
MD5
65b12b8f0baf41d17bc2eb818e317072
Name
Lic_by-nc_Lang_nl_NoBoilerplate_true_MinHtml_true-r-00015.seg-00000.warc.gz
Size
2.73 MB
Format
application/x-gzip
Description
gzip Archive
MD5
adf300960b1963b2ae673ca4fc05d491
Name
Lic_by-nc_Lang_no_NoBoilerplate_true_MinHtml_true-r-00018.seg-00000.warc.gz
Size
490.74 KB
Format
application/x-gzip
Description
gzip Archive
MD5
79271fb1e6649f60f82d16569df54749
Name
Lic_by-nc_Lang_pa_NoBoilerplate_true_MinHtml_true-r-00004.seg-00000.warc.gz
Size
1.33 KB
Format
application/x-gzip
Description
gzip Archive
MD5
47bea29e1fd9a83a8e0f741c3f687d6b
Name
Lic_by-nc_Lang_pl_NoBoilerplate_true_MinHtml_true-r-00015.seg-00000.warc.gz
Size
941.07 KB
Format
application/x-gzip
Description
gzip Archive
MD5
3be442fb31376b1d2a8f70a969353780
Name
Lic_by-nc_Lang_pt_NoBoilerplate_true_MinHtml_true-r-00023.seg-00000.warc.gz
Size
91.79 MB
Format
application/x-gzip
Description
gzip Archive
MD5
26146627cbcd72cc725871f88f45b82c
Name
Lic_by-nc_Lang_ro_NoBoilerplate_true_MinHtml_true-r-00018.seg-00000.warc.gz
Size
905.45 KB
Format
application/x-gzip
Description
gzip Archive
MD5
d781ddb2afdb53f2c9b103beeb41860e
Name
Lic_by-nc_Lang_ru_NoBoilerplate_true_MinHtml_true-r-00024.seg-00000.warc.gz
Size
651.44 KB
Format
application/x-gzip
Description
gzip Archive
MD5
8f5a5b8318ae43bd2dd8ccfabc26560c
Name
Lic_by-nc_Lang_sk_NoBoilerplate_true_MinHtml_true-r-00014.seg-00000.warc.gz
Size
498.81 KB
Format
application/x-gzip
Description
gzip Archive
MD5
54eaf233a9bfabca8e706d29c0c6156a
Name
Lic_by-nc_Lang_sl_NoBoilerplate_true_MinHtml_true-r-00015.seg-00000.warc.gz
Size
306.11 KB
Format
application/x-gzip
Description
gzip Archive
MD5
812cb4ac3abed2fbe1cedc933384c97a
Name
Lic_by-nc_Lang_so_NoBoilerplate_true_MinHtml_true-r-00018.seg-00000.warc.gz
Size
47.82 KB
Format
application/x-gzip
Description
gzip Archive
MD5
a555f2ff3a0477f716ac38fddc0a4d7d
Name
Lic_by-nc_Lang_sq_NoBoilerplate_true_MinHtml_true-r-00020.seg-00000.warc.gz
Size
1.66 MB
Format
application/x-gzip
Description
gzip Archive
MD5
e904baca51a0c44506afbb7791cae630
Name
Lic_by-nc_Lang_sv_NoBoilerplate_true_MinHtml_true-r-00025.seg-00000.warc.gz
Size
1.01 MB
Format
application/x-gzip
Description
gzip Archive
MD5
b8b2b4aac16764dabe65b37154f3a520
Name
Lic_by-nc_Lang_sw_NoBoilerplate_true_MinHtml_true-r-00026.seg-00000.warc.gz
Size
3.67 KB
Format
application/x-gzip
Description
gzip Archive
MD5
852c4838a8a5de9e4a8507de6d592fb5
Name
Lic_by-nc_Lang_ta_NoBoilerplate_true_MinHtml_true-r-00004.seg-00000.warc.gz
Size
6.09 KB
Format
application/x-gzip
Description
gzip Archive
MD5
c75370fd8dddfc231b8e86e945ec9a4d
Name
Lic_by-nc_Lang_te_NoBoilerplate_true_MinHtml_true-r-00008.seg-00000.warc.gz
Size
10.67 KB
Format
application/x-gzip
Description
gzip Archive
MD5
811dde70288d12ea19d2599a9d65a8f4
Name
Lic_by-nc_Lang_th_NoBoilerplate_true_MinHtml_true-r-00011.seg-00000.warc.gz
Size
2.12 MB
Format
application/x-gzip
Description
gzip Archive
MD5
1516a4e890710aac7c9495ece9eda61a
Name
Lic_by-nc_Lang_tl_NoBoilerplate_true_MinHtml_true-r-00015.seg-00000.warc.gz
Size
389.01 KB
Format
application/x-gzip
Description
gzip Archive
MD5
309cf55349abc8a8a27bcfc79ed69d07
Name
Lic_by-nc_Lang_tr_NoBoilerplate_true_MinHtml_true-r-00021.seg-00000.warc.gz
Size
1.22 MB
Format
application/x-gzip
Description
gzip Archive
MD5
6ab491029966b3d070b961a218508d5d
Name
Lic_by-nc_Lang_uk_NoBoilerplate_true_MinHtml_true-r-00014.seg-00000.warc.gz
Size
36.98 KB
Format
application/x-gzip
Description
gzip Archive
MD5
1126dfbdf6852d2f223f7d4fd41ecc49
Name
Lic_by-nc_Lang_unknown_NoBoilerplate_true_MinHtml_true-r-00017.seg-00000.warc.gz
Size
9.59 MB
Format
application/x-gzip
Description
gzip Archive
MD5
32d33b917ed99d268396702b6cbe1349
Name
Lic_by-nc_Lang_vi_NoBoilerplate_true_MinHtml_true-r-00012.seg-00000.warc.gz
Size
1.5 MB
Format
application/x-gzip
Description
gzip Archive
MD5
bf85039f32163c2e58a403a1093ae627
Name
Lic_by-nc_Lang_zh-cn_NoBoilerplate_true_MinHtml_true-r-00017.seg-00000.warc.gz
Size
130.08 KB
Format
application/x-gzip
Description
gzip Archive
MD5
6499c8a03a02be39bcf153fd641975dc
Name
Lic_by-nc_Lang_zh-tw_NoBoilerplate_true_MinHtml_true-r-00026.seg-00000.warc.gz
Size
184.23 KB
Format
application/x-gzip
Description
gzip Archive
MD5
3bcf993d01658cc0e72389612c77612e
Name
Lic_by-nc_Lang_af_NoBoilerplate_true_MinHtml_true-r-00009.seg-00000.warc.gz
Size
13.55 KB
Format
application/x-gzip
Description
gzip Archive
MD5
78d26cdef7bc599a49ca8eecba13f3a2
Name
Lic_by-nc_Lang_ar_NoBoilerplate_true_MinHtml_true-r-00021.seg-00000.warc.gz
Size
937.25 KB
Format
application/x-gzip
Description
gzip Archive
MD5
dfe8bac91ae9206fb3115bf44884599a
Name
Lic_by-nc_Lang_bg_NoBoilerplate_true_MinHtml_true-r-00010.seg-00000.warc.gz
Size
335.68 KB
Format
application/x-gzip
Description
gzip Archive
MD5
a7f9bc0c8d817dcabbb42cbd850e6721
Name
Lic_by-nc_Lang_bn_NoBoilerplate_true_MinHtml_true-r-00017.seg-00000.warc.gz
Size
12.78 KB
Format
application/x-gzip
Description
gzip Archive
MD5
c10e60c63adf09390b81187bcd6c43e1
Name
Lic_by-nc_Lang_cs_NoBoilerplate_true_MinHtml_true-r-00022.seg-00000.warc.gz
Size
1.89 MB
Format
application/x-gzip
Description
gzip Archive
MD5
9f7eb479d129e660497bbd9c751a1456
Name
Lic_by-nc_Lang_da_NoBoilerplate_true_MinHtml_true-r-00004.seg-00000.warc.gz
Size
514.73 KB
Format
application/x-gzip
Description
gzip Archive
MD5
a647c06d398be86546db53ab40c4fe5a
Name
Lic_by-nc_Lang_de_NoBoilerplate_true_MinHtml_true-r-00008.seg-00000.warc.gz
Size
10.88 MB
Format
application/x-gzip
Description
gzip Archive
MD5
9e911adb2ef0469ea99a996ac466767b
Name
Lic_by-nc_Lang_el_NoBoilerplate_true_MinHtml_true-r-00015.seg-00000.warc.gz
Size
1.62 MB
Format
application/x-gzip
Description
gzip Archive
MD5
a6ecae8453cefbefbb5b95e386dcb65d
Name
Lic_by-nc_Lang_en_NoBoilerplate_true_MinHtml_true-r-00017.seg-00000.warc.gz
Size
953.7 MB
Format
application/x-gzip
Description
gzip Archive
MD5
5b33ad2dc94b990922acf90159018bc1