dc.contributor.author Gurevych, Iryna
dc.contributor.author Habernal, Ivan
dc.contributor.author Zayed, Omnia
dc.date.accessioned 2017-06-07T13:09:38Z
dc.date.available 2017-06-07T13:09:38Z
dc.date.issued 2016-04-14
dc.identifier.uri http://hdl.handle.net/11372/LRT-2208
dc.description A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general web crawl to date, with about 2 billion crawled URLs.
dc.language.iso afr
dc.language.iso ara
dc.language.iso ben
dc.language.iso bul
dc.language.iso ces
dc.language.iso dan
dc.language.iso deu
dc.language.iso ell
dc.language.iso eng
dc.language.iso est
dc.language.iso fas
dc.language.iso fin
dc.language.iso fra
dc.language.iso guj
dc.language.iso heb
dc.language.iso hin
dc.language.iso hrv
dc.language.iso hun
dc.language.iso ind
dc.language.iso ita
dc.language.iso jpn
dc.language.iso kan
dc.language.iso kor
dc.language.iso lav
dc.language.iso lit
dc.language.iso mal
dc.language.iso mar
dc.language.iso mkd
dc.language.iso nep
dc.language.iso nld
dc.language.iso nor
dc.language.iso pan
dc.language.iso pol
dc.language.iso por
dc.language.iso ron
dc.language.iso rus
dc.language.iso slk
dc.language.iso slv
dc.language.iso som
dc.language.iso spa
dc.language.iso sqi
dc.language.iso swa
dc.language.iso swe
dc.language.iso tam
dc.language.iso tel
dc.language.iso tgl
dc.language.iso tha
dc.language.iso tur
dc.language.iso ukr
dc.language.iso und
dc.language.iso urd
dc.language.iso vie
dc.language.iso zho
dc.publisher Technische Universität Darmstadt
dc.relation.isreferencedby http://www.lrec-conf.org/proceedings/lrec2016/pdf/388_Paper.pdf
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-sa/4.0/
dc.source.uri https://dkpro.github.io/dkpro-c4corpus/
dc.subject CommonCrawl
dc.subject Creative Commons
dc.subject Web corpus
dc.subject Amazon Web Services
dc.title C4Corpus (CC BY-SA part)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LRT + Open Submissions
contact.person Ivan Habernal habernal@ukp.informatik.tu-darmstadt.de Technische Universität Darmstadt
sponsor German Research Foundation (DFG) DIP DA 1600/1-1 Information Consolidation: A New Paradigm in Knowledge Search nationalFunds
sponsor Amazon Amazon Web Services in Education Grant Web Services in Education Grant Other
size.info 10000000000 tokens
files.size 15200274764
files.count 62


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name  Size  Format  MD5
Lic_by-sa_Lang_af_NoBoilerplate_true_MinHtml_true-r-00009.seg-00000.warc.gz  2.23 MB  application/x-gzip  3ba94b274d8b4bb742e65786bcb559d5
Lic_by-sa_Lang_ar_NoBoilerplate_true_MinHtml_true-r-00021.seg-00000.warc.gz  37.68 MB  application/x-gzip  9923830b8ce2812b7c753ab4b4ec7474
Lic_by-sa_Lang_bg_NoBoilerplate_true_MinHtml_true-r-00010.seg-00000.warc.gz  96.78 MB  application/x-gzip  3618511147cc62e0a9a34bc3732d0232
Lic_by-sa_Lang_bn_NoBoilerplate_true_MinHtml_true-r-00017.seg-00000.warc.gz  24.07 MB  application/x-gzip  1c9686242fb8d65aa16851c29b0f1d8b
Lic_by-sa_Lang_cs_NoBoilerplate_true_MinHtml_true-r-00022.seg-00000.warc.gz  151.99 MB  application/x-gzip  de204fd230d68f59ff59f805f63765da
Lic_by-sa_Lang_da_NoBoilerplate_true_MinHtml_true-r-00004.seg-00000.warc.gz  75.86 MB  application/x-gzip  9b04c043324dd16937a7cdb7de57fb84
Lic_by-sa_Lang_de_NoBoilerplate_true_MinHtml_true-r-00008.seg-00000.warc.gz  150.42 MB  application/x-gzip  fda2433769213e7450132d6f77400a5e
Lic_by-sa_Lang_el_NoBoilerplate_true_MinHtml_true-r-00015.seg-00000.warc.gz  106.64 MB  application/x-gzip  6caf73daf1e3f2cae1b9ed2b8b58c59f
Lic_by-sa_Lang_en_NoBoilerplate_true_MinHtml_true-r-00017.seg-00000.warc.gz  953.71 MB  application/x-gzip  c978973668eac1699954e95dda1a1f21
Lic_by-sa_Lang_en_NoBoilerplate_true_MinHtml_true-r-00017.seg-00001.warc.gz  953.7 MB  application/x-gzip  11d4e0493fa56aa72b3ff868bd9b6424
Lic_by-sa_Lang_en_NoBoilerplate_true_MinHtml_true-r-00017.seg-00002.warc.gz  953.74 MB  application/x-gzip  5737d16ca93be1b7a9e90187f1e1e590
Lic_by-sa_Lang_en_NoBoilerplate_true_MinHtml_true-r-00017.seg-00003.warc.gz  953.71 MB  application/x-gzip  002b240469da10bd85e9db56b9a9ef6d
Lic_by-sa_Lang_en_NoBoilerplate_true_MinHtml_true-r-00017.seg-00004.warc.gz  953.7 MB  application/x-gzip  0cc3179ebcba6f578963ef2704a631a0
Lic_by-sa_Lang_en_NoBoilerplate_true_MinHtml_true-r-00017.seg-00005.warc.gz  953.69 MB  application/x-gzip  becea1c355e051f42d02a7605cad60d1
Lic_by-sa_Lang_en_NoBoilerplate_true_MinHtml_true-r-00017.seg-00006.warc.gz  953.69 MB  application/x-gzip  7453b7440242f6e1fc0070cf25dc4cce
Lic_by-sa_Lang_en_NoBoilerplate_true_MinHtml_true-r-00017.seg-00007.warc.gz  953.71 MB  application/x-gzip  54b90a17c78fda673fb6243eef4ee4d3
Lic_by-sa_Lang_en_NoBoilerplate_true_MinHtml_true-r-00017.seg-00008.warc.gz  279.78 MB  application/x-gzip  d4f744ef7f9a146268d1ea8741eea492
Lic_by-sa_Lang_es_NoBoilerplate_true_MinHtml_true-r-00022.seg-00000.warc.gz  923.58 MB  application/x-gzip  7671ae9f00e4a7a7578777bb2224b0f8
Lic_by-sa_Lang_et_NoBoilerplate_true_MinHtml_true-r-00023.seg-00000.warc.gz  97.93 MB  application/x-gzip  1023c88a8104b0e1ad35346f2a158631
Lic_by-sa_Lang_fa_NoBoilerplate_true_MinHtml_true-r-00004.seg-00000.warc.gz  13.59 MB  application/x-gzip  1e29ccbad407a51758d936f2671ae824
Lic_by-sa_Lang_fi_NoBoilerplate_true_MinHtml_true-r-00012.seg-00000.warc.gz  14.26 MB  application/x-gzip  bce64efa40c0019aca32dc826a3713f0
Lic_by-sa_Lang_fr_NoBoilerplate_true_MinHtml_true-r-00021.seg-00000.warc.gz  901.37 MB  application/x-gzip  81c366b5f41e5521832efafaddd9291d
Lic_by-sa_Lang_gu_NoBoilerplate_true_MinHtml_true-r-00024.seg-00000.warc.gz  11.31 MB  application/x-gzip  abd6adae95e8f6f825e0793bb23c67f8
Lic_by-sa_Lang_he_NoBoilerplate_true_MinHtml_true-r-00008.seg-00000.warc.gz  172.82 MB  application/x-gzip  5fcec1e5e12b9c1e27d4f787d82d3256
Lic_by-sa_Lang_hi_NoBoilerplate_true_MinHtml_true-r-00012.seg-00000.warc.gz  49.01 MB  application/x-gzip  c79826cb0ea0839c6a4328b2e987cea0
Lic_by-sa_Lang_hr_NoBoilerplate_true_MinHtml_true-r-00021.seg-00000.warc.gz  177.07 MB  application/x-gzip  7f5818c1a04b729b99fd7ea6b3bc27b6
Lic_by-sa_Lang_hu_NoBoilerplate_true_MinHtml_true-r-00024.seg-00000.warc.gz  179.49 MB  application/x-gzip  81ee126fbcb8a40ea19cbad0724c6012
Lic_by-sa_Lang_id_NoBoilerplate_true_MinHtml_true-r-00007.seg-00000.warc.gz  173.78 MB  application/x-gzip  37ea4b929cb47677b2eea9aaf4ad8fa1
Lic_by-sa_Lang_it_NoBoilerplate_true_MinHtml_true-r-00023.seg-00000.warc.gz  638.79 MB  application/x-gzip  d807ddd325c3c9f4004162445db1f19e
Lic_by-sa_Lang_ja_NoBoilerplate_true_MinHtml_true-r-00004.seg-00000.warc.gz  4.34 MB  application/x-gzip  d5816790f39ed4360306aadecb69aa81
Lic_by-sa_Lang_kn_NoBoilerplate_true_MinHtml_true-r-00017.seg-00000.warc.gz  27.41 MB  application/x-gzip  30a97d6c7960c5d4bc99d855061cd537
Lic_by-sa_Lang_ko_NoBoilerplate_true_MinHtml_true-r-00018.seg-00000.warc.gz  2.51 MB  application/x-gzip  08faf67ea8c5bf026229de8df211835f
Lic_by-sa_Lang_lt_NoBoilerplate_true_MinHtml_true-r-00023.seg-00000.warc.gz  46.11 MB  application/x-gzip  69a749fef69b44a47fcc139b8567f7c3
Lic_by-sa_Lang_lv_NoBoilerplate_true_MinHtml_true-r-00025.seg-00000.warc.gz  930.16 KB  application/x-gzip  badc0e7f9a13f814284755990a20463a
Lic_by-sa_Lang_mk_NoBoilerplate_true_MinHtml_true-r-00014.seg-00000.warc.gz  74.91 MB  application/x-gzip  38ca6dba720c782756aee05f26631fc4
Lic_by-sa_Lang_ml_NoBoilerplate_true_MinHtml_true-r-00015.seg-00000.warc.gz  26.82 MB  application/x-gzip  7c7697f9be921c495fcdf3450695c515
Lic_by-sa_Lang_mr_NoBoilerplate_true_MinHtml_true-r-00021.seg-00000.warc.gz  7.58 MB  application/x-gzip  4d0b79d5e6a8cf8c7f1ffade6a8782f4
Lic_by-sa_Lang_ne_NoBoilerplate_true_MinHtml_true-r-00008.seg-00000.warc.gz  6.05 MB  application/x-gzip  94133d95797ad3bd9afd077994873947
Lic_by-sa_Lang_nl_NoBoilerplate_true_MinHtml_true-r-00015.seg-00000.warc.gz  306.08 MB  application/x-gzip  39a273b6a1068b11f679476c6c90bd42
Lic_by-sa_Lang_no_NoBoilerplate_true_MinHtml_true-r-00018.seg-00000.warc.gz  161.83 MB  application/x-gzip  91f9b6ee49e0f83cecc5ce1297e39a1b
Lic_by-sa_Lang_pa_NoBoilerplate_true_MinHtml_true-r-00004.seg-00000.warc.gz  27.61 KB  application/x-gzip  479da214344c110e1103947a33d70737
Lic_by-sa_Lang_pl_NoBoilerplate_true_MinHtml_true-r-00015.seg-00000.warc.gz  322.94 MB  application/x-gzip  5e65344311a6a2309cf7e2112cc41b72
Lic_by-sa_Lang_pt_NoBoilerplate_true_MinHtml_true-r-00023.seg-00000.warc.gz  385.01 MB  application/x-gzip  6dd8f016a4755dca65cc85b2d705d969
Lic_by-sa_Lang_ro_NoBoilerplate_true_MinHtml_true-r-00018.seg-00000.warc.gz  98.72 MB  application/x-gzip  c9cb59e9207a5bf396905d50b480a0f2
Lic_by-sa_Lang_ru_NoBoilerplate_true_MinHtml_true-r-00024.seg-00000.warc.gz  53.82 MB  application/x-gzip  016b8932f0977a680660f9f5a75e9bef
Lic_by-sa_Lang_sk_NoBoilerplate_true_MinHtml_true-r-00014.seg-00000.warc.gz  52.19 MB  application/x-gzip  b1427cbf34030d9e76a7d6ec306e5286
Lic_by-sa_Lang_sl_NoBoilerplate_true_MinHtml_true-r-00015.seg-00000.warc.gz  48.71 MB  application/x-gzip  0b8855ab159e55c6ee7d17396b1260c0
Lic_by-sa_Lang_so_NoBoilerplate_true_MinHtml_true-r-00018.seg-00000.warc.gz  2.14 MB  application/x-gzip  da49a14cdfd842361f55c6637c2bd259
Lic_by-sa_Lang_sq_NoBoilerplate_true_MinHtml_true-r-00020.seg-00000.warc.gz  20.05 MB  application/x-gzip  d4e28185a495a3ad0fb9079bc103f792
Lic_by-sa_Lang_sv_NoBoilerplate_true_MinHtml_true-r-00025.seg-00000.warc.gz  192.57 MB  application/x-gzip  6dabf8683ac81362397439d328f2aee1
Lic_by-sa_Lang_sw_NoBoilerplate_true_MinHtml_true-r-00026.seg-00000.warc.gz  7.92 MB  application/x-gzip  d3f9ee518e17d891ff5e43d44ad5fbd5
Lic_by-sa_Lang_ta_NoBoilerplate_true_MinHtml_true-r-00004.seg-00000.warc.gz  41.22 MB  application/x-gzip  567227134b72d3abe84848cfe42f14f8
Lic_by-sa_Lang_te_NoBoilerplate_true_MinHtml_true-r-00008.seg-00000.warc.gz  30.16 MB  application/x-gzip  6a88b62b8feb883ea37abba8126da629
Lic_by-sa_Lang_th_NoBoilerplate_true_MinHtml_true-r-00011.seg-00000.warc.gz  4.56 MB  application/x-gzip  64ac2ac144cb1bdcbf2338bd6fe77c71
Lic_by-sa_Lang_tl_NoBoilerplate_true_MinHtml_true-r-00015.seg-00000.warc.gz  17.11 MB  application/x-gzip  14ce6c41091aa043b6daf0911c9f8597
Lic_by-sa_Lang_tr_NoBoilerplate_true_MinHtml_true-r-00021.seg-00000.warc.gz  36.28 MB  application/x-gzip  d3eb558aece6f3e1fccb4513d4db66b3
Lic_by-sa_Lang_uk_NoBoilerplate_true_MinHtml_true-r-00014.seg-00000.warc.gz  207.11 MB  application/x-gzip  7be2b288911bcf70ff9e77373e26a84e
Lic_by-sa_Lang_unknown_NoBoilerplate_true_MinHtml_true-r-00017.seg-00000.warc.gz  272.89 MB  application/x-gzip  ffdd57906fc0703460d25350eca41cd3
Lic_by-sa_Lang_ur_NoBoilerplate_true_MinHtml_true-r-00021.seg-00000.warc.gz  2.57 MB  application/x-gzip  ecd45aa5edef0212f2ac54dfc5ae0477
Lic_by-sa_Lang_vi_NoBoilerplate_true_MinHtml_true-r-00012.seg-00000.warc.gz  126.1 MB  application/x-gzip  865be65e848d0f75a7e6a02fb0a17b5d
Lic_by-sa_Lang_zh-cn_NoBoilerplate_true_MinHtml_true-r-00017.seg-00000.warc.gz  1 MB  application/x-gzip  68937bb315825f1f1fb2d148a76a750f
Lic_by-sa_Lang_zh-tw_NoBoilerplate_true_MinHtml_true-r-00026.seg-00000.warc.gz  350.96 KB  application/x-gzip  a624ab534f4b30c09c809949d1e6fdbf
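
Each file listed above comes with an MD5 checksum, so a download can be checked before use. The following is a minimal sketch in Python using only the standard library. The helper names (md5_of_file, verify_download, parse_name) are hypothetical and not part of the C4Corpus tooling. The file-name pattern is an assumption inferred from the names in this listing, not a documented specification.

```python
import hashlib
import re

# Assumed naming scheme, inferred from the file names in this record:
# Lic_<license>_Lang_<lang>_NoBoilerplate_<bool>_MinHtml_<bool>-r-<part>.seg-<segment>.warc.gz
NAME_PATTERN = re.compile(
    r"^Lic_(?P<license>[^_]+)_Lang_(?P<lang>[^_]+)"
    r"_NoBoilerplate_(?P<no_boilerplate>true|false)"
    r"_MinHtml_(?P<min_html>true|false)"
    r"-r-(?P<part>\d+)\.seg-(?P<segment>\d+)\.warc\.gz$"
)


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Compare a file's MD5 digest with the value from the listing
    (comparison ignores case and surrounding whitespace)."""
    return md5_of_file(path) == expected_md5.strip().lower()


def parse_name(file_name):
    """Split a corpus file name into its components, or return None
    if the name does not follow the assumed pattern."""
    match = NAME_PATTERN.match(file_name)
    return match.groupdict() if match else None
```

For example, after downloading the Afrikaans file, verify_download("Lic_by-sa_Lang_af_NoBoilerplate_true_MinHtml_true-r-00009.seg-00000.warc.gz", "3ba94b274d8b4bb742e65786bcb559d5") should return True if the transfer was intact. MD5 is used here only because it is the checksum this record publishes. It detects accidental corruption but is not a safeguard against deliberate tampering.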
