dc.contributor.author | Spoustová, Johanka |
dc.contributor.author | Spousta, Miroslav |
dc.date.accessioned | 2012-06-21T11:53:56Z |
dc.date.available | 2012-06-21T11:53:56Z |
dc.date.issued | 2012-06-21 |
dc.identifier.uri | http://hdl.handle.net/11858/00-097C-0000-0006-B847-6 |
dc.description | Web corpus of Czech, created in 2011. Contains newspapers and magazines, discussions, and blogs. See http://www.lrec-conf.org/proceedings/lrec2012/summaries/120.html for details. |
dc.description.sponsorship | GA405/09/0278 |
dc.language.iso | ces |
dc.publisher | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
dc.rights | Creative Commons - Attribution 3.0 Unported (CC BY 3.0) |
dc.rights.uri | http://creativecommons.org/licenses/by/3.0/ |
dc.subject | corpus |
dc.subject | Czech |
dc.subject | web |
dc.title | CWC2011 |
dc.type | corpus |
metashare.ResourceInfo#ContactInfo#PersonInfo.surname | Spoustová |
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName | Johanka |
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName | Charles University in Prague, UFAL |
metashare.ResourceInfo#DistributionInfo.availability | unrestrictedUse |
metashare.ResourceInfo#DistributionInfo#LicenseInfo.distributionAccessMedium | download |
metashare.ResourceInfo#ValidationInfo.validated | True |
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName | #1-Internet as a Language Corpus |
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType | #1-National |
metashare.ResourceInfo#ContentInfo.mediaType | text |
metashare.ResourceInfo#TextInfo#LanguageInfo.languageCoding | ces |
metashare.ResourceInfo#TextInfo#SizeInfo.size | 2650000000 |
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit | words |
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo#CommunicationInfo.email | johanka@ucw.cz |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
sponsor | Grantová agentura České republiky GA405/09/0278 Internet jako jazykový korpus nationalFunds |
size.info | 2650000000 words |
files.size | 6074441470 |
files.count | 6 |
featuredService.kontext | basic|https://lindat.mff.cuni.cz/services/kontext/first_form?corpname=cwc_11_cs_w |
featuredService.kontext | with syntactic annotation|https://lindat.mff.cuni.cz/services/kontext/first_form?corpname=cwc_parsed_cs_a |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 3.0 Unported (CC BY 3.0)

- Name: plain.articles_shuffled.txt.bz2
- Size: 1.17 GB
- Format: application/x-bzip2
- Description: Articles, 700M tokens, sentence-shuffled, plain forms only, sentence-breaks (<s>), one token per line. UTF-8.
- MD5: cf9bc9b5d0425af41e3f40dcef62c2e1

- Name: plain.blogs_shuffled.txt.bz2
- Size: 2.16 GB
- Format: application/x-bzip2
- Description: Blogs, 1.2B tokens, sentence-shuffled, plain forms only, sentence-breaks (<s>), one token per line. UTF-8.
- MD5: b37a4cdf02b414793adbb2bab7d5641a

- Name: plain.discussions_shuffled.txt.bz2
- Size: 2.27 GB
- Format: application/x-bzip2
- Description: Discussions, 1.4B tokens, sentence-shuffled, plain forms only, sentence-breaks (<s>), one token per line. UTF-8.
- MD5: 0cccab42183d211515dfbed99aa48b26
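
The three plain-text files are described as bzip2-compressed UTF-8 with one token per line and `<s>` marking sentence breaks. A minimal Python sketch for streaming sentences from such a file follows. The function names are hypothetical, and the record does not say whether `<s>` opens or closes a sentence, so the sketch assumes only that it separates sentences and treats it as a boundary either way.

```python
import bz2
from typing import Iterator, List, TextIO


def iter_sentences(stream: TextIO, sentence_marker: str = "<s>") -> Iterator[List[str]]:
    """Yield each sentence as a list of tokens from a one-token-per-line stream.

    Assumption: a line equal to `sentence_marker` is a sentence boundary.
    Empty lines are skipped.
    """
    sentence: List[str] = []
    for line in stream:
        token = line.rstrip("\r\n")
        if token == sentence_marker:
            # Boundary reached: emit whatever has been collected so far.
            if sentence:
                yield sentence
            sentence = []
        elif token:
            sentence.append(token)
    # Emit the trailing sentence if the file does not end with a marker.
    if sentence:
        yield sentence


def iter_corpus_sentences(path: str) -> Iterator[List[str]]:
    """Stream sentences from a bzip2-compressed file without unpacking it to disk."""
    with bz2.open(path, mode="rt", encoding="utf-8") as stream:
        yield from iter_sentences(stream)
```

Streaming avoids holding a multi-gigabyte file in memory; for example, `for sentence in iter_corpus_sentences("plain.articles_shuffled.txt.bz2"): ...` processes one sentence at a time.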

- Name: urls-articles.bz2
- Size: 20.58 MB
- Format: application/x-bzip2
- Description: URL list of the articles section
- MD5: 1a2034c69c80225d666ff80526b7c884

- Name: urls-blogs.bz2
- Size: 31.5 MB
- Format: application/x-bzip2
- Description: URL list of the blogs section
- MD5: 34a1e6760880d661d7ab7a2da94c9a70

- Name: urls-discussions.bz2
- Size: 14.12 MB
- Format: application/x-bzip2
- Description: URL list of the discussions section
- MD5: 858ec3d95e6eae67a8a15241dc499801
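
Each file above is listed with an MD5 checksum, which can be used to confirm that a download is complete. The following is a sketch of one way to check it in Python; the checksums are copied from this listing, while `md5_of_file`, `verify_download`, and the chunk size are illustrative choices and not part of the record.

```python
import hashlib

# MD5 checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "plain.articles_shuffled.txt.bz2": "cf9bc9b5d0425af41e3f40dcef62c2e1",
    "plain.blogs_shuffled.txt.bz2": "b37a4cdf02b414793adbb2bab7d5641a",
    "plain.discussions_shuffled.txt.bz2": "0cccab42183d211515dfbed99aa48b26",
    "urls-articles.bz2": "1a2034c69c80225d666ff80526b7c884",
    "urls-blogs.bz2": "34a1e6760880d661d7ab7a2da94c9a70",
    "urls-discussions.bz2": "858ec3d95e6eae67a8a15241dc499801",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, listed_name: str) -> bool:
    """Compare a local file against the checksum listed for `listed_name`."""
    return md5_of_file(path) == EXPECTED_MD5[listed_name]
```

For example, `verify_download("plain.blogs_shuffled.txt.bz2", "plain.blogs_shuffled.txt.bz2")` returns `True` only if the local copy matches the listed checksum. MD5 is adequate here for detecting transfer corruption, not for security purposes.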