Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
- Name
- plain.articles_shuffled.txt.bz2
- Size
- 1.17 GB
- Format
- application/x-bzip2
- Description
- Articles, 700M tokens, sentence-shuffled, plain forms only, sentence-breaks (<s>), one token per line. UTF-8.
- MD5
- cf9bc9b5d0425af41e3f40dcef62c2e1
- Name
- plain.blogs_shuffled.txt.bz2
- Size
- 2.16 GB
- Format
- application/x-bzip2
- Description
- Blogs, 1.2B tokens, sentence-shuffled, plain forms only, sentence-breaks (<s>), one token per line. UTF-8.
- MD5
- b37a4cdf02b414793adbb2bab7d5641a
- Name
- plain.discussions_shuffled.txt.bz2
- Size
- 2.27 GB
- Format
- application/x-bzip2
- Description
- Discussions, 1.4B tokens, sentence-shuffled, plain forms only, sentence-breaks (<s>), one token per line. UTF-8.
- MD5
- 0cccab42183d211515dfbed99aa48b26
- Name
- urls-articles.bz2
- Size
- 20.58 MB
- Format
- application/x-bzip2
- Description
- URL list of the articles section
- MD5
- 1a2034c69c80225d666ff80526b7c884
- Name
- urls-blogs.bz2
- Size
- 31.5 MB
- Format
- application/x-bzip2
- Description
- URL list of the blogs section
- MD5
- 34a1e6760880d661d7ab7a2da94c9a70
- Name
- urls-discussions.bz2
- Size
- 14.12 MB
- Format
- application/x-bzip2
- Description
- URL list of the discussions section
- MD5
- 858ec3d95e6eae67a8a15241dc499801
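
The listing above gives an MD5 checksum for each file and describes the corpus files as bzip2-compressed UTF-8 text with one token per line and sentence breaks marked by <s>. The following is a minimal Python sketch of how a download might be checked and read. The function names `md5_of_file` and `iter_sentences` are hypothetical, and the code assumes that the <s> marker sits alone on its own line; the listing does not say whether the marker opens or closes a sentence, so the sketch simply treats it as a separator.

```python
import bz2
import hashlib
from typing import Iterator, List


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_sentences(path: str, marker: str = "<s>") -> Iterator[List[str]]:
    """Yield sentences as token lists from a bzip2 file with one token per line.

    Assumption: `marker` appears alone on a line and separates sentences.
    """
    sentence: List[str] = []
    # Text mode decompresses and decodes the stream lazily, line by line.
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            token = line.rstrip("\n")
            if token == marker:
                if sentence:
                    yield sentence
                sentence = []
            elif token:
                sentence.append(token)
    # Emit the last sentence if the file does not end with a marker.
    if sentence:
        yield sentence
```

For an intact download, `md5_of_file("plain.articles_shuffled.txt.bz2")` should equal the value listed above, `cf9bc9b5d0425af41e3f40dcef62c2e1`. Both functions stream their input, so the multi-gigabyte files never need to be decompressed to disk or held in memory.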