Files in this item

This item is publicly available and licensed under Creative Commons - Attribution 3.0 Unported (CC BY 3.0). Attribution is required.
Name: plain.articles_shuffled.txt.bz2
Size: 1.17 GB
Format: application/x-bzip2
Description: Articles, 700M tokens, sentence-shuffled, plain forms only, sentence-breaks (<s>), one token per line. UTF-8.
MD5: cf9bc9b5d0425af41e3f40dcef62c2e1

Name: plain.blogs_shuffled.txt.bz2
Size: 2.16 GB
Format: application/x-bzip2
Description: Blogs, 1.2B tokens, sentence-shuffled, plain forms only, sentence-breaks (<s>), one token per line. UTF-8.
MD5: b37a4cdf02b414793adbb2bab7d5641a

Name: plain.discussions_shuffled.txt.bz2
Size: 2.27 GB
Format: application/x-bzip2
Description: Discussions, 1.4B tokens, sentence-shuffled, plain forms only, sentence-breaks (<s>), one token per line. UTF-8.
MD5: 0cccab42183d211515dfbed99aa48b26

Name: urls-articles.bz2
Size: 20.58 MB
Format: application/x-bzip2
Description: URL list of the articles section
MD5: 1a2034c69c80225d666ff80526b7c884

Name: urls-blogs.bz2
Size: 31.5 MB
Format: application/x-bzip2
Description: URL list of the blogs section
MD5: 34a1e6760880d661d7ab7a2da94c9a70

Name: urls-discussions.bz2
Size: 14.12 MB
Format: application/x-bzip2
Description: URL list of the discussions section
MD5: 858ec3d95e6eae67a8a15241dc499801
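
The listing gives an MD5 checksum for every file and describes the corpus files as UTF-8, one token per line, with <s> marking sentence breaks. The following Python sketch shows one way to verify a download and stream sentences without decompressing the whole archive to disk. The function names md5_of_file and iter_sentences are illustrative, not part of the dataset. The sketch assumes that <s> sits alone on its own line; the listing does not say whether the marker opens or closes a sentence, so it is treated as a plain separator, which works in either case.

```python
import bz2
import hashlib
from typing import Iterator, List


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_sentences(path: str, marker: str = "<s>") -> Iterator[List[str]]:
    """Yield sentences as token lists from a bzip2 file with one token per line.

    Assumption: the sentence-break marker occupies a line of its own.
    It is treated as a separator, so it may open or close a sentence.
    """
    sentence: List[str] = []
    with bz2.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            token = line.rstrip("\n")
            if token == marker:
                if sentence:
                    yield sentence
                sentence = []
            elif token:
                sentence.append(token)
    # Emit the last sentence if the file does not end with a marker.
    if sentence:
        yield sentence
```

For an intact download, md5_of_file("plain.articles_shuffled.txt.bz2") should equal the MD5 value listed for that file above. Because iter_sentences is a generator, a loop over it can stop after the first few sentences without reading the rest of the archive.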