Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems. It ships in three parts: Czech data, English data, and scripts.
The data comprise over 41 hours of speech in English and over 15 hours in Czech, plus orthographic transcriptions. The scripts implement data pre-processing and building acoustic models using the HTK and Kaldi toolkits.
This is the English data part of the dataset. This research was funded by the Ministry of
Education, Youth and Sports of the Czech Republic under grant agreement
LK11221.
A set of corpora for 120 languages, automatically collected from Wikipedia and the web.
Collected using the W2C toolset: http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
A tool used to build multilingual corpora from Wikipedia. It downloads web pages, converts them to plain text, identifies their language, etc.
A set of 120 corpora collected using this tool is available at https://ufal-point.mff.cuni.cz/xmlui/handle/11858/00-097C-0000-0022-6133-9
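
The steps named in the tool description (download a page, convert it to plain text, identify its language) can be illustrated with a minimal sketch. This is not the W2C implementation, whose internals are not described here. All names below (`download`, `html_to_text`, `identify_language`, `STOPWORDS`) are hypothetical, and the stopword-counting language identifier is a toy stand-in chosen only to keep the example self-contained.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def download(url):
    """Step 1: fetch a page (assumes UTF-8; undecodable bytes are replaced)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def html_to_text(html):
    """Step 2: strip markup and return the visible text as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Tiny illustrative stopword lists (an assumption, not taken from W2C).
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "cs": {"a", "je", "se", "na", "že", "v"},
}


def identify_language(text):
    """Step 3: pick the language whose stopwords occur most often, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With these helpers, `identify_language(html_to_text(download(url)))` chains the three steps for a single page. A real corpus builder would need far more robust language identification, along with deduplication and encoding detection.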