
The YouTube Corpus of Singapore English Podcasts

Please use the following text to cite this item:
Coats, Steven; Basile, Carmelo Alessandro; Morin, Cameron and Fuchs, Robert, 2025, The YouTube Corpus of Singapore English Podcasts, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11372/LRT-5984.
Date issued
2025
Size
8,380,826 tokens
620.29 hours
757,072 utterances
Language(s)
Description
The YouTube Corpus of Singapore English Podcasts (YCSEP) contains transcripts of more than 1,300 podcast episodes, about 620 hours of audio in total, by Singapore-based content creators. The transcripts are diarized into individual speaker turns: the dataset comprises over 757,000 turns and 8.38 million word tokens. The corpus was created with a pipeline comprising yt-dlp, WhisperX, and pyannote.audio, and is intended to advance the study of the linguistic and discourse properties of Singapore English.
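The size figures listed for this item allow a few rough per-unit rates to be derived. The sketch below is only arithmetic on those published totals; the function name `corpus_rates` and the dictionary keys are illustrative, and because the hour count presumably includes pauses and non-speech audio, the results are approximate averages rather than measured speech rates.

```python
# Totals as published in the "Size" field of this record.
TOKENS = 8_380_826
HOURS = 620.29
UTTERANCES = 757_072


def corpus_rates(tokens, hours, utterances):
    """Derive simple average rates from corpus totals (illustrative helper)."""
    return {
        "tokens_per_utterance": tokens / utterances,
        "tokens_per_hour": tokens / hours,
        "utterances_per_hour": utterances / hours,
        "seconds_per_utterance": hours * 3600 / utterances,
    }


rates = corpus_rates(TOKENS, HOURS, UTTERANCES)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

On these totals, an utterance averages roughly 11 tokens and spans about 3 seconds of audio.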
 Files in this item
Name
YCSEP_static.csv
Size
351.12 MB
Format
text/csv
Description
CSV
MD5
e919fda55548b684606c3aeb8dfa4d24
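The MD5 value above can be used to check that a download of the file is complete and uncorrupted. A minimal sketch using Python's standard `hashlib` module follows; the local path `YCSEP_static.csv` and the helper name `md5_of_file` are assumptions for illustration. The file is read in chunks so that the 351 MB CSV need not fit in memory at once.

```python
import hashlib
from pathlib import Path

# Checksum published on this page for YCSEP_static.csv.
EXPECTED_MD5 = "e919fda55548b684606c3aeb8dfa4d24"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumed local path; adjust to wherever the download was saved.
local_copy = Path("YCSEP_static.csv")
if local_copy.exists():
    actual = md5_of_file(local_copy)
    if actual == EXPECTED_MD5:
        print("checksum OK")
    else:
        print(f"checksum mismatch: {actual}")
else:
    print(f"{local_copy} not found; download it first")
```

A mismatch usually indicates an interrupted or altered download, in which case fetching the file again is the simplest remedy.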