The YouTube Corpus of Singapore English Podcasts
Please use the following text to cite this item:
Coats, Steven; Basile, Carmelo Alessandro; Morin, Cameron and Fuchs, Robert, 2025,
The YouTube Corpus of Singapore English Podcasts, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11372/LRT-5984.
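Authors
Steven Coats, Carmelo Alessandro Basile, Cameron Morin, Robert Fuchs
Item identifier
http://hdl.handle.net/11372/LRT-5984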
Date issued
2025
Size
8,380,826 tokens
620.29 hours
757,072 utterances
Language(s)
Description
The YouTube Corpus of Singapore English Podcasts (YCSEP) contains transcripts of more than 1,300 podcast episodes, about 620 hours of audio in total, by Singapore-based content creators. The transcripts are diarized into individual speaker turns: the dataset comprises over 757,000 turns and 8.38 million word tokens. It was created with a pipeline comprising yt-dlp, WhisperX, and pyannote.audio, and is intended to advance the study of the linguistic and discourse properties of Singapore English.
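The size figures reported for this item imply some rough per-turn and per-minute averages. The sketch below only does that arithmetic. The function name `derived_rates` is illustrative and not part of the dataset, and the per-minute and per-turn durations are computed against total audio time (which presumably includes pauses and non-speech), so they should be read as coarse estimates rather than speech rates.

```python
# Figures copied from the "Size" field of this record.
TOKENS = 8_380_826
HOURS = 620.29
UTTERANCES = 757_072


def derived_rates(tokens, hours, utterances):
    """Return coarse averages implied by the corpus totals."""
    return {
        # Average length of a speaker turn, in word tokens.
        "tokens_per_utterance": tokens / utterances,
        # Tokens per minute of audio (not per minute of speech).
        "tokens_per_minute": tokens / (hours * 60),
        # Audio seconds available per turn (total duration / turn count).
        "seconds_per_utterance": hours * 3600 / utterances,
    }


if __name__ == "__main__":
    for name, value in derived_rates(TOKENS, HOURS, UTTERANCES).items():
        print(f"{name}: {value:.2f}")
```

On these totals a turn averages roughly 11 tokens, which suggests fairly short, conversational turns.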
Publisher
Acknowledgement
Research Council of Finland
Project code: 358727
Project name: Kielivarojen ja kieliteknologian tutkimusinfrastruktuuri (Research infrastructure for language resources and language technology)
Collections
This item is Publicly Available
and licensed under:
Files in this item
- Name: YCSEP_static.csv
- Size: 351.12 MB
- Format: text/csv
- Description: CSV
- MD5: e919fda55548b684606c3aeb8dfa4d24
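The MD5 checksum listed for the file can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch, assuming the file has been saved as `YCSEP_static.csv` in the current directory and is UTF-8 encoded. The helper names `md5_of_file` and `read_header` are illustrative. Because the column names are not documented in this record, the sketch only prints the header row instead of assuming any particular columns.

```python
import csv
import hashlib
from pathlib import Path

# Checksum as listed in the file entry of this record.
EXPECTED_MD5 = "e919fda55548b684606c3aeb8dfa4d24"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_header(path, encoding="utf-8"):
    """Return the first row of a CSV file without loading the rest."""
    with open(path, newline="", encoding=encoding) as handle:
        return next(csv.reader(handle))


if __name__ == "__main__":
    # Assumed local download location; adjust as needed.
    csv_path = Path("YCSEP_static.csv")
    if csv_path.exists():
        actual = md5_of_file(csv_path)
        if actual == EXPECTED_MD5:
            print("checksum OK")
        else:
            print(f"checksum mismatch: {actual}")
        print(read_header(csv_path))
    else:
        print(f"{csv_path} not found; download it from the record first")
```

Reading in chunks keeps memory use flat for a file of this size (351.12 MB), and inspecting the header first is a cheap way to learn the column layout before loading the full table.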


