UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 2
Please use the following text to cite this item:
Zemánek, Petr; Pospíšil, Adam; Sellat, Hashem; Krubiński, Mateusz and Pecina, Pavel, 2023,
UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 2, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-5519.
Date issued
2023
Size
110 minutes
Description
The corpus contains recordings by native speakers of North Levantine Arabic (apc), acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. Altogether, there were 13 speakers: 9 male and 4 female, with one aged 15-20, seven aged 20-30, four aged 30-40, and one aged 40-50.
The recordings contain both monologues and dialogues on topics of everyday life (health, education, family life, sports, culture), as well as information on the host countries (living abroad) and the country of origin (Syrian traditions, education system, etc.). Both types are spontaneous: the participants were given only a general subject and either talked about it or discussed it freely. The transcription and translation team consisted of students of Arabic at Charles University, with an additional quality check provided by native speakers of the dialect.
The textual data is split between the (parallel) transcriptions (.apc) and translations (.eng), with one segment per line. The additional .yaml file provides, for each segment, a mapping to the corresponding audio file (with the duration and offset in the "%S.%03d" format, i.e., seconds and milliseconds) and a unique speaker ID.
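Because the .apc and .eng files are parallel with one segment per line, they can be read side by side. The following is a minimal sketch rather than official tooling: the function name `load_parallel` is hypothetical, and UTF-8 encoding is an assumption that the description does not state.

```python
def _read_lines(path, encoding):
    """Return the lines of a text file without trailing newlines."""
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]


def load_parallel(apc_path, eng_path, encoding="utf-8"):
    """Pair each transcription line with the translation on the same line.

    Assumption: both files are UTF-8 text. The one-segment-per-line
    layout is taken from the corpus description.
    """
    apc_lines = _read_lines(apc_path, encoding)
    eng_lines = _read_lines(eng_path, encoding)
    if len(apc_lines) != len(eng_lines):
        # Parallel files must have the same number of segments.
        raise ValueError(
            f"segment count mismatch: {len(apc_lines)} vs {len(eng_lines)}"
        )
    return list(zip(apc_lines, eng_lines))
```

A call such as `load_parallel("test2024.apc", "test2024.eng")` would return a list of (transcription, translation) pairs in file order, so the i-th pair should correspond to the i-th entry of the .yaml mapping.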
The audio data is shared in the 48 kHz .wav format, with dialogues and monologues in separate folders. All recordings are mono (single channel). For dialogues, there is a separate file for each speaker, e.g., "16072022_Family-01.wav" and "16072022_Family-02.wav".
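Given an offset and a duration in seconds (as provided by the .yaml mapping), a segment can be cut out of a recording by converting both values to frame counts at the file's sample rate. The sketch below uses only Python's standard `wave` module; the helper names `seconds_to_frames` and `read_segment` are hypothetical, and the sample width of the files is not assumed, so the raw bytes are returned as-is.

```python
import wave

SAMPLE_RATE = 48_000  # sampling rate stated in the corpus description


def seconds_to_frames(seconds, sample_rate=SAMPLE_RATE):
    """Convert a value such as "1.234" (seconds.milliseconds) to frames."""
    return int(round(float(seconds) * sample_rate))


def read_segment(wav_path, offset, duration):
    """Return the raw PCM bytes of one segment of a .wav file.

    The sample rate is read from the file header rather than assumed,
    so the function also works if a file deviates from 48 kHz.
    """
    with wave.open(wav_path, "rb") as wav:
        rate = wav.getframerate()
        wav.setpos(seconds_to_frames(offset, rate))
        return wav.readframes(seconds_to_frames(duration, rate))
```

At 48 kHz, one millisecond corresponds to exactly 48 frames, so millisecond-precision offsets map to whole frame indices.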
The data provided in this repository corresponds to the test split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
Acknowledgement
European Union
Project code: EC/H2020/870930
Project name: WELCOME - Multiple Intelligent Conversation Agent Services for Reception, Management and Integration of Third Country Nationals in the EU
This item is Publicly Available
and licensed under:
Files in this item
- Name: test2024.yaml
- Size: 93.35 KB
- Format: application/octet-stream
- Description: Unknown
- MD5: 1c95e7c3ffa40869705dd7e0b80c85b1

- Name: test2024.apc
- Size: 80.88 KB
- Format: application/octet-stream
- Description: Unknown
- MD5: 1f8da210de21ca727e84bde564350f00

- Name: test2024.eng
- Size: 63.03 KB
- Format: application/octet-stream
- Description: Unknown
- MD5: f7314f5004c75834d692216732c94c6c

- Name: test2024_wav.zip
- Size: 559.94 MB
- Format: application/zip
- Description: Zip
- MD5: c9d04c06a5f09a1a4b83413fb321a550

- Audio-Dialogues
  - 16072022_Family-01.wav (108 MB)
  - D_220623_Windows-01.wav (54 MB)
  - 16072022_Family-02.wav (108 MB)
  - D_220623_Windows-02.wav (54 MB)
- Audio-Monologues
  - Lat_210228.wav (39 MB)
  - Dam_21082021.wav (88 MB)
  - Hom_19082021.wav (14 MB)
  - Dam_09082021.wav (61 MB)
  - Hom24082021.wav (33 MB)
  - Hom_28082021.wav (19 MB)
  - Hom_03072021_2.wav (30 MB)
  - Hom_1072021.wav (15 MB)
  - Hom_03072021_1.wav (18 MB)
  - Hom_2082021.wav (30 MB)
  - Hom_26072021.wav (18 MB)
  - Sed_17082021.wav (16 MB)
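
The MD5 checksums listed for the four files can be used to check a download for corruption. The sketch below is one possible way to do this with Python's standard `hashlib`; the names `EXPECTED_MD5`, `md5_of`, and `verify` are hypothetical, and the files are assumed to be stored locally under their original names.

```python
import hashlib
import os

# Checksums copied from the file list of this item.
EXPECTED_MD5 = {
    "test2024.yaml": "1c95e7c3ffa40869705dd7e0b80c85b1",
    "test2024.apc": "1f8da210de21ca727e84bde564350f00",
    "test2024.eng": "f7314f5004c75834d692216732c94c6c",
    "test2024_wav.zip": "c9d04c06a5f09a1a4b83413fb321a550",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Return True if the file's MD5 matches the listed checksum.

    Assumption: the local file keeps its original name, which is used
    to look up the expected value.
    """
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of(path) == expected
```

Reading in chunks keeps memory use low for the roughly 560 MB audio archive.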

