UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 3

Please use the following text to cite this item:
Zemánek, Petr; Pospíšil, Adam; Sellat, Hashem and Pecina, Pavel, 2023, UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 3, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-5924.
Date issued
2023
Size
83 minutes
Description
The corpus contains recordings of native speakers of North Levantine Arabic (apc), acquired in 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. There were 13 speakers in total (9 male and 4 female; one aged 15-20, seven aged 20-30, four aged 30-40, and one aged 40-50). The recordings contain both monologues and dialogues on topics of everyday life (health, education, family life, sports, culture), as well as information on both the host countries (living abroad) and the country of origin (Syrian traditions, education system, etc.). Both types are spontaneous: the participants were given only a general subject and talked about it or discussed it freely. The transcription and translation team consisted of students of Arabic at Charles University, with an additional quality check provided by native speakers of the dialect.
The textual data is split between parallel transcriptions (.apc) and translations (.eng), with one segment per line. An additional .yaml file maps each segment to the corresponding audio file (with the duration and offset in the "%S.%03d" format, i.e., seconds and milliseconds) and to a unique speaker ID. The audio data is shared in the 48 kHz .wav format, with dialogues and monologues in separate folders. All recordings are mono (single channel). For dialogues, there is a separate file for each speaker, e.g., "16072022_Family-01.wav" and "16072022_Family-02.wav".
The data in this repository corresponds to the test split of the dialectal Arabic to English shared task hosted at the 22nd edition of the International Conference on Spoken Language Translation (IWSLT 2025).
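The layout described above (one segment per line in the .apc and .eng files, offsets and durations given in seconds, 48 kHz mono audio) could be consumed roughly as in the following sketch. It uses only the Python standard library. The function names are illustrative, UTF-8 encoding of the text files is an assumption, and because the field names inside the .yaml file are not documented here, the helpers take the offset and duration as plain numbers instead of parsing the mapping file.

```python
import wave


def read_parallel(apc_path, eng_path):
    """Pair each transcription line with the translation on the same line.

    Assumes both files are UTF-8 text with one segment per line.
    """
    with open(apc_path, encoding="utf-8") as handle:
        apc = [line.rstrip("\n") for line in handle]
    with open(eng_path, encoding="utf-8") as handle:
        eng = [line.rstrip("\n") for line in handle]
    if len(apc) != len(eng):
        raise ValueError(f"line count mismatch: {len(apc)} vs {len(eng)}")
    return list(zip(apc, eng))


def segment_frames(offset_s, duration_s, sample_rate=48_000):
    """Convert an offset and duration in seconds to (first_frame, frame_count)."""
    start = round(offset_s * sample_rate)
    count = round(duration_s * sample_rate)
    return start, count


def read_segment(wav_path, offset_s, duration_s):
    """Return the raw PCM bytes of one segment from a PCM .wav file."""
    with wave.open(str(wav_path), "rb") as wav:
        # Use the rate stored in the file header rather than assuming 48 kHz.
        start, count = segment_frames(offset_s, duration_s, wav.getframerate())
        wav.setpos(start)
        return wav.readframes(count)
```

For example, `read_parallel("test2025.apc", "test2025.eng")` would return a list of (transcription, translation) pairs, and `read_segment` could then be called with the offset and duration that the .yaml file lists for a given segment.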
Files in this item
Name
test2025.eng
Size
44.76 KB
Format
application/octet-stream
Description
Speech translation to eng
MD5
e18342a647911dd72c039b765ba2c3b4
Name
test2025.yaml
Size
110.28 KB
Format
application/octet-stream
Description
Audio-to-text segment mapping
MD5
9c7cb2c370c7643a77aefcfa9f795fba
Name
test2025_wav.zip
Size
509 MB
Format
application/zip
Description
Audio files
MD5
d51567cd21ee79307d2b2e9d46e821d9
Preview
  • test2025_wav
    • Audio-Dialogues
      • D-Tar220623_Sport-02.wav (120 MB)
      • D-Tar_220623_Travel-01.wav (97 MB)
      • D-Tar_220623_Coffee-01.wav (60 MB)
      • D-Tar220623_Sport-01.wav (120 MB)
      • D-Tar_220623_Travel-02.wav (97 MB)
      • D-Tar_220623_Coffee-02.wav (60 MB)
    • Audio-Monologues
      • Dam_16122020.wav (41 MB)
      • Jor_210429.wav (24 MB)
      • Dam_17082021_2.wav (11 MB)
      • Dam_17082021_1.wav (7 MB)
Name
test2025.apc
Size
56.37 KB
Format
application/octet-stream
Description
Speech transcription to apc
MD5
29eba14471926d2018bc70e27c155d5f
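The MD5 values listed for each file can be used to check a download for corruption. The sketch below shows one way to do that with Python's standard `hashlib` module. The helper names `md5_of_file` and `verify` are illustrative, and the paths passed in are assumed to point at local copies of the files.

```python
import hashlib

# Checksums as listed for the files in this item.
EXPECTED_MD5 = {
    "test2025.eng": "e18342a647911dd72c039b765ba2c3b4",
    "test2025.yaml": "9c7cb2c370c7643a77aefcfa9f795fba",
    "test2025_wav.zip": "d51567cd21ee79307d2b2e9d46e821d9",
    "test2025.apc": "29eba14471926d2018bc70e27c155d5f",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at `path` matches the listed checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

Reading in chunks keeps memory use flat for the 509 MB archive. A call such as `verify("test2025_wav.zip", "test2025_wav.zip")` returning False would suggest an incomplete or altered download.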