HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India
- Title:
- HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India
- Creator:
- Bafna, Niyati; Žabokrtský, Zdeněk; España-Bonet, Cristina; van Genabith, Josef; Kumar, Lalit "Samyak Lalit"; Suman, Sharda; and Shivay, Rahul
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and Kavita Kosh Project
- Identifier:
- http://hdl.handle.net/11234/1-4787
- Subject:
- dialect continuum, dialect variation, Indic, Indo-Aryan, Indian, and Hindi
- Type:
- text and corpus
- Description:
- Languages
  This is a collection of folksongs in 26 languages that form a dialect continuum in North India and nearby regions: Angika, Awadhi, Baiga, Bengali, Bhadrawahi, Bhili, Bhojpuri, Braj, Bundeli, Chhattisgarhi, Garhwali, Gujarati, Haryanvi, Himachali, Hindi, Kanauji, Khadi Boli, Korku, Kumaoni, Magahi, Malvi, Marathi, Nimadi, Panjabi, Rajasthani, and Sanskrit. The data was originally collected by the Kavita Kosh Project at http://www.kavitakosh.org/ .
  Main characteristics of the languages in this collection:
  - They are all Indic languages except for Korku.
  - The majority are genealogically closely related to standard Hindi (such as Haryanvi and Bhojpuri), although the collection also contains more distant relatives such as Bengali and Gujarati.
  - All except Nepali are primarily spoken in (North) India.
  - All except Sanskrit are living languages.

  Data
  Categorised by the NLP resources already available for them, the languages fall into three bands:
  * Band 1 languages: Hindi, Marathi, Punjabi, Sindhi, Gujarati, Bengali, Nepali. These languages already have other large datasets available. Since Kavita Kosh focusses largely on Hindi-related languages, this particular dataset may contain very little data for these other languages.
  * Band 2 languages: Bhojpuri, Magahi, Awadhi, Brajbhasha. These languages attract growing interest and have some datasets, although these are relatively small compared to Band 1 language resources.
  * Band 3 languages: All other languages in the collection, which were previously zero-resource languages. These are the languages for which this dataset is the most relevant.

  Script
  This dataset is entirely in Devanagari. Content in languages not usually written in Devanagari (such as Bengali and Gujarati) has been transliterated by the Kavita Kosh Project.

  Format
  The data is organised by language, with each folksong stored in a separate JSON file.
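  The per-language, one-file-per-folksong layout described under Format could be read along the following lines. This is only a minimal sketch: the record does not specify the directory names or the JSON fields, so the layout assumed here (one subdirectory per language under a root folder, UTF-8 encoded `.json` files) and the helper name `load_folksongs` are hypothetical.

```python
import json
from pathlib import Path


def load_folksongs(root):
    """Return {language: [parsed JSON object, ...]} for a corpus directory.

    Assumption (not confirmed by the record): `root` holds one
    subdirectory per language, each containing one UTF-8 JSON file
    per folksong.
    """
    corpus = {}
    for lang_dir in sorted(Path(root).iterdir()):
        if not lang_dir.is_dir():
            continue  # skip stray files such as a README
        songs = []
        for path in sorted(lang_dir.glob("*.json")):
            # Devanagari text should be read explicitly as UTF-8
            with path.open(encoding="utf-8") as handle:
                songs.append(json.load(handle))
        corpus[lang_dir.name] = songs
    return corpus
```

  The function makes no assumption about the keys inside each JSON file; inspecting one loaded object is the safest way to learn the actual schema before further processing.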
- Language:
- Hindi, Marathi, Magahi, Awadhi, Bhojpuri, Braj, Haryanvi, Rajasthani, Korku, Garhwali, Chhattisgarhi, Bhili, Sanskrit, Angika, Bundeli, Kumaoni, Bhadrawahi, Bengali, Gujarati, Panjabi, Nimadi, Kanauji, Malvi, and Uncoded languages
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
- http://creativecommons.org/licenses/by-nc-sa/4.0/
- PUB
- Relation:
- http://hdl.handle.net/11234/1-4839
- Source:
- https://github.com/niyatibafna/north-indian-dialect-modelling
- Harvested from:
- LINDAT/CLARIAH-CZ repository
- Metadata only:
- false
- Date:
- 2022-07-14