Bafna, Niyati; Žabokrtský, Zdeněk; España-Bonet, Cristina; van Genabith, Josef; Kumar, Lalit "Samyak Lalit"; Suman, Sharda; Shivay, Rahul
2022-09-16T14:57:43Z 2022-09-16T14:57:43Z 2022-07-14
dc.description HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India

Languages

This is a collection of folksongs for 26 languages that form a dialect continuum in North India and nearby regions, namely Angika, Awadhi, Baiga, Bengali, Bhadrawahi, Bhili, Bhojpuri, Braj, Bundeli, Chhattisgarhi, Garhwali, Gujarati, Haryanvi, Himachali, Hindi, Kanauji, Khadi Boli, Korku, Kumaoni, Magahi, Malvi, Marathi, Nimadi, Panjabi, Rajasthani, Sanskrit. The data was originally collected by the Kavita Kosh Project.

Main characteristics of the languages in this collection:
- They are all Indic languages except for Korku.
- The majority of them are genealogically closely related to standard Hindi (such as Haryanvi and Bhojpuri), although the collection also contains more distant relatives such as Bengali and Gujarati.
- They are all primarily spoken in (North) India (Bengali is also spoken in Bangladesh).
- All except Sanskrit are living languages.

Data

Categorised by the NLP resources already available for them, the languages fall into three bands:
* Band 1 languages: Hindi, Panjabi, Gujarati, Bengali, Nepali. These languages already have other large standard datasets available. Kavita Kosh may have very little data for these languages.
* Band 2 languages: Bhojpuri, Magahi, Awadhi, Braj. These languages attract growing interest and have some datasets, which are relatively small compared to Band 1 language resources.
* Band 3 languages: All other languages in the collection, which were previously zero-resource languages. These are the languages for which this dataset is the most relevant.

Script

This dataset is entirely in Devanagari. Content in languages not written in Devanagari (such as Bengali and Gujarati) has been transliterated by the Kavita Kosh Project.

Format

The dataset contains one text file per language, each holding that language's folksongs. Folksongs are separated from each other by an empty line. The first line of each piece is the title of the folksong, and line separation within folksongs is preserved.
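
The format described above (empty line between folksongs, title on the first line of each piece) can be read with a few lines of Python. This is a minimal sketch, not part of the dataset's own tooling: the function name `parse_folksongs` is hypothetical, and it assumes the files are UTF-8 text and that a line containing only whitespace counts as an empty separator.

```python
from typing import List, Tuple


def parse_folksongs(text: str) -> List[Tuple[str, List[str]]]:
    """Split the contents of one language file into (title, body_lines) pairs.

    Assumption: folksongs are separated by one or more empty lines and the
    first non-empty line of each block is the title.
    """
    songs: List[Tuple[str, List[str]]] = []
    block: List[str] = []
    for line in text.splitlines():
        if line.strip():
            # Non-empty line: belongs to the current folksong.
            block.append(line)
        elif block:
            # Empty line closes the current folksong.
            songs.append((block[0], block[1:]))
            block = []
    if block:
        # Last folksong when the file does not end with an empty line.
        songs.append((block[0], block[1:]))
    return songs
```

To use it on a file from the archive, read the file with `open(path, encoding="utf-8").read()` (the encoding is an assumption) and pass the resulting string to `parse_folksongs`.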
dc.language Baiga
dc.language Himachali
dc.language Khadi Boli
dc.language.iso hin
dc.language.iso mar
dc.language.iso mag
dc.language.iso awa
dc.language.iso bho
dc.language.iso bra
dc.language.iso bgc
dc.language.iso raj
dc.language.iso kfq
dc.language.iso gbm
dc.language.iso hne
dc.language.iso bhb
dc.language.iso san
dc.language.iso anp
dc.language.iso bns
dc.language.iso kfy
dc.language.iso bhd
dc.language.iso ben
dc.language.iso guj
dc.language.iso pan
dc.language.iso noe
dc.language.iso bjj
dc.language.iso mup
dc.language.iso mis
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.publisher Kavita Kosh Project
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.subject dialect continuum
dc.subject dialect variation
dc.subject Indic
dc.subject Indo-Aryan
dc.subject Indian
dc.subject Hindi
dc.title HinDialect 1.1: 26 Hindi-related languages and dialects of the Indic Continuum in North India
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
contact.person Niyati Bafna Universität des Saarlandes
356037 tokens
files.size 1033077
files.count 1

 Files in this item

1008.86 KB
Zip archive
  • HinDialect 1.1
    • korku-kfq.txt (213 kB)
    • sanskrit-san.txt (3 kB)
    • panjabi-pan.txt (843 kB)
    • baiga-mis.txt (168 kB)
    • marathi-mar.txt (21 kB)
    • himachali-mis.txt (5 kB)
    • braj-bra.txt (116 kB)
    • nimadi-noe.txt (183 kB)
    • kumaoni-kfy.txt (13 kB)
    • hindi-hin.txt (1 kB)
    • awadhi-awa.txt (65 kB)
    • haryanvi-bgc.txt (616 kB)
    • rajasthani-raj.txt (96 kB)
    • gujarati-guj.txt (22 kB)
    • bhojpuri-bho.txt (257 kB)
    • garhwali-gbm.txt (413 kB)
    • bhadrawahi-bhd.txt (12 kB)
    • magahi-mag.txt (462 kB)
    • khadi_boli-mis.txt (56 kB)
    • angika-anp.txt (274 kB)
    • chhattisgarhi-hne.txt (374 kB)
    • bundeli-bns.txt (352 kB)
    • bengali-ben.txt (11 kB)
    • kanauji-bjj.txt (4 kB)
    • bhili-bhb.txt (339 kB)
    • malvi-mup.txt (129 kB)
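
The file names in the listing above appear to follow the pattern `<language>-<code>.txt`, where the code matches the `dc.language.iso` entries of this record and `mis` (the ISO 639 code for uncoded languages) is used for Baiga, Himachali and Khadi Boli. Assuming that pattern holds for every file, a small helper can recover the language name and code from a file name. The name `split_filename` is hypothetical and the pattern is inferred from the listing, not documented by the dataset.

```python
from pathlib import PurePosixPath
from typing import Tuple


def split_filename(name: str) -> Tuple[str, str]:
    """Return (language, code) for a name such as 'khadi_boli-mis.txt'.

    Assumption: the name is '<language>-<code>.txt', with underscores
    standing in for spaces in multi-word language names.
    """
    stem = PurePosixPath(name).stem            # e.g. 'khadi_boli-mis'
    language, _, code = stem.rpartition("-")   # split at the last hyphen
    return language.replace("_", " "), code
```

Because several files share the code `mis`, the language part of the file name, not the code, is the reliable key when indexing the files.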
