Common Voice Scripted Speech 23.0 - Dameli
Locale: dml
Size: 85.65 MB
Task: ASR
Format: MP3
License: CC-0
Dameli (dml)
This datasheet is for version 23.0 of the Mozilla Common Voice Scripted Speech dataset for Dameli (dml). The dataset contains 11 hours of recorded speech (11 hours validated) from 5 speakers.
Language
Dameli is one of the most vulnerable languages of Pakistan. It is spoken in a few remote villages (Asper, Dondidari, Ponagram and Shintari) and the surrounding hamlets in Damel, a side valley in the mountainous north of Chitral district, Khyber Pakhtunkhwa province. The small number of speakers (about 6,500 in total) makes this vulnerability more acute. UNESCO's Atlas of the World's Languages in Danger lists Dameli as “severely endangered” (Elnazarov, 2010). The entry on Dameli was contributed by Hakim Elnazarov and was based on information in Decker (1992).
Variants
The dataset represents a wide range of Dameli language usage across different domains. The collected material includes sentences and expressions from:
* Economic life
* Social interactions
* Education
* Agriculture and farming
* Poetry and oral traditions
* History and culture
This range ensures that the dataset captures the richness of the Dameli language, reflecting not only everyday communication but also specialized fields and traditional knowledge.
Demographic information
The dataset includes the following distribution of age and gender.
Gender
Self-declared gender information. Frequency refers to the number of clips annotated with this gender.
Age
Self-declared age information. Frequency refers to the number of clips annotated with this age band.
Text corpus
The corpus consists of 5,670 sentences in the Dameli language. The data was collected from multiple sources, including published books in Dameli, community-written materials, and newly created sentences designed to reflect everyday use of the language. The aim of compiling this corpus is to represent a wide range of topics such as social life, education, agriculture, economy, poetry, farming, and history. This balanced collection provides a valuable resource for linguistic analysis, documentation, and language technology development.
Writing system
The Dameli corpus is written using the Arabic script (Perso-Arabic style), which is commonly used for many regional languages in Pakistan. The writing system has been adapted to represent Dameli sounds, with some additional diacritics and letters used where necessary to capture specific phonetic distinctions.
Symbol table
آ ا ب پ ت ٹ ث ج چ ڇ څ ح خ د ڈ ذ ر ڑ ز ڙ ژ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ݨ و ہ ھ ء ی ے
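One way to use the symbol table is as a sanity check on text before training or evaluation. The following is a minimal sketch, assuming the table above is the complete intended character inventory and that whitespace should be ignored; the function name `unexpected_chars` is hypothetical and not part of any Common Voice tooling.

```python
import unicodedata

# Symbol table copied from the datasheet; spaces are separators, not symbols.
SYMBOL_TABLE = "آ ا ب پ ت ٹ ث ج چ ڇ څ ح خ د ڈ ذ ر ڑ ز ڙ ژ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ݨ و ہ ھ ء ی ے"

# Normalize to NFC so composed and decomposed forms compare equal.
ALLOWED = set(unicodedata.normalize("NFC", SYMBOL_TABLE)) - {" "}


def unexpected_chars(sentence: str) -> set[str]:
    """Return the characters of `sentence` that are not in the symbol table.

    Whitespace is ignored. The name of this helper is hypothetical.
    """
    normalized = unicodedata.normalize("NFC", sentence)
    return {ch for ch in normalized if not ch.isspace() and ch not in ALLOWED}
```

Any character outside the table, such as a Latin letter, a digit, or a combining vowel mark, is reported. Whether such characters should be removed, mapped, or kept is a decision for the user of the dataset.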
Sample
There follows a randomly selected sample of five sentences from the corpus. ماں نم حیات درو ماں گرم ساں نم نعیلہ درو ائی دامن ایک مس آݜنتہ ینُم ما کُل آسپرہ درو دامن لے شُباں درو
Sources
The text corpus was compiled from the following sources:
* Published books in the Dameli language
* Unpublished community manuscripts and notes
* Folk stories, oral traditions, and poetry transcribed into written form
* Newly created sentences for grammar and vocabulary coverage
* Educational and social materials produced by local speakers
Text domains
General, Agriculture and Food, Finance, Service and Retail, Healthcare, History, Law and Government, Media and Entertainment, Nature and Environment, Language Fundamentals (e.g. Digits, Letters, Money)
Processing
The collected texts were first gathered from books, manuscripts, and community contributions. Additional sentences were created to cover gaps in vocabulary and grammar. All materials were transcribed into a consistent format using the Arabic (Perso-Arabic) script adapted for Dameli. The data was then carefully reviewed and proofread to remove errors and ensure accuracy. Finally, the sentences were digitized, standardized, and compiled into a single corpus of 5,670 sentences for use in research and language development.
Recommended post-processing
Users of this dataset may consider the following post-processing steps depending on their research goals:
* Normalization: Ensure consistent spelling, especially where multiple variants of the same word exist.
* Tokenization: Segment the text into words or morphemes for computational use.
* POS tagging / annotation: Add part-of-speech or grammatical tags if the dataset will be used for linguistic or NLP applications.
* Transliteration: Convert the Arabic script into Latin script if required for cross-linguistic comparison.
* Alignment: If paired with translations, align Dameli sentences with their equivalents in other languages for bilingual analysis.
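The normalization and tokenization steps could be sketched as follows. This is a minimal illustration, not a prescribed pipeline: it assumes that Unicode NFC normalization and whitespace splitting are acceptable starting points for Dameli text, and the punctuation set and function names are assumptions made for this example. It does not resolve spelling variants, which would need a language-specific mapping.

```python
import re
import unicodedata

# Punctuation to drop before splitting. This set is an assumption,
# not something specified by the datasheet.
PUNCT = "،؛؟۔.!?,:;\"'()"


def normalize(text: str) -> str:
    """Apply Unicode NFC and collapse runs of whitespace to one space."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Split normalized text into whitespace-delimited word tokens."""
    # Replace each punctuation mark with a space so it never glues words together.
    cleaned = normalize(text).translate({ord(c): " " for c in PUNCT})
    return cleaned.split()
```

NFC is used here so that a letter typed as a base character plus a combining mark compares equal to its precomposed form. Morpheme-level segmentation, POS tagging, transliteration, and alignment are outside the scope of this sketch.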
Community links
As internet access is limited in the Dameli Valley, most local communication takes place through community gatherings, cultural events, and village meetings. However, Dameli people living in cities and outside the valley stay connected online. They maintain a WhatsApp group called “Anjuman Taraqi Damyan Basha”, where members share poetry, cultural materials, news, and language-related resources. In this way, both offline and online platforms help keep the community connected and engaged in language preservation.
Discussions
There are no formal online forums or blogs for discussions related to the dataset. Instead, most of the discussion and coordination took place in the WhatsApp group “Anjuman Taraqi Damyan Basha”, where community members exchanged ideas, shared poetry and cultural materials, and contributed to decisions during the dataset creation process.
Datasheet authors
https://commonvoice.mozilla.org/dml/speak
Funding
This project was funded by the Common Voice Foundation, and we are deeply grateful for their support in making this work possible. The collected materials were converted into individual sentences by Mr. Meesum Alam, whose guidance, leadership, and dedication were instrumental in successfully completing the project.
Licence
This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.