Common Voice Scripted Speech 23.0 - Dameli

Locale: dml

Size: 85.65 MB

Task: ASR

Format: MP3

License: CC-0


[Dameli] — Dameli (dml)

This datasheet is for version 23.0 of the Mozilla Common Voice Scripted Speech dataset for Dameli (dml). The dataset contains 11 hours of recorded speech (11 hours validated) from 5 speakers.

Language

Dameli is one of the most vulnerable languages of Pakistan. It is spoken in a few remote villages (Asper, Dondidari, Ponagram and Shintari) and the surrounding hamlets in the side valley called Damel, in the mountainous north of Chitral district, Khyber Pakhtunkhwa province. The small number of speakers (about 6,500 in total) makes this vulnerability more critical. In UNESCO's Atlas of the World's Languages in Danger, Dameli is listed as “Severely endangered” (Elnazarov, 2010). The entry on Dameli was contributed by Hakim Elnazarov and was based on information in Decker (1992).

Variants

The dataset represents a wide range of Dameli language usage across different domains. The collected material includes sentences and expressions from:
* Economic life
* Social interactions
* Education
* Agriculture and farming
* Poetry and oral traditions
* History and culture

This variety ensures that the dataset captures the richness of the Dameli language, reflecting not only everyday communication but also specialized fields and traditional knowledge.

Demographic information

The dataset includes the following distribution of age and gender.

Gender

Self-declared gender information. Frequency refers to the number of clips annotated with this gender.

Age

Self-declared age information. Frequency refers to the number of clips annotated with this age band.
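The gender and age frequencies described above can be recomputed from the clip metadata. The following is a minimal sketch, not an official tool. It assumes the release ships a tab-separated metadata file (for example `validated.tsv`) with `gender` and `age` columns, as is usual for Common Voice releases; this datasheet does not state the file layout, and the function name `clip_frequencies` is hypothetical.

```python
import csv
from collections import Counter


def clip_frequencies(tsv_path: str, column: str) -> Counter:
    """Count clips per self-declared value in `column` (e.g. "gender" or "age")."""
    counts = Counter()
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters inside sentences as plain text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # An empty cell means the speaker did not declare a value.
            value = (row.get(column) or "").strip() or "undeclared"
            counts[value] += 1
    return counts


# Usage (the path is an assumption about the extracted archive):
# print(clip_frequencies("validated.tsv", "gender").most_common())
```

Each row of the metadata file corresponds to one clip, so the counts are clip frequencies, not speaker counts.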

Text corpus

The corpus consists of 5,670 sentences in the Dameli language. The data was collected from multiple sources, including published books in Dameli, community-written materials, and newly created sentences designed to reflect everyday use of the language. The aim of compiling this corpus is to represent a wide range of topics such as social life, education, agriculture, economy, poetry, farming, and history. This balanced collection provides a valuable resource for linguistic analysis, documentation, and language technology development.

Writing system

The Dameli corpus is written using the Arabic script (Perso-Arabic style), which is commonly used for many regional languages in Pakistan. The writing system has been adapted to represent Dameli sounds, with some additional diacritics and letters used where necessary to capture specific phonetic distinctions.

Symbol table

آ ا ب پ ت ٹ ث ج چ ڇ څ ح خ د ڈ ذ ر ڑ ز ڙ ژ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ݨ و ہ ھ ء ی ے
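The symbol table above can be used to flag characters in a sentence that fall outside it. The following is a minimal sketch under stated assumptions: the table is copied from this datasheet, text is compared after Unicode NFC normalization, and the function name `unexpected_characters` is hypothetical. The table lists letters only, so vowel marks and punctuation will be reported as well.

```python
import unicodedata

# Symbol table copied from this datasheet (letters only, space separated).
SYMBOL_TABLE = "آ ا ب پ ت ٹ ث ج چ ڇ څ ح خ د ڈ ذ ر ڑ ز ڙ ژ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ݨ و ہ ھ ء ی ے"

# Compare in NFC so that precomposed and decomposed forms of a letter match.
ALLOWED = {ch for ch in unicodedata.normalize("NFC", SYMBOL_TABLE) if not ch.isspace()}


def unexpected_characters(sentence: str) -> dict[str, str]:
    """Map each non-space character outside the symbol table to its Unicode name."""
    found = {}
    for ch in unicodedata.normalize("NFC", sentence):
        if ch.isspace() or ch in ALLOWED:
            continue
        found[ch] = unicodedata.name(ch, "UNKNOWN")
    return found
```

An empty result means the sentence uses only listed letters. A non-empty result is not necessarily an error; it may simply point to diacritics or punctuation that the table does not enumerate.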

Sample

There follows a randomly selected sample of five sentences from the corpus. ماں نم حیات درو ماں گرم ساں نم نعیلہ درو ائی دامن ایک مس آݜنتہ ینُم ما کُل آسپرہ درو دامن لے شُباں درو

Sources

The text corpus was compiled from the following sources:
* Published books in the Dameli language
* Unpublished community manuscripts and notes
* Folk stories, oral traditions, and poetry transcribed into written form
* Newly created sentences for grammar and vocabulary coverage
* Educational and social materials produced by local speakers

Text domains

General, Agriculture and Food, Finance, Service and Retail, Healthcare, History, Law and Government, Media and Entertainment, Nature and Environment, Language Fundamentals (e.g. Digits, Letters, Money)

Processing

The collected texts were first gathered from books, manuscripts, and community contributions. Additional sentences were created to cover gaps in vocabulary and grammar. All materials were transcribed into a consistent format using the Arabic (Perso-Arabic) script adapted for Dameli. The data was then carefully reviewed and proofread to remove errors and ensure accuracy. Finally, the sentences were digitized, standardized, and compiled into a single corpus of 5,670 sentences for use in research and language development.

Recommended post-processing

Users of this dataset may consider the following post-processing steps, depending on their research goals:
* Normalization: Ensure consistent spelling, especially where multiple variants of the same word exist.
* Tokenization: Segment the text into words or morphemes for computational use.
* POS tagging / annotation: Add part-of-speech or grammatical tags if the dataset will be used for linguistic or NLP applications.
* Transliteration: Convert the Arabic script into Latin script if required for cross-linguistic comparison.
* Alignment: If paired with translations, align Dameli sentences with their equivalents in other languages for bilingual analysis.
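The normalization and tokenization steps listed here could start from something like the sketch below. It is an illustration under assumptions, not a prescribed pipeline: it applies Unicode NFC, collapses whitespace, and splits on spaces after removing a small set of punctuation marks chosen for this example. It does not resolve spelling variants, which would need a Dameli word list that this datasheet does not provide. The names `normalize`, `tokenize`, and `PUNCTUATION` are hypothetical.

```python
import re
import unicodedata

# Assumed punctuation set: Arabic full stop, comma, question mark, plus ASCII marks.
PUNCTUATION = "\u06d4\u060c\u061f!.,?"
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})


def normalize(sentence: str) -> str:
    """Apply Unicode NFC and collapse runs of whitespace into single spaces."""
    text = unicodedata.normalize("NFC", sentence)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(sentence: str) -> list[str]:
    """Replace the assumed punctuation marks with spaces, then split on whitespace."""
    return normalize(sentence).translate(_PUNCT_TO_SPACE).split()
```

Whitespace splitting yields word-level tokens only. Morpheme-level segmentation, POS tagging, and transliteration would need language-specific resources beyond this sketch.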

Community links

As internet access is limited in the Dameli Valley, most local communication takes place through community gatherings, cultural events, and village meetings. However, Dameli people living in cities and outside the valley stay connected online. They maintain a WhatsApp group called “Anjuman Taraqi Damyan Basha”, where members share poetry, cultural materials, news, and language-related resources. In this way, both offline and online platforms help keep the community connected and engaged in language preservation.

Discussions

There are no formal online forums or blogs for discussions related to the dataset. Instead, most of the discussion and coordination took place in the WhatsApp group “Anjuman Taraqi Damyan Basha”, where community members exchanged ideas, shared poetry and cultural materials, and contributed to decisions during the dataset creation process.

Datasheet authors

https://commonvoice.mozilla.org/dml/speak

Funding

This project was funded by the Common Voice Foundation, and we are deeply grateful for their support in making this work possible. The collected materials were converted into individual sentences by Mr. Meesum Alam, whose guidance, leadership, and dedication were instrumental in successfully completing the project.

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.