License: CC0-1.0
Steward: Common Voice
Task: ASR
Release Date: 12/5/2025
Format: MP3
Size: 591.03 MB
A collection of scripted spoken phrases in Palula.
Forbidden Usage
It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.
Intended Use
This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.
This datasheet is for version 24.0 of the Mozilla Common Voice Scripted Speech dataset
for Palula (phl). The dataset contains 21158 clips representing 29.27 hours of recorded
speech (21.52 hours validated) from 20 speakers.
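A few derived figures follow directly from these totals. The sketch below is only arithmetic on the numbers quoted above; the function and key names are illustrative and are not part of any Common Voice tooling.

```python
# Figures quoted in this datasheet (version 24.0, Palula).
TOTAL_CLIPS = 21158
TOTAL_HOURS = 29.27
VALIDATED_HOURS = 21.52
SPEAKERS = 20


def summarize(clips, total_hours, validated_hours, speakers):
    """Derive per-clip and per-speaker averages from the headline totals."""
    return {
        "mean_clip_seconds": total_hours * 3600 / clips,
        "validated_share": validated_hours / total_hours,
        "mean_clips_per_speaker": clips / speakers,
        "mean_hours_per_speaker": total_hours / speakers,
    }


stats = summarize(TOTAL_CLIPS, TOTAL_HOURS, VALIDATED_HOURS, SPEAKERS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

On these figures the average clip is about 5 seconds long and roughly 73.5% of the recorded hours are validated. The per-speaker values are averages only; the datasheet does not state how clips are distributed across the 20 speakers.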
Palula is an Indo-Aryan language of the Dardic group, closely related to Shina. Palula is spoken in several regions of Chitral District, including Ashret, Biori, Kalkatak, and parts of Shishi Koh. Beyond Chitral, it is also spoken in Gumandan (Dir Kohistan) and Sao village in Kunar Province, Afghanistan. The estimated number of speakers ranges between 20,000 and 25,000.
The dataset includes the following distribution of age and gender.
Self-declared gender information. Each percentage refers to the share of clips annotated with that gender.
| Gender | Percentage |
|---|---|
| Undefined | 100.0% |
Self-declared age information. Each percentage refers to the share of clips annotated with that age band.
| Age Band | Percentage |
|---|---|
| Undefined | 6.0% |
| Twenties | 48.0% |
| Thirties | 23.0% |
| Teens | 6.0% |
| Forties | 18.0% |
The Palula text corpus has been systematically compiled through fieldwork conducted within the Palula-speaking community, encompassing a diverse range of speakers across age groups and social backgrounds. Data was gathered through structured and semi-structured interviews, oral storytelling sessions, and community-based linguistic elicitation. All recordings were carefully transcribed. The corpus comprises 4000 sentences.
The Palula orthography is based on the Arabic writing system and uses the following letters:
ا ب پ ت ث ٹ ج چ ڇ څ ح خ د ڈ ذ ر ز ژ ڙ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ں ݨ و ہ ھ ء ی ے
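When preparing transcripts for ASR training, it can help to check which letters in a sentence fall outside this inventory. The following is a minimal sketch, assuming the letter list above is the complete base alphabet; the helper name `outside_alphabet` is hypothetical, and treating diacritics and punctuation as ignorable is a choice made here, not a rule stated in the datasheet.

```python
import unicodedata

# Letter inventory copied from the orthography line of this datasheet.
ALPHABET = set(
    "ا ب پ ت ث ٹ ج چ ڇ څ ح خ د ڈ ذ ر ز ژ ڙ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ں ݨ و ہ ھ ء ی ے".split()
)


def outside_alphabet(sentence):
    """Return the sorted letters of `sentence` that are not in ALPHABET.

    Only characters in a Unicode letter category (L*) are checked, so
    combining diacritics, punctuation, digits and spaces are ignored.
    """
    extra = set()
    for ch in sentence:
        if unicodedata.category(ch).startswith("L") and ch not in ALPHABET:
            extra.add(ch)
    return sorted(extra)


# Example: one of the sample sentences from this datasheet.
print(outside_alphabet("تھی گھوݜٹہ کتی جانہ ہِنہ؟"))
```

Any letters this reports are not necessarily errors: they may be variant code points for the same glyph or letters used in loanwords, so the output is best read as a list of cases to review.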
There follows a randomly selected sample of five sentences from the corpus.
انی گھوݜٹہ کتی کمرے ہنہ؟
اندہ دُو کمرے آک بہ پراجمی دیرہ ہنوۡ۔
تھی انیۡ کُڈ خختمی دِتیۡ ہِنم کی بٹومی؟
انیوے بُٹھے خختم استعمال بھِلم ہِنم، بٹومی کڈ دتی ہِنیۡ۔
تھی گھوݜٹہ کتی جانہ ہِنہ؟
Automatic random samples
سعیدے ساجد بیۡ مقابلہ وے شامل بِھلہ۔
مِیشہ وسیع تھےۡ منِیتوۡ کی مہ کُھنہ یھئی بہ بُٹانہ سمئینی اِزدہ تھہ۔
اسحاقہ عسیٰ تھےۡ تاوان کے دِتیۡ؟
ہیوندہ بِیڈیۡ جڑے بِھلِم۔
خلکِیم تسی بات کاݨ تِھیلیۡ۔ سےۡ بیلچے تھسکُورہ گِھنیۡ گیہ تںِم یاب شرو تِھیلیۡ۔
Palula Matli (Proverbs) – Author: Naseem Haider
Palula Textbook – Author: Naseem Haider
Palula Gul-Dasta Ashaar (Poetry) – Author: Naseem Haider
Palula–Urdu–English Conversation Guide – Author: Naseem Haider
Palula Folk Tales – Author: Naseem Haider
| Domain | Count |
|---|---|
| Undefined | 21120 |
| Automotive Transport | 4 |
| Healthcare | 10 |
| History Law Government | 24 |
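The domain counts can be cross-checked against the clip total. The sketch below simply sums the table; the variable names are illustrative. The sum equals the 21158 clips reported earlier in this datasheet, which suggests the counts are per clip, although the datasheet does not say so explicitly.

```python
# Domain counts copied from the table above.
DOMAIN_COUNTS = {
    "Undefined": 21120,
    "Automotive Transport": 4,
    "Healthcare": 10,
    "History Law Government": 24,
}

total = sum(DOMAIN_COUNTS.values())
labelled = total - DOMAIN_COUNTS["Undefined"]

print(f"total: {total}")
print(f"with a domain label: {labelled} ({labelled / total:.2%})")
for domain, count in DOMAIN_COUNTS.items():
    print(f"{domain}: {count / total:.2%}")
```

Only 38 entries carry a domain label, so the domain field is too sparse to support domain-specific evaluation on its own.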
The process involved collecting texts, stories, and proverbs through community-based audio recordings. These recordings were transcribed, and selected sentences were curated for inclusion in the Common Voice dataset. The finalized dataset was then uploaded and subsequently recorded by multiple speakers.
Naseem Haider naseemhaider78@gmail.com
This dataset was partially funded by the Open Multilingual Speech Fund managed by Mozilla Common Voice.
This dataset is released under the Creative Commons Zero (CC0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.