Common Voice Scripted Speech 23.0 - Phalura
Locale: phl
Size: 228.74 MB
Task: ASR
Format: MP3
License: CC-0
پالولا - Palula (phl)
This datasheet is for version 23.0 of the Mozilla Common Voice Scripted Speech dataset for Palula (phl). The dataset contains 30 hours of recorded speech (22 hours validated) from 20 speakers.
Language
Palula is an Indo-Aryan language of the Dardic group, closely related to Shina. It is spoken in several parts of Chitral District, including Ashret, Biori, Kalkatak, and parts of Shishi Koh. Beyond Chitral, it is also spoken in Gumandan (Dir Kohistan) and in Sao village in Kunar Province, Afghanistan. The number of speakers is estimated at between 20,000 and 25,000.
Variants
The dataset comprises a diverse range of linguistic and cultural materials collected from Palula-speaking communities, reflecting the richness of their oral and written traditions. These materials include:
Folk tales: traditional narratives passed down through generations.
Cultural and experience stories: narratives that document lived experience, seasonal practices, and social customs.
Proverbs and idiomatic expressions: in Palula, with Urdu translation.
Poetic texts: original and traditional poetry composed in Palula, with Urdu translation.
Demographic information
The dataset includes the following distribution of age and gender.
Gender
Self-declared gender information. Frequency refers to the number of clips annotated with this gender.
Age
Self-declared age information. Frequency refers to the number of clips annotated with this age band.
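The per-clip frequencies described above can be recomputed from a release's clip metadata. The following is a minimal sketch, assuming the metadata is a tab-separated file with `gender` and `age` columns. The column names, the sample rows, and the `count_by_column` helper are illustrative assumptions, not taken from this datasheet.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_text: str, column: str) -> Counter:
    """Count clips per value of `column`; empty cells are grouped as 'undeclared'."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return Counter((row[column] or "undeclared") for row in reader)


# Invented rows for illustration only; these are not real dataset records.
SAMPLE_TSV = (
    "client_id\tpath\tage\tgender\n"
    "spk_a\tclip_1.mp3\ttwenties\tmale\n"
    "spk_a\tclip_2.mp3\ttwenties\tmale\n"
    "spk_b\tclip_3.mp3\tthirties\tfemale\n"
    "spk_c\tclip_4.mp3\t\t\n"
)

gender_counts = count_by_column(SAMPLE_TSV, "gender")
age_counts = count_by_column(SAMPLE_TSV, "age")
print(gender_counts)
print(age_counts)
```

Because the counts are per clip rather than per speaker, a speaker who records many clips weighs more heavily in the distribution.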
Text corpus
The Palula text corpus has been systematically compiled through fieldwork conducted within the Palula-speaking community, encompassing a diverse range of speakers across age groups and social backgrounds. Data was gathered through structured and semi-structured interviews, oral storytelling sessions, and community-based linguistic elicitation. All recordings were carefully transcribed. The corpus comprises 4,000 sentences.
Writing system
Palula orthography is based on the Arabic writing system.
Symbol table
ا، ب، پ، ت، ث، ٹ، ج، چ، ڇ، څ، ح، خ، د، ڈ، ذ، ر، ز، ژ، ڙ، س، ش، ݜ، ص، ض، ط، ظ، ع، غ، ف، ق، ک، گ، ل، م، ن، ں، ݨ، و، ہ، ھ، ء، ی، ے
Sample
There follows a randomly selected sample of five sentences from the corpus.
انی گھوݜٹہ کتی کمرے ہنہ؟
اندہ دُو کمرے آک بہ پراجمی دیرہ ہنوۡ۔
تھی انیۡ کُڈ خختمی دِتیۡ ہِنم کی بٹومی؟
انیوے بُٹھے خختم استعمال بھِلم ہِنم، بٹومی کڈ دتی ہِنیۡ۔
تھی گھوݜٹہ کتی جانہ ہِنہ؟
Sources
- Palula Matli (Proverbs) – Author: Naseem Haider
- Palula Textbook – Author: Naseem Haider
- Palula Gul-Dasta Ashaar (Poetry) – Author: Naseem Haider
- Palula–Urdu–English Conversation Guide – Author: Naseem Haider
- Palula Folk Tales – Author: Naseem Haider
Text domains
General, Agriculture and Food, Healthcare, History, Law and Government, Media and Entertainment, Nature and Environment
Processing
The process involved collecting texts, stories, and proverbs through community-based audio recordings. These recordings were transcribed, and selected sentences were curated for inclusion in the Common Voice dataset. The finalized sentences were then uploaded and subsequently recorded by multiple speakers.
Recommended post-processing
The dataset should remain accessible to everyone, so that anyone can add sentences and upload recordings.
Community links
https://www.palulacommunity.org
Discussions
https://www.palulacommunity.org
Contribute
https://www.palulacommunity.org
Datasheet authors
Naseem Haider naseemhaider78@gmail.com
Funding
Meesum Alam Meesum.alam12@gmail.com
Licence
This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.