Common Voice Scripted Speech 23.0 - Phalura

Locale: phl

Size: 228.74 MB

Task: ASR

Format: MP3

License: CC-0


پالولا — Palula (phl)

This datasheet is for version 23.0 of the Mozilla Common Voice Scripted Speech dataset for Palula (phl). The dataset contains 30 hours of recorded speech (22 hours validated) from 20 speakers.

Language

Palula is an Indo-Aryan language of the Dardic group, closely related to Shina. It is spoken in several parts of Chitral District, including Ashret, Biori, Kalkatak, and parts of Shishi Koh. Beyond Chitral, it is also spoken in Gumandan (Dir Kohistan) and in Sao village in Kunar Province, Afghanistan. The number of speakers is estimated at between 20,000 and 25,000.

Variants

The dataset comprises a diverse range of linguistic and cultural materials collected from Palula-speaking communities, reflecting their oral and written traditions. These include:

  • Folk tales: traditional narratives passed down through generations.
  • Cultural and experience stories: narratives that document lived experience, seasonal practices, and social customs.
  • Proverbs and idiomatic expressions: in Palula with Urdu translation.
  • Poetic texts: original and traditional poetry composed in Palula with Urdu translation.

Demographic information

The dataset includes the following distribution of age and gender.

Gender

Self-declared gender information. Frequency refers to the number of clips annotated with this gender.

Age

Self-declared age information. Frequency refers to the number of clips annotated with this age band.
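The frequencies described above can be recomputed from the clip metadata. The following is a minimal sketch, assuming the metadata is a tab-separated file with one row per clip and columns named `gender` and `age`; the column names and the "undeclared" label are assumptions for illustration and are not stated in this datasheet.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_text, column):
    """Count clips per value of `column` in tab-separated metadata text.

    Rows with an empty or missing value are grouped under "undeclared"
    (a label chosen for this sketch).
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return Counter((row.get(column) or "undeclared") for row in reader)


# Tiny in-memory example standing in for a real metadata file.
sample = (
    "path\tgender\tage\n"
    "clip1.mp3\tmale\ttwenties\n"
    "clip2.mp3\t\tthirties\n"
    "clip3.mp3\tmale\t\n"
)
gender_counts = count_by_column(sample, "gender")
age_counts = count_by_column(sample, "age")
```

For a real file, the text would be read with `open(path, encoding="utf-8").read()` and passed to the same function.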

Text corpus

The Palula text corpus was compiled through fieldwork conducted within the Palula-speaking community, covering a diverse range of speakers across age groups and social backgrounds. Data was gathered through structured and semi-structured interviews, oral storytelling sessions, and community-based linguistic elicitation. All recordings were carefully transcribed. The corpus comprises 4,000 sentences.

Writing system

Palula orthography is based on the Arabic writing system.

Symbol table

ا، ب، پ، ت، ث، ٹ، ج، چ، ڇ، څ، ح، خ، د، ڈ، ذ، ر، ز، ژ، ڙ، س، ش، ݜ، ص، ض، ط، ظ، ع، غ، ف، ق، ک، گ، ل، م، ن، ں، ݨ، و، ہ، ھ، ء، ی، ے
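A symbol table like this can be used to flag sentences that contain unexpected letters. The sketch below is illustrative only: the function name is hypothetical, and it assumes that whitespace, punctuation, and combining marks (such as vowel diacritics) should be ignored, since those appear in the corpus but are not listed in the table.

```python
import unicodedata

# Letters copied from the symbol table above.
SYMBOL_TABLE = (
    "ا ب پ ت ث ٹ ج چ ڇ څ ح خ د ڈ ذ ر ز ژ ڙ س ش ݜ ص ض ط ظ ع غ "
    "ف ق ک گ ل م ن ں ݨ و ہ ھ ء ی ے"
)
ALLOWED = set(SYMBOL_TABLE.split())


def chars_outside_table(sentence, allowed=ALLOWED):
    """Return the set of characters in `sentence` that are not in the table.

    Whitespace, punctuation (Unicode categories P*) and combining marks
    (categories M*) are skipped; this is an assumption of the sketch.
    """
    extra = set()
    for ch in sentence:
        if ch.isspace():
            continue
        category = unicodedata.category(ch)
        if category.startswith("P") or category.startswith("M"):
            continue
        if ch not in allowed:
            extra.add(ch)
    return extra
```

A sentence is fully covered by the table when the function returns an empty set; any returned characters are candidates for review or for addition to the table.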

Sample

There follows a randomly selected sample of five sentences from the corpus.

  • انی گھوݜٹہ کتی کمرے ہنہ؟
  • اندہ دُو کمرے آک بہ پراجمی دیرہ ہنوۡ۔
  • تھی انیۡ کُڈ خختمی دِتیۡ ہِنم کی بٹومی؟
  • انیوے بُٹھے خختم استعمال بھِلم ہِنم، بٹومی کڈ دتی ہِنیۡ۔
  • تھی گھوݜٹہ کتی جانہ ہِنہ؟

Sources

  • Palula Matli (Proverbs) – Author: Naseem Haider
  • Palula Textbook – Author: Naseem Haider
  • Palula Gul-Dasta Ashaar (Poetry) – Author: Naseem Haider
  • Palula–Urdu–English Conversation Guide – Author: Naseem Haider
  • Palula Folk Tales – Author: Naseem Haider

Text domains

General, Agriculture and Food, Healthcare, History, Law and Government, Media and Entertainment, Nature and Environment

Processing

The process involved collecting texts, stories, and proverbs through community-based audio recordings. These recordings were transcribed, and selected sentences were curated for inclusion in the Common Voice dataset. The finalized sentences were then uploaded and subsequently recorded by multiple speakers.
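One step of such a pipeline, turning a transcribed passage into a list of candidate sentences, might be sketched as follows. This is not the authors' actual procedure: it assumes sentences end with the Urdu full stop (U+06D4) or the Arabic question mark (U+061F), the two terminators seen in the sample sentences, and it drops exact duplicates. The function name is hypothetical.

```python
import re

# A run of non-terminator characters, optionally followed by one terminator:
# U+06D4 (Urdu full stop) or U+061F (Arabic question mark).
_SENTENCE = re.compile("[^\u06D4\u061F]+[\u06D4\u061F]?")


def split_sentences(text):
    """Split transcribed text into sentences and drop exact duplicates.

    The first occurrence of each sentence is kept, in original order.
    """
    candidates = (match.strip() for match in _SENTENCE.findall(text))
    non_empty = (sentence for sentence in candidates if sentence)
    # dict.fromkeys preserves insertion order while removing duplicates.
    return list(dict.fromkeys(non_empty))
```

Real curation would likely involve further manual review, for example checking spelling and sentence length, which this sketch does not attempt.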

Recommended post-processing

The dataset should remain accessible to everyone, so that anyone can add sentences and upload recordings.

Community links

https://www.palulacommunity.org

Discussions

https://www.palulacommunity.org

Contribute

https://www.palulacommunity.org

Datasheet authors

Naseem Haider naseemhaider78@gmail.com

Funding

Meesum Alam Meesum.alam12@gmail.com

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.