Common Voice Scripted Speech 23.0 - Khowar

Locale: khw

Size: 158.45 MB

Task: ASR

Format: MP3

License: CC-0


کھوار — Khowar (khw)

This datasheet is for version 23.0 of the Mozilla Common Voice Scripted Speech dataset for Khowar (khw). The dataset contains 21 hours of recorded speech (18 hours validated) from 49 speakers.

Language

The Khowar-speaking people are the largest group in Chitral, and their language also serves as a lingua franca in the valley. The language is also known as Qashqari by Pashto speakers. It is classified within the Indo-Aryan branch of the Indo-European family. Besides Chitral, Khowar is also spoken in Gilgit-Baltistan and the Swat Valley. The estimated number of Khowar speakers across all regions is more than 600,000, with 400,000 in Chitral alone. Khowar is a written language, with books, magazines, radio and TV programs, and audio/video documentation. It has been included in the school curriculum since 2017.

Variants

The variety used in the dataset is the Chitral variety.

Demographic information

The dataset includes the following distribution of age and gender.

Gender

Self-declared gender information. Frequency refers to the number of clips annotated with this gender.

Age

Self-declared age information. Frequency refers to the number of clips annotated with this age band.

Text corpus

The text comes from books published by FLI and its partner organisations. I also wrote around 1000 sentences of my own.

Writing system

I used the standard writing system, which is Perso-Arabic with standard symbols for the specific sounds of Khowar.

Symbol table

In addition to all the letters of Urdu, we use the following additional letters: ݱ ݰ ݯ ځ څ
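As a minimal sketch, the five additional letters can be used to check which Khowar-specific characters a sentence contains and to print their Unicode code points. The names `KHOWAR_EXTRA_LETTERS`, `find_extra_letters` and `describe_letters` are hypothetical helpers introduced for illustration only; they are not part of any Common Voice tooling.

```python
import unicodedata

# The five additional letters listed in the symbol table above.
KHOWAR_EXTRA_LETTERS = ("ݱ", "ݰ", "ݯ", "ځ", "څ")


def find_extra_letters(text):
    """Return the set of Khowar-specific letters that occur in `text`."""
    return {ch for ch in text if ch in KHOWAR_EXTRA_LETTERS}


def describe_letters(letters=KHOWAR_EXTRA_LETTERS):
    """Return one 'U+XXXX NAME' string per letter, using Python's Unicode database."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in letters
    ]


if __name__ == "__main__":
    for line in describe_letters():
        print(line)
```

A check like this may help when verifying that a font, keyboard layout or text normalisation step preserves these letters rather than replacing them with similar-looking Urdu characters.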

Sample

There follows a randomly selected sample of five sentences from the corpus. دُورہ گیتی اوہتی انوس خلفو دُوروت راہی اریر پُھک خور مڑاغ نسان اریر۔ خلفو دُورہ تورتائے۔ جم بیلیوت خلیفہ دُورہ تان استائے۔ بشاراحولار اچی مڑاغو تو پروشٹہ لاکھیتائے۔ چھوچیو ݱوت وخت اوشوئے۔ داد ا وارانگو کریمتو دیتی ویرومو سورا نیشی استائے۔ہوستو لنگرآبادو نانو ووݰکی درونگیتائے۔ لنگرآبادو نان ہوستہ ویڅھیتائے۔ خور دادو خیرو خبرو گانیتائے۔جم کیہ مہ ژور جم ہنون جم اوریتام تہ پونگ جم بیرائے، اورارو غم بیراؤ بیہیل،بیہیل،بیہیل نانی دی لنگرآبادو نانو سُم بشار احوال کوری ریتائے کی اوا دی جم اوریتام خور انوس کو س دی پونگہ پھاتوکتو دار دیتی روشتی خوماؤ اوشوئے۔ کیہ بیہیل ہوؤ انگاہ نو بیتی گرانیش بیرو بیرائے، تہ پونگ جم تہ دادا دی ہݰ ریتائے بیہیل،بیہیل۔

Sources

  1. Angrestan by Zafar Ullah Pervaz
  2. Robinson Cruso, by Fardi
  3. Oraya by Farid
  4. Translation of MTB MLE material in Khowar by FLI
  5. Khowar Material by Farid
  6. Human and Children Rights Translation by Farid
  7. Khowar Folktales by Zahoor
  8. 100 sentences by myself

Text domains

General

Processing

  1. Collected soft copies of the books and obtained copyright waivers from the authors.
  2. Put the sentences in an Excel sheet and reviewed them for length and correctness.
  3. Sent the sentences to Meesum Alam. In my own case, I uploaded the sentences directly.
  4. The sentences were recorded by different people.
  5. The sentences were validated by different people.
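The length review described above was done by hand in a spreadsheet. As a rough sketch only, a similar check could be automated as follows. The function name `split_by_length` and the `max_words=14` threshold are assumptions made for illustration; the datasheet does not state which limit was actually applied.

```python
def split_by_length(sentences, max_words=14):
    """Split sentences into those within the word limit and those to review.

    `max_words` is an assumed threshold, not a value taken from the datasheet.
    Blank lines and exact duplicates are skipped.
    """
    kept, flagged, seen = [], [], set()
    for raw in sentences:
        # Collapse runs of whitespace so that spacing does not affect the count.
        sentence = " ".join(raw.split())
        if not sentence or sentence in seen:
            continue
        seen.add(sentence)
        if len(sentence.split()) <= max_words:
            kept.append(sentence)
        else:
            flagged.append(sentence)
    return kept, flagged
```

Sentences returned in the second list would still need a human reviewer to shorten or split them, since whitespace-based word counts are only an approximation for Perso-Arabic text.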

Recommended post-processing

We should have the authority to upload and edit text sentences.

Datasheet authors

Already mentioned in the references.

Citation guidelines

Yes, users need to cite the dataset.

Funding

The funding comes from Meesum Alam (email: meesum.alam12@gmail.com).

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.