Common Voice Scripted Speech 23.0 - Gawar-Bati
Locale: gwt
Size: 96.68 MB
Task: ASR
Format: MP3
License: CC-0
گوَرباتی - Gawarbaiti (gwt)
This datasheet is for version 23.0 of the Mozilla Common Voice Scripted Speech dataset
for Gawarbaiti (gwt). The dataset contains 13 hours of recorded
speech (13 hours validated) from 5 speakers.
Language
Gawar-Bati (گوار باتی) is an Indo-Aryan language spoken mainly in the Chitral District of Khyber Pakhtunkhwa, Pakistan, especially in the Arandu Valley along the Afghan border, and also across the border in Nuristan Province, Afghanistan.
Region: Arandu (Chitral, Pakistan) and nearby parts of Afghanistan
Speakers: Around 9,000–15,000 (estimates vary)
Language family: Indo-Aryan → Dardic → Chitral subgroup
Script: Mostly spoken; when written, it sometimes uses the Arabic script (like Urdu).
Status: Considered an endangered language, as many speakers also use Khowar, Pashto, or Urdu.
It is closely related to Khowar (Chitrali) but distinct, with its own vocabulary and sound system. Locally, it is an important marker of identity for the Arandu people.
Variants
Gawar-Bati is a relatively small and localized language, so the number of distinct varieties (dialects) is limited compared to larger languages. Linguistic descriptions distinguish two regional varieties:
Pakistan (Arandu variety): Spoken mainly in Arandu village and surrounding hamlets in Lower Chitral. This variety is somewhat influenced by Khowar (the dominant Chitrali language) and Urdu, since most Gawar-Bati speakers are bilingual.
Afghanistan (Nuristan variety): Spoken across the border in eastern Nuristan. This variety shows stronger influence from Pashto and neighboring Nuristani languages.
Both varieties are mutually intelligible and considered part of the same language. Differences are mostly in loanwords, pronunciation, and some expressions, shaped by contact with surrounding majority languages (Khowar in Pakistan, Pashto/Nuristani in Afghanistan). No standardized written form exists, so variation is mostly oral.
In a dataset context, if both Pakistan and Afghanistan recordings are included, they should be treated as two regional varieties of Gawar-Bati.
Demographic information
The dataset includes the following distribution of age and gender.
Gender
Self-declared gender information; frequency refers to the number of clips annotated with this gender.
Age
Self-declared age information; frequency refers to the number of clips annotated with this age band.
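The per-gender and per-age frequencies described above are clip counts, so they can be recomputed from clip-level metadata. The following is a minimal sketch, assuming the release ships a tab-separated metadata file with `gender` and `age` columns, as other Common Voice releases do. The helper name `count_by_column`, the demo rows, and the label values in them are hypothetical and are not taken from this dataset.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_file, column):
    """Count clips per self-declared value in `column` (one row = one clip).

    Rows with an empty value are grouped under "undeclared". The column
    names are an assumption based on other Common Voice releases.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        value = (row.get(column) or "").strip()
        counts[value or "undeclared"] += 1
    return counts


# Tiny in-memory stand-in for a metadata file (hypothetical rows).
demo = io.StringIO(
    "client_id\tpath\tgender\tage\n"
    "a1\tclip1.mp3\tmale_masculine\ttwenties\n"
    "a1\tclip2.mp3\tmale_masculine\ttwenties\n"
    "b2\tclip3.mp3\t\tthirties\n"
)
gender_counts = count_by_column(demo, "gender")
demo.seek(0)  # rewind so the same data can be read again
age_counts = count_by_column(demo, "age")
print(gender_counts)
print(age_counts)
```

With a real release, the same function can be called on a file opened with `open(path, encoding="utf-8", newline="")`, where `path` points to whichever metadata file the download contains.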
Text corpus
The Gawar-Bati text corpus consists of spoken language materials that have been transcribed into text, since the language is primarily oral and has no long-standing written tradition. The texts are collected from native speakers living in Arandu (Chitral, Pakistan) and adjacent areas of Nuristan, Afghanistan.
Sources of text: Oral narratives (folk tales, local history, personal stories); everyday conversations (family life, village activities, greetings, dialogues); procedural/discursive speech (e.g., describing how to farm, cook, or build); songs and prayers (where available); elicited materials (structured interviews, word lists, example sentences for grammar study).
Size of corpus: The exact size depends on the project, but typically ranges from a few thousand sentences in small community-based documentation efforts up to tens of thousands of sentences in larger research-oriented corpora. Many datasets are relatively small (under 50,000 tokens), reflecting the endangered status of the language and the difficulty of collecting large volumes of text.
Languages involved: Since many speakers are bilingual, the corpus may also include code-switching with Khowar, Pashto, or Urdu. Metadata usually indicates whether a text is "pure Gawar-Bati" or mixed.
Varieties represented: Primarily the Arandu (Pakistan) variety and the Nuristan (Afghanistan) variety. Variation is noted in lexical choice and borrowed vocabulary.
Annotation status: Some corpora include interlinear glossing, morphological tagging, or English/Urdu translations. Others are only raw transcriptions.
Writing system
Since Gawar-Bati is traditionally an oral language, it does not have a long-established writing system of its own. In the corpus, the writing system used is typically an adapted form of the Arabic script, following conventions similar to Urdu or Pashto orthography.
Script: Perso-Arabic script (Nastaliq style), extended with additional marks where needed.
Representation of sounds: Most consonants are represented using existing Urdu/Pashto letters. Special or non-standard sounds may be approximated with the closest available symbol, or represented using digraphs. Vowels are often marked inconsistently, as is common in Urdu-based writing.
Usage: Writing is mainly phonetic transcription of speech, not a standardized orthography. In some projects, Roman script (Latin alphabet) is also used, especially by linguists, for clearer representation of Gawar-Bati's phonology.
Code-switching: When speakers switch into Urdu, Khowar, or Pashto, those parts are written in the standard spelling of those languages.
In short, the corpus uses Arabic script (Urdu-based) as its primary representation.
Symbol table
ا آ ب پ ت ٹ ث ج ݮ ځ چ ڄ څ ح خ د ڈ ذ ر ڑ ز ژ ݫ س ش ݭ ص ض ط ظ ع غ ف ق ک گ ل ݪ م ن ں ݨ و ہ ھ ء ی ے
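The symbol table can serve as a simple character-level check on sentences before they are used. The following is a minimal sketch, assuming that the table above is the complete letter inventory and that diacritics, digits, punctuation, and spaces should be ignored. The function name `unexpected_characters` is hypothetical.

```python
import unicodedata

# Letters copied from the symbol table above.
SYMBOLS = (
    "ا آ ب پ ت ٹ ث ج ݮ ځ چ ڄ څ ح خ د ڈ ذ ر ڑ ز ژ ݫ س ش ݭ ص ض ط ظ ع غ ف ق "
    "ک گ ل ݪ م ن ں ݨ و ہ ھ ء ی ے"
)
# NFC normalization makes composed and decomposed letter forms compare equal.
ALLOWED = set("".join(unicodedata.normalize("NFC", SYMBOLS).split()))


def unexpected_characters(sentence):
    """Return the sorted letters in `sentence` that are not in the symbol table.

    Only alphabetic characters are checked, so combining marks, digits,
    punctuation, and whitespace are skipped.
    """
    normalized = unicodedata.normalize("NFC", sentence)
    return sorted({ch for ch in normalized if ch.isalpha() and ch not in ALLOWED})


print(unexpected_characters("abc"))
print(unexpected_characters(SYMBOLS))
```

Look-alike letters from other Arabic-script keyboards have different code points from the ones in the table, so a check like this flags them for review instead of silently accepting them.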
Sample
There follows a mock sample of five sentences showing how sentences from a Gawar-Bati corpus could look. Since Gawar-Bati has no fully standardized orthography, each sentence is given in three forms: the Arabic-script form (Urdu-based), a Romanized transcription, and an English translation.
1. گورباتی آں ݪمیم
   Gawar-bati ā čī mam.
   I speak Gawar-Bati.
2. سومی ہرنوہ درمیک۔
   Šumā haranua Daremek.
   We live in Arandu.
3. تو کینے روان تہنیس؟
   To kany rawān thnais?
   Where are you going?
4. کوڑ مسالہ وودہ تھنہ.
   kawar misala woda thna.
   The food is very spicy.
5. نکہ ٸے کتاب ژوس.
   Nikae kitāb xwos.
   The child read a book.
Text domains
General, Agriculture and Food, Healthcare, History, Law and Government, Media and Entertainment, Nature and Environment, Language Fundamentals (e.g. Digits, Letters, Money)
Processing
The following processing steps are typically carried out to create a Gawar-Bati corpus (and similar endangered-language corpora):
1. Data collection: Audio recordings of natural conversations, narratives, songs, and elicited materials were gathered from native speakers in Arandu (Pakistan) and Nuristan (Afghanistan). Recordings were made using portable digital recorders in village settings, households, and interviews. Metadata (speaker age, gender, location, context) was logged at the time of collection.
2. Transcription: The spoken materials were transcribed into text. Transcriptions used a modified Arabic script (similar to Urdu) for community readability. In parallel, many texts were also Romanized for linguistic accuracy, especially to capture sounds not well represented in Urdu orthography.
3. Translation: Each text was translated into Urdu (for local use) and English (for research use). In some cases, word-by-word glosses were added to support grammatical analysis.
4. Normalization and cleaning: Non-Gawar-Bati segments (code-switching into Urdu, Pashto, or Khowar) were marked. Inconsistent spellings were normalized while still preserving the original transcription in a separate layer. Utterance boundaries were standardized into sentence-level units.
5. Annotation (where available): Some portions were annotated with morphological glossing (e.g., root + suffix breakdown).
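The normalization step above keeps the original transcription in a separate layer and marks non-Gawar-Bati segments. The following is a minimal sketch of one way such a record could be represented. The class names, field names, and language tags are hypothetical, and the strings are placeholders, not real transcriptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """A span of an utterance with a language tag (tags are assumptions)."""
    text: str
    lang: str = "gwt"  # defaults to the locale code used in this datasheet


@dataclass
class Utterance:
    """One sentence-level unit with parallel transcription layers."""
    original: str    # transcription as first written down
    normalized: str  # spelling-normalized layer used for processing
    segments: List[Segment] = field(default_factory=list)

    def code_switched(self) -> List[Segment]:
        """Return the segments marked as not Gawar-Bati."""
        return [seg for seg in self.segments if seg.lang != "gwt"]


# Placeholder strings stand in for real transcriptions.
utt = Utterance(
    original="spelling-as-heard other-language-phrase",
    normalized="normalized-spelling other-language-phrase",
    segments=[
        Segment("normalized-spelling"),
        Segment("other-language-phrase", lang="ur"),
    ],
)
print([seg.text for seg in utt.code_switched()])
```

Keeping both layers on the same record means a normalization decision can be revisited later without going back to the audio.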
Recommended post-processing
Because the data often comes from oral, non-standardized sources, the following post-processing steps are recommended before use:
1. Orthography consistency check: Gawar-Bati has no fixed writing system, so the same word may appear in multiple spellings. Standardize spelling within the dataset, or define a mapping table between variants.
2. Code-switching identification: Many texts contain Pashto, Urdu, or Khowar borrowings. If the research requires pure Gawar-Bati, mark or filter out code-switched segments. Alternatively, tag them with the language ID for multilingual studies.
3. Normalization of transcriptions: Segment long oral narratives into sentence or clause units. Ensure that punctuation (periods, question marks) is applied consistently.
4. Metadata enrichment: Verify speaker information (age, gender, location, bilingual status). Add domain labels (e.g., "Agriculture," "General Conversation," "Storytelling") for easier filtering.
5. Translation verification: Cross-check Urdu and English translations for accuracy. For computational tasks, ensure translations are aligned sentence by sentence.
6. Morphological/grammatical tagging (optional but valuable): Apply part-of-speech tagging or interlinear glossing if the dataset will be used for linguistic analysis. This helps with low-resource NLP tasks like tokenization, POS tagging, and MT.
7. Audio alignment (if recordings are available): Align text with corresponding audio timestamps. This enables speech technology tasks (ASR, TTS) and improves linguistic usability.
In short: before use, standardize spellings, mark code-switching, segment text consistently, enrich metadata, and align translations and audio, depending on the research goals.
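The first three recommendations (variant mapping, code-switching filtering, and sentence segmentation) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names, the `lang` field on records, and the example variant map are hypothetical, and whitespace tokenization is a simplification that will not match tokens with punctuation attached.

```python
import re

# Sentence-final marks: Arabic full stop (U+06D4), Arabic question mark
# (U+061F), and their Latin counterparts, followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[\u06d4\u061f.?!])\s+")


def split_sentences(text):
    """Split running text into sentence units, keeping the final mark."""
    return [part for part in _SENTENCE_END.split(text.strip()) if part]


def normalize_spelling(sentence, variant_map):
    """Replace whole whitespace-delimited tokens using a variant mapping table.

    Runs of whitespace are collapsed to single spaces as a side effect.
    """
    return " ".join(variant_map.get(token, token) for token in sentence.split())


def keep_language(records, lang="gwt"):
    """Keep only records whose (assumed) `lang` tag matches `lang`."""
    return [rec for rec in records if rec["lang"] == lang]


# Placeholder data; real entries would be Gawar-Bati spellings.
variant_map = {"variant_a": "canonical"}
print(split_sentences("one\u06d4 two\u061f three"))
print(normalize_spelling("variant_a  word", variant_map))
print(keep_language([{"text": "a", "lang": "gwt"}, {"text": "b", "lang": "ur"}]))
```

A mapping table keeps the normalization decisions in one reviewable place, so community members can correct an entry without touching the code.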
Community links
The following community hubs and organizations are points of contact for language documentation updates, community discussions, and collaboration opportunities around Gawar-Bati:
Hubbry Community Hub: A community hub that includes a chat channel (e.g., #general) for discussions around the Gawar-Bati language, covering topics like timelines, articles, and upcoming activities.
Forum for Language Initiatives (FLI): A not-for-profit based in Islamabad that has closely collaborated with the Gawar-Bati community. It does not offer a dedicated online chat platform, but it frequently organizes events, workshops, and documentation initiatives, making it a valuable point of contact for community-focused efforts.
Stockholm University Project Page: The "Gawarbati: Documenting a vulnerable linguistic community in the Hindu Kush" project page provides contact details for project leads (e.g., Professor Henrik Liljegren) and shares insights into recorded material, workshops, and documentation achievements. It is a good academic liaison point for deeper collaboration or material access.
Discussions
No public, ongoing discussion threads (such as on Discourse, community forums, or blogs) that specifically focus on the creation of the Gawar-Bati corpus were located. However, the following resources document the project's methodology, progress, and academic outputs, and may serve as useful references or starting points:
Stockholm University Project Page, "Documenting a vulnerable linguistic community in the Hindu Kush": This page outlines the ambitious fieldwork and documentation goals, such as building an annotated audio/video corpus, creating a lexical database, and preserving language data for long-term use.
Stockholm University News, "Documentation of Gawarbati – project completed" (April 9, 2025): This news post recaps the completion of the project (2021–2024), highlights academic outputs such as upcoming dissertations and articles, and reflects on core aspects like sound systems and historical language contacts.
Research Paper, "Preliminary Documentation of Dameli, Gawarbati, Ushojo and Yidgha Languages of Northern Pakistan": Presented in 2019 by the Forum for Language Initiatives, this paper provides a methodology overview: how community relationships were built, speakers were trained in recording/transcription, orthographic discussions were held, and language materials (e.g., an alphabet book, dictionary, CDs of narrated stories with Urdu translations) were developed.
Ogmios Foundation Reports, Orthography Workshop (2014): Organized by the Forum for Language Initiatives, this workshop brought together community members to discuss and develop a basic alphabet and orthography for Gawar-Bati, fostering local engagement and participatory decisions.
Summary of resources and their focus areas:
Stockholm University Project and News Pages: Corpus design, goals, academic outputs
"Preliminary Documentation…" (FLI, 2019): Methodology, community involvement, materials
Ogmios orthography workshop (2014): Orthography development and training
Suggested next steps for readers interested in community discussions or more granular insights into the corpus creation process:
Contact the project team directly, especially key individuals like Professor Henrik Liljegren or PhD student Anastasiia Panova, both associated with Stockholm University.
Engage with the Forum for Language Initiatives (FLI), which has spearheaded much of the field documentation and community-based activities.
Check academic repositories (e.g., ResearchGate or university publication archives) for presentations or project blogs where team members may have shared updates, challenges, or reflections.
Contribute
While there are no active public forums or discussion threads dedicated to the creation of the Gawar-Bati corpus, several resources document the development of and contributions to the dataset.
Academic and institutional contributions (described in more detail under Discussions):
1. Stockholm University Project Page, "Documenting a vulnerable linguistic community in the Hindu Kush"
2. Stockholm University News, "Documentation of Gawarbati – project completed" (April 9, 2025)
3. Research Paper, "Preliminary Documentation of Dameli, Gawarbati, Ushojo and Yidgha Languages of Northern Pakistan" (Forum for Language Initiatives, 2019)
4. Ogmios Foundation Reports, Orthography Workshop (2014)
Research contributions:
Henrik Liljegren, "Gawarbati" (Journal of the International Phonetic Association): This article discusses the phonetic features of Gawar-Bati, acknowledging contributions from the Forum for Language Initiatives and Studio ONE in Islamabad for their assistance in recording Pakistani Gawar-Bati speakers.
Anastasiia Panova, "Locative and existential predication in Gawarbati": This paper presents a contrastive analysis of locative and locational-existential clauses in a spoken corpus of Gawar-Bati, comparing figure-ground constructions in geographically close languages.
Additional resources:
Gawar-Bati Language on Wikipedia: The Wikipedia page provides an overview of the Gawar-Bati language, including its classification, phonology, and orthography.
Gawar-Bati Language on Endangered Languages Project: This page offers insights into the vitality and usage of the Gawar-Bati language, highlighting its status as a viable and relatively vital minority language.
Datasheet authors
The following authors are associated with the creation of the Gawar-Bati corpus (name, email address):
Henrik Liljegren, henrik.liljegren@su.se
Anastasia Panova, anastasia.panova@su.se
Fakhruddin Akhunzada, fakhruddin@fli-online.org
Abdullah Soan, info@fli-online.org
Naseem Haider, info@fli-online.org
Fazle Akbar, fazleakbarfiza@gmail.com
Notes:
Henrik Liljegren is a professor at Stockholm University and the principal investigator of the Gawar-Bati documentation project.
Anastasia Panova is a PhD student at Stockholm University who contributed to the project through her dissertation on Gawar-Bati grammar.
Fakhruddin Akhunzada is associated with the Forum for Language Initiatives and co-authored the paper "Preliminary Documentation of Dameli, Gawarbati, Ushojo and Yidgha Languages of Northern Pakistan."
Abdullah Soan is a native speaker of Gawar-Bati and contributed to the phonetic analysis of the language.
Naseem Haider is affiliated with the Forum for Language Initiatives and co-authored works on the Palula language.
Fazle Akbar is a native speaker of Gawar-Bati and contributed to the phonetic analysis of the language.
Citation guidelines
The following BibTeX entry can be used to cite the Gawar-Bati corpus or related publications about it:
@misc{liljegren2025gawarbati,
  title = {Gawar-Bati: Documenting a Vulnerable Linguistic Community in the Hindu Kush},
  author = {Liljegren, Henrik and Panova, Anastasiia and Akhunzada, Fakhruddin},
  year = {2025},
  howpublished = {\url{https://www.su.se/english/research/research-projects/gawarbati-documenting-a-vulnerable-linguistic-community-in-the-hindu-kush}},
  note = {Accessed: 2025-09-02}
}
A citation for the preliminary documentation paper by the Forum for Language Initiatives:
@article{akhunzada2019preliminary,
  title = {Preliminary Documentation of Dameli, Gawarbati, Ushojo and Yidgha Languages of Northern Pakistan},
  author = {Akhunzada, =
Funding
The main sources of funding and support for the Gawar-Bati corpus were academic and grant-based. The creation of the corpus was supported by:
Stockholm University, Department of Linguistics: project funding, research facilities, and fieldwork support.
Swedish Research Council (Vetenskapsrådet): research grants supporting field documentation and corpus development.
Forum for Language Initiatives (FLI), Pakistan: logistical support, community engagement, and local coordination.
Meesum Alam Sir.
Licence
This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.