Common Voice Spontaneous Speech 2.0 - Basaa

Description

A collection of spontaneous spoken phrases in Basaa.

Specifics

Licensing

CC0 1.0 Universal

https://creativecommons.org/publicdomain/zero/1.0/legalcode

Considerations

Forbidden Usage

It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.

Processes

Intended Use

This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.

Metadata

Basaa — Basaa (`bas`)

This datasheet is for version 2.0 of the Mozilla Common Voice Spontaneous Speech dataset for Basaa (bas). The dataset contains 773 clips representing 6 hours of recorded speech (6 hours validated) from 11 speakers.

Language

Basaa is a Narrow Bantu language spoken across a geographical area spanning three administrative regions in Cameroon: the Centre, Littoral and South regions. It is estimated that there are currently around 600,000–700,000 speakers. This figure includes different varieties, as well as diasporic populations who identify as Basaa speakers.

The vitality of the Basaa language is stable (Ethnologue online). However, intergenerational transmission of Basaa is increasingly threatened among parents aged 50 and under, particularly in urban areas.

Although Basaa is taught in schools, this does not significantly impact the vitality of the language, mainly due to the current pedagogical approach, which relies on rule-based and descriptivist teaching methods.

The glossonym 'Basaa' is a generic term that encompasses a range of varieties, the speakers of which may identify with the 'Basaa' label to varying degrees, depending on a complex set of geographical, social, political, situational and pragmatic factors. Whether a language variant is considered Basaa depends greatly on the perspective of the person 'telling the story'. Some of the most commonly acknowledged varieties of Basaa include:

Mbene
Bikok
Babimbi
Basaa ba Omeng
Basaa ba Yabasi
Basaa ba Duala
Ndog-Bikim

Other varieties, such as Ndonga, Mbaa (also known as Mbay-Bati) and Hijuk, may also be classified as Basaa. However, as previously mentioned, not everyone agrees on this classification.

Data splits for modelling

Split	Count
Train	220
Test	291
Dev	261
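The per-split counts can be recomputed from the `split` column of the tsv file. The sketch below is illustrative only: it assumes the column holds lowercase values such as `train`, `test` and `dev`, and that fields are unquoted; the in-memory sample rows are made up and stand in for the real file. Recounting is worthwhile because the three counts listed above sum to 772.

```python
import csv
import io
from collections import Counter


def count_splits(tsv_file):
    """Count rows per value of the `split` column in a tab-separated file."""
    # QUOTE_NONE is an assumption: it keeps double quotes inside prompts
    # and transcriptions from being interpreted as field delimiters.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["split"] for row in reader)


# Tiny in-memory stand-in for the real tsv file (values are invented).
sample = io.StringIO(
    "audio_id\tsplit\n"
    "1\ttrain\n"
    "2\ttest\n"
    "3\tdev\n"
    "4\ttrain\n"
)
split_counts = count_splits(sample)
print(dict(split_counts))
```

To use it on the real data, pass an open file handle (for example one opened with `encoding="utf-8"` and `newline=""`) in place of `sample`.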

Transcriptions

Prompts: 74
Duration: 5:22:34 [h:m:s]
Avg. Transcription Len: 283
Avg. Duration: 25.04 [s]
Valid Duration: 18232.452 [s]
Total hours: 5.38 [h]
Valid hours: 5.06 [h]
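The figures above are related by simple unit conversions. The short sketch below reproduces them from the reported duration, the valid duration in seconds, and the 773 clips mentioned in the Metadata section. The helper names are hypothetical, and rounding to two decimals is an assumption about how the figures were produced.

```python
def hms_to_seconds(hms):
    """Convert an 'h:m:s' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hours(seconds):
    """Convert seconds into hours."""
    return seconds / 3600.0


total_seconds = hms_to_seconds("5:22:34")
total_hours = round(seconds_to_hours(total_seconds), 2)
valid_hours = round(seconds_to_hours(18232.452), 2)
avg_duration = round(total_seconds / 773, 2)  # 773 clips, per the Metadata section

print(total_seconds, total_hours, valid_hours, avg_duration)
```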

Writing system

The prompts and responses in this dataset are written in the Latin alphabet, following the orthography of Protestant missionaries but with modifications introduced by the dataset's author. One such modification is the use of an apostrophe before the symbols 'y' and 'b' to signal nasal prefixes. For example: 'me n'yo': 'stealing palm wine from the palm trunk' (as opposed to 'me nyo': 'drinking'), and 'm'bôñ': 'cassava' (as opposed to 'mbôñ': 'poison'). As a general rule, the apostrophe signals 'accidentals' of a morphological or prosodic nature.

Samples

Questions

There follows a randomly selected sample of questions used in the corpus.

Kii ba nsébél ni hop u basaa le "litat matat ma ntôdôô" ?
Mambee maéba u nla ti babiina i kel libii jap ?
Inyu kii ba nkal le "mahôla ma mbôk bé bak" ?
Ba nla nugna mbôngôô ni mimbee mintén mi bijek ?
Kii ba nsébel njibngañ ?

Responses

There follows a randomly selected sample of transcribed responses from the corpus.

Litat matat ma ntôdôô ni hop u Basaa li yé jam,  ntuk u matjañ. U yé le ba nkal we le u tat yom, ndi u tadak u keblak. Hala nyen ba nsébél le matat ma ntôdôô. Mut nu a ntat bé banga ntat nta. Mut nu a yé le, iyom ba nkal nye le a boñ a m'boñ bé. A m'boñ, a hogok a tjagak kôô i lép kiki ba nkal. Matat ma ntôdôô ma.
Imaéba me nti babiina i kel libii jap ma yé le, mulôm a gwés ñwaa wéé, ñwaa wéé a n'nôgôl nye.
Mahôla ma m'bôk bé bak, hala wee mut a nlama bé gwés man-isañ iloo nyemede. Iba le u nhôla mut, hôla mahôla ma nlama ba ni ngim hihéga. Mut a ngwés bé mut iloo nyemede.
Ibijek ba nla nugna ni mbôngôô, gwo bini le gwôô, masôô, manyogi, manga, bobola, makabô, m'bôñ.
Njibngañ, di nla eeh kal le i yé kiki bo ny bôm-be. Yom i i nlona mut ndutu i nyuu. Yom i i m'boñ mut le a bana mam, ma ma nlona nye ndutu.; ma ma nti bé nye nsañ ; ma ma nla yak  kuuha nye ndutu ngandak. Iyom i yon ba nsébél le njibngañ. Mut nu a gwéé yom kiki bo mbom-be. Bom-be lôñni njibngañ bi nhek pôôna. Njibngañ i nlona mut ndutu ; i m'boñ nye le mam malam ma lôl bañ nye, ndigi mam mabe.

Fields

Each row of a tsv file represents a single audio clip, and contains the following information:

client_id - hashed UUID of a given user
audio_id - numeric id for audio file
audio_file - audio file name
duration_ms - duration of audio in milliseconds
prompt_id - numeric id for prompt
prompt - question for user
transcription - transcription of the audio response
votes - number of people who approved a given transcript
age - age of the speaker [1]
gender - gender of the speaker [1]
language - language name
split - the data-modelling subset (train, test or dev) to which this clip belongs
char_per_sec - how many characters of transcription per second of audio
quality_tags - automated assessments of the transcription-audio pair, separated by |
- transcription-length - fewer than 3 characters per second
- speech-rate - more than 30 characters per second
- short-audio - audio shorter than 2 seconds
- long-audio - audio longer than 30 seconds
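The thresholds listed above can be turned into a small tagging function. This is a minimal sketch and not the pipeline actually used to build the dataset: the function names are hypothetical, and it assumes that characters are counted with Python's `len` on the raw transcription (whitespace and punctuation included) and that all four thresholds are strict inequalities.

```python
def char_per_sec(transcription, duration_ms):
    """Characters of transcription per second of audio (0.0 for empty audio)."""
    seconds = duration_ms / 1000.0
    if seconds <= 0:
        return 0.0
    return len(transcription) / seconds


def quality_tags(transcription, duration_ms):
    """Return the applicable quality tags joined by '|' (empty string if none)."""
    seconds = duration_ms / 1000.0
    rate = char_per_sec(transcription, duration_ms)
    tags = []
    if rate < 3:
        tags.append("transcription-length")  # very little text for the audio length
    if rate > 30:
        tags.append("speech-rate")  # implausibly much text for the audio length
    if seconds < 2:
        tags.append("short-audio")
    if seconds > 30:
        tags.append("long-audio")
    return "|".join(tags)


# Made-up examples: a 1.5-second clip with 3 characters, and a 40-second clip.
print(quality_tags("abc", 1500))
print(quality_tags("a" * 400, 40000))
```

Comparing the output of such a function with the `quality_tags` column is one way to check how the real tags were derived before relying on them for filtering.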

Get involved!

Community links

Contribute

Acknowledgements

The recording of spontaneous speech for this dataset was made possible by volunteer contributions from individuals who are not cited here for privacy reasons, but whose invaluable contribution is acknowledged.

Datasheet authors

Emmanuel Ngue Um <ngueum@gmail.com>

Funding

The compilation of this dataset was made possible thanks to a grant awarded by the Mozilla Foundation.

Licence

This dataset is released under the Creative Commons Zero (CC0) licence. By downloading this data, you agree not to determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.