Common Voice Spontaneous Speech 1.0 - Betawi
Locale: bew
Size: 193.50 MB
Task: ASR
Format: MP3
License: CC-0
Betawi - Betawi (bew)
This datasheet is for version 23.0 of the Mozilla Common Voice Spontaneous Speech dataset for Betawi (bew). The dataset contains 11 hours of recorded speech (11 hours validated) from 21 speakers.
Language
Betawi, known in full as Melayu-Betawi, belongs to the Austronesian language family. It is regarded as a Malay dialect, but historically it developed in contact with other major languages such as Arabic, Hokkien, Sundanese, Javanese, and the Malay of Sumatra, with a small contribution from Portuguese and Dutch. Its vitality status is Endangered according to https://www.ethnologue.com/language/bew/. Today, standard Indonesian and English influence native speakers, so code switching and code mixing occur in spontaneous speech. The variety in this dataset is Betawi Ora, or Betawi Pinggiran (Peripheral Betawi), recorded in several locations in Bekasi District/City, West Java Province, Indonesia. This variety is in an unusual geopolitical position: it is spoken only within the community and is not taught at school. Instead, the community is taught Sundanese, the dominant language of West Java Province.
Demographic information
The dataset includes the following distribution of age and gender.
Gender
Self-declared gender information. Frequency refers to the number of clips annotated with this gender.
Age
Self-declared age information. Frequency refers to the number of clips annotated with this age band.
Transcriptions
The transcriptions use the general Latin script and distinguish three allophonic variants of /e/: /é/, /è/, and /e/.
Writing system
Historically, this language was written in Pegon, an Arabic-based script, but the Latin script has now been adopted. The writing system in this dataset uses the general Latin script and distinguishes three allophonic variants of /e/: /é/, /è/, and /e/.
Symbol table
a b c d é è ȇ e f g h i j k l m n o p q r s t u v w y z
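The symbol table can be used to check transcriptions for unexpected characters. The following is a minimal sketch, not part of the dataset tooling: the function name `out_of_table` is hypothetical, and it assumes that uppercase letters should be folded to lowercase and that accented letters may arrive either precomposed or as a base letter plus a combining mark.

```python
import unicodedata

# Symbol table copied from this datasheet, normalized to NFC so that each
# entry is a single precomposed code point.
SYMBOLS = {
    unicodedata.normalize("NFC", s)
    for s in "a b c d é è ȇ e f g h i j k l m n o p q r s t u v w y z".split()
}


def out_of_table(text):
    """Return the set of letters in `text` that are not in the symbol table.

    Hypothetical helper: punctuation, digits, and whitespace are ignored.
    """
    # NFC composes a base letter and a combining accent into one code point,
    # so "e" + U+0301 compares equal to "é".
    normalized = unicodedata.normalize("NFC", text).lower()
    return {ch for ch in normalized if ch.isalpha() and ch not in SYMBOLS}
```

For example, `out_of_table("taxi")` reports `x`, which does not appear in the symbol table, while the sample question and response lines in this datasheet produce an empty set.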
Questions
There follows a randomly selected sample of questions from the corpus.
Begimané pendidikan keluargé di masyarakat sekitar Ente?
Pigimané masyarakat di lingkungan Ente ngejagé atow melestarikan alam di sekitar?
Menurut Ente déwék, seberapé besar peran tuh kesenian buat acaré khusus?
Elmu apé nyang bakalan penting buat dipelajarin di masé depan?
Begimané caré kité ngedidik generasi mudé biar lebih peduli ngejagé lingkungan?
Responses
There follows a randomly selected sample of transcribed responses from the corpus.
Kalo di sini mah pendidikan paling utama, ya minimal lulus SMA. Jadinya di sini mah bagan emaknya pada kuli nandur yang penting anaknya sekolȇh. Jadinya diusahain banget pendidikan di sini. Bagan boleh utang kék, èmaknya boleh kuli nandur, kuli nyuci, yang penting anaknyé sekolȇh.
Recommended post-processing
(1) Pay attention to non-linguistic elements, such as fillers. (2) Make sure your machine learning model does not treat suprasegmental features as distinctive: intonation, for example, does not change a word or its meaning.
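One way to handle fillers is to separate them from the other tokens so that they can be inspected, counted, or dropped as needed. The sketch below is an illustration under stated assumptions: the function `split_fillers` is hypothetical, and the filler inventory is a placeholder that is not taken from this dataset and should be replaced with the forms actually found in the transcriptions.

```python
import re
import unicodedata
from collections import Counter

# Placeholder inventory: NOT taken from the dataset. Replace it with the
# filler forms actually observed in the transcriptions.
HYPOTHETICAL_FILLERS = frozenset({"eh", "ee", "hmm"})


def split_fillers(transcript, fillers=HYPOTHETICAL_FILLERS):
    """Return (tokens without fillers, Counter of the fillers seen)."""
    # NFC keeps accented letters as single code points so that \w+ does not
    # split a word at a combining mark.
    text = unicodedata.normalize("NFC", transcript).lower()
    tokens = re.findall(r"\w+", text)
    kept = [t for t in tokens if t not in fillers]
    seen = Counter(t for t in tokens if t in fillers)
    return kept, seen
```

Returning the filler counts alongside the cleaned tokens keeps the decision about whether to remove fillers, or model them explicitly, with the user.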
Community links
https://referensi.data.kemendikdasmen.go.id/budayakita/wbtb/objek/AA000491
https://petabahasa.kemdikbud.go.id/ (The Peta Bahasa website does not consider the Betawi language to be part of Indonesia, particularly in Jakarta and West Java Province.)
Contribute
Datasheet authors
Yacub Fahmilda (yacub.fahmilda@gmail.com)
Riska Legistari Febri (Riskalegistari25@gmail.com)
Funding
This dataset was fully funded by the Open Multilingual Speech Fund.
Licence
This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.