Common Voice Spontaneous Speech 1.0 - Betawi

Locale: bew

Size: 193.50 MB

Task: ASR

Format: MP3

License: CC-0


Betawi — Betawi (bew)

This datasheet is for version 23.0 of the Mozilla Common Voice Spontaneous Speech dataset for Betawi (bew). The dataset contains 11 hours of recorded speech (11 hours validated) from 21 speakers.

Language

Betawi, whose full name is Melayu-Betawi, belongs to the Austronesian language family. It is considered a Malay dialect, but historically it developed in contact with other major languages, such as Arabic, Hokkien, Sundanese, Javanese, and the Malay of Sumatra, with a small contribution from Portuguese and Dutch. Its vitality status is Endangered according to https://www.ethnologue.com/language/bew/. At present, standard Indonesian and English influence native speakers, so code switching and code mixing occur in spontaneous speech. The variety in this dataset is Betawi Ora, or Betawi Pinggiran (Peripheral Betawi), recorded in several locations in Bekasi District/City, West Java Province, Indonesia. This variety is unusual in its geopolitical situation: the language is spoken only within the community and is not taught at school. Instead, the community is taught Sundanese, which is dominant in West Java Province in general.

Demographic information

The dataset includes the following distribution of age and gender.

Gender

Self-declared gender information. Frequency refers to the number of clips annotated with this gender.

Age

Self-declared age information. Frequency refers to the number of clips annotated with this age band.

Transcriptions

The transcription system uses the general Latin script, but distinguishes three allophonic variants of /e/: /é/, /è/, and /e/.

Writing system

Historically, this language was written in Pegon, an Arabic script, but the Latin script has now been adopted. The writing system in this dataset uses the general Latin script, but distinguishes three allophonic variants of /e/: /é/, /è/, and /e/.
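For comparisons that should ignore the written distinction between the e variants, a minimal Python sketch follows. It is an illustration under assumptions, not part of the dataset tooling: the function name fold_e_variants is hypothetical, and the mapping also folds ȇ, which appears in the symbol table. Whether folding is appropriate depends on the task, since it discards a distinction that the transcribers recorded.

```python
import unicodedata

# Accented e variants mapped to plain e (lowercase and uppercase).
# Written as escapes so the table does not depend on file encoding.
E_VARIANTS = str.maketrans({
    "\u00e9": "e",  # é
    "\u00e8": "e",  # è
    "\u0207": "e",  # ȇ
    "\u00c9": "E",  # É
    "\u00c8": "E",  # È
    "\u0206": "E",  # Ȇ
})


def fold_e_variants(text: str) -> str:
    """Return text with the accented e variants replaced by plain e.

    The text is first normalized to NFC so that a base letter followed by
    a combining accent is treated the same as the precomposed character.
    """
    return unicodedata.normalize("NFC", text).translate(E_VARIANTS)


if __name__ == "__main__":
    print(fold_e_variants("yang penting anaknyé sekol\u0207h"))
```

Keeping the original transcript alongside the folded copy lets both forms be used, for example the original for training and the folded copy for a lenient error rate.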

Symbol table

a b c d é è ȇ e f g h i j k l m n o p q r s t u v w y z
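The symbol table can be used to screen transcripts for unexpected letters. The Python sketch below is one possible check, not an official validator: the function name out_of_table is hypothetical, and it assumes that case, digits, punctuation, and whitespace should be ignored, because the table lists lowercase letters only.

```python
import unicodedata

# Symbol table copied from this datasheet, normalized to NFC.
SYMBOLS = set(
    unicodedata.normalize(
        "NFC", "a b c d é è ȇ e f g h i j k l m n o p q r s t u v w y z"
    ).split()
)


def out_of_table(text: str) -> list[str]:
    """Return the sorted letters in text that are not in the symbol table.

    The text is NFC-normalized and lowercased first. Non-letter characters
    (digits, punctuation, spaces) are skipped.
    """
    normalized = unicodedata.normalize("NFC", text).lower()
    return sorted({ch for ch in normalized if ch.isalpha() and ch not in SYMBOLS})


if __name__ == "__main__":
    sample = "Begimané pendidikan keluargé di masyarakat sekitar Ente?"
    print(out_of_table(sample))
```

A non-empty result does not necessarily mean an error. It may point to a loanword or a code-switched segment, which the Language section notes can occur in spontaneous speech.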

Questions

There follows a randomly selected sample of questions from the corpus.

Begimané pendidikan keluargé di masyarakat sekitar Ente?
Pigimané masyarakat di lingkungan Ente ngejagé atow melestarikan alam di sekitar?
Menurut Ente déwék, seberapé besar peran tuh kesenian buat acaré khusus?
Elmu apé nyang bakalan penting buat dipelajarin di masé depan?
Begimané caré kité ngedidik generasi mudé biar lebih peduli ngejagé lingkungan?

Responses

There follows a randomly selected sample of transcribed responses from the corpus.

Kalo di sini mah pendidikan paling utama, ya minimal lulus SMA. Jadinya di sini mah bagan emaknya pada kuli nandur yang penting anaknya sekolȇh. Jadinya diusahain banget pendidikan di sini. Bagan boleh utang kék, èmaknya boleh kuli nandur, kuli nyuci, yang penting anaknyé sekolȇh.

Recommended post-processing

(1) Account for non-linguistic elements, such as fillers. (2) Make sure your machine learning model does not treat suprasegmental features, such as intonation that does not change a word or its meaning, as distinctive.
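As a rough illustration of recommendation (1), the Python sketch below separates filler tokens from the rest of a transcript. The datasheet does not list the fillers used in the corpus, so PLACEHOLDER_FILLERS contains made-up placeholder tokens that should be replaced after inspecting the data. The function name split_fillers is likewise hypothetical.

```python
import re
import unicodedata

# Placeholder tokens only (an assumption for illustration). Replace them with
# the fillers actually observed in the corpus.
PLACEHOLDER_FILLERS = frozenset({"eh", "em"})


def split_fillers(transcript, fillers=PLACEHOLDER_FILLERS):
    """Split a transcript into (kept_tokens, filler_tokens).

    The transcript is NFC-normalized and lowercased, then tokenized on runs
    of word characters, so punctuation is dropped.
    """
    text = unicodedata.normalize("NFC", transcript).lower()
    tokens = re.findall(r"\w+", text)
    kept = [tok for tok in tokens if tok not in fillers]
    found = [tok for tok in tokens if tok in fillers]
    return kept, found


if __name__ == "__main__":
    kept, found = split_fillers("Eh, kalo di sini mah, em, pendidikan paling utama")
    print(kept)
    print(found)
```

Returning the fillers separately, instead of deleting them, leaves the choice open between removing them, tagging them, or keeping them as ordinary tokens.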

Community links

Contribute

Datasheet authors

Funding

This dataset was fully funded by the Open Multilingual Speech Fund.

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.