License: CC0-1.0
Steward: Common Voice
Task: ASR
Release Date: 12/5/2025
Format: MP3
Size: 172.41 MB
A collection of spontaneous spoken phrases in Toba Qom.
Forbidden Usage
It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.
Intended Use
This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.
This datasheet is for version 2.0 of the Mozilla Common Voice Spontaneous Speech dataset for Toba Qom (tob). The dataset contains 1573 clips representing 12 hours of recorded speech (11 hours validated) from 25 speakers.
Toba Qom is an endangered language spoken in the Gran Chaco, a region spanning parts of Argentina, Paraguay and Bolivia. According to official demographic data from the Argentine state, the Qom population is estimated at 80,000, of whom approximately 49% are speakers of the oral form of the language. The term "qom" describes a population that has traditionally been organised into multiple extended families or groups. These groups, traditionally hunter-gatherers, share the language and the sociocultural traits that are essential to Qom culture.
The contributors to this corpus come from the Chaco and Formosa provinces in Argentina. This area encompasses four ethnodialectal subregions with distinct self-identification terms (Messineo, 1991) [1].
| Area | Province | Locations | Variant (self-identification) |
|---|---|---|---|
| Northwest | Chaco | El Colchón, El Espinillo and the Bermejo river’s surroundings | dapigemlʔek |
| Northcenter | Chaco | Pampa del Indio | noʔolgaGanaq |
| Southcenter | Chaco | Sáenz Peña, Machahay, Quitilipi | lʔañaGashek |
| Southeast | Chaco, Eastern Formosa | Las Palmas, Clorinda | takshek |
For further information, see [2], [1] and [3].
| Split | Count |
|---|---|
| Train | 946 |
| Test | 367 |
| Dev | 341 |
Prompts: 136
Duration: 11:02:08 [h:m:s]
Avg. Transcription Len: 133
Avg. Duration: 25.26[s]
Valid Duration: 38546.136[s]
Total hours: 11.04[h]
Valid hours: 10.71[h]
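The duration figures above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the Common Voice tooling: the helper names are hypothetical, and the clip count of 1573 is taken from the summary earlier in this datasheet.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'h:m:s' string such as '11:02:08' to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hours(seconds: float) -> float:
    """Convert seconds to hours, rounded to two decimals as in the figures above."""
    return round(seconds / 3600, 2)


total_seconds = hms_to_seconds("11:02:08")        # reported as Duration
total_hours = seconds_to_hours(total_seconds)     # compare with Total hours
valid_hours = seconds_to_hours(38546.136)         # compare with Valid hours
avg_duration = round(total_seconds / 1573, 2)     # compare with Avg. Duration

print(total_seconds, total_hours, valid_hours, avg_duration)
```

Under these assumptions the recomputed values agree with the reported total hours, valid hours and average clip duration.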
This corpus consists of approximately 1350 utterances, totalling about 10 hours of transcribed speech. The dataset does not focus on any particular domain or topic. The question set consists of 150 questions covering a wide range of general topics about lifestyle and culture (hobbies, education, traditions, nature, food, society, technology, relationships, art, etc.). Speakers responded to each question based on their personal beliefs, experiences and knowledge, mainly to describe their culture or to share their personal opinion about how they interact within society (e.g. how they would find a lawyer or how they make a medical appointment).
The dataset does not contain any data that might be considered sensitive for others, to the best of the author's knowledge.
The data collection involved a coordinator (a PhD student), a linguist known by the Qom contributors (researcher), and three field-work assistants (linguists). The data was collected mainly by the Qom contributors using their own phones at home, after receiving technical training. A small proportion of data was recorded in an academic setting (e.g. research institute) during the training phase.
The transcriptions follow the orthographic system proposed by Buckwalter (2001) [2]:
a c ch d e g hu i j l ll m n ñ o p q qu r s sh t u v x y ỹ ’
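Because the alphabet above contains multi-character units (ch, hu, ll, qu, sh), anyone building a character-level vocabulary for ASR may want to split words into these units rather than into single code points. The sketch below is one possible approach and is not taken from the corpus tooling: the function name `split_graphemes` is hypothetical, and greedy longest-match is an assumption that may not resolve every ambiguous sequence the way a linguist would.

```python
import unicodedata

# Orthographic units as listed in this datasheet, NFC-normalised so that
# letters such as ỹ are compared as single code points.
ALPHABET = [
    unicodedata.normalize("NFC", unit)
    for unit in "a c ch d e g hu i j l ll m n ñ o p q qu r s sh t u v x y ỹ ’".split()
]

# Try longer units first so that 'ch' is not read as 'c' followed by 'h'.
_BY_LENGTH = sorted(ALPHABET, key=len, reverse=True)


def split_graphemes(word: str) -> list[str]:
    """Split a word into orthographic units using greedy longest match.

    Characters outside the listed alphabet (spaces, hyphens, straight
    apostrophes, and so on) are passed through as single symbols.
    """
    text = unicodedata.normalize("NFC", word.lower())
    units = []
    i = 0
    while i < len(text):
        for unit in _BY_LENGTH:
            if text.startswith(unit, i):
                units.append(unit)
                i += len(unit)
                break
        else:
            # No listed unit matches here: keep the character unchanged.
            units.append(text[i])
            i += 1
    return units


print(split_graphemes("chaco"))
print(split_graphemes("huo’o"))
```

Note that some sample responses use a straight apostrophe (') where the alphabet lists ’; in this sketch such characters simply fall through as single symbols, so a normalisation step may be needed before training.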
There follows a randomly selected sample of questions used in the corpus.
¿Negue’t ca ỹataqta ’anqopita can ñaq edaxaatac?
¿Negue’t na dalaxaic no’onatac ’auaỹaten da ’au’ot nagui?
¿’Eetec ca n’qochenaxac na shiỹaxaupi yi ’adma’ nquicapiguic da qalota?
Eetec na napo yi arma?
¿Negue’t taxa ca ‘auauotaique da lmenec ca qoueta’a ’adma’?
There follows a randomly selected sample of transcribed responses from the corpus.
Aýem ýataqta yoqopita can ñaq sedaxaatac satapoigui asa'aso paxaguenaxaqui dam ýataqta yoqopita naxa da qai'axaia asa'aso huo'o da l-lamaxa ichoxot da ne'enapi napaxaguetacpi ñaqpiolec cad'ac nmatec cada'ac da'ashe nache da qai'axaia asa'aso natoina nache huo'o da lpe'e na ñaqpiolec naxa eso so ýoqta yoqopita
Aýem na saýaten nagui saýaten da ño'oxosheguem na noýic
Da huo'o ña nqui'c shiyaxaua cha'aye nachena na jec ilotaique cam l'onatac dam yanatac cam ilotaique nache ishet da machiguiñi
yi imaa' chaco onaxaic shinatap .cahioloqta da nopo
Yi 'ima' qaiuen aca menaxanaxaqui da ýotta'a't na lmenec nallec
To be updated in the next release. Contact the author for details.
Each row of a tsv file represents a single audio clip, and contains the following information:
client_id - hashed UUID of a given user
audio_id - numeric id for audio file
audio_file - audio file name
duration_ms - duration of audio in milliseconds
prompt_id - numeric id for prompt
prompt - question for user
transcription - transcription of the audio response
votes - number of people who approved a given transcript
age - age of the speaker [4]
gender - gender of the speaker [4]
language - language name
split - for data modelling, the subset of the data that this clip belongs to
char_per_sec - how many characters of transcription per second of audio
quality_tags - automated assessments of the transcription-audio pair, separated by |, with the following possible values:
transcription-length - fewer than 3 characters of transcription per second of audio
speech-rate - more than 30 characters of transcription per second of audio
short-audio - audio length under 2 seconds
long-audio - audio length over 30 seconds
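The quality tags described above can be approximated from the `duration_ms` and `transcription` fields. The following is a minimal sketch, assuming that characters are counted with `len()` (spaces included) and that the thresholds are strict inequalities; the actual Common Voice pipeline may count or compare differently. The function name and the in-memory TSV rows are hypothetical.

```python
import csv
import io


def quality_tags(duration_ms: int, transcription: str) -> str:
    """Return '|'-separated tags derived from the thresholds in the field list."""
    seconds = duration_ms / 1000
    # Characters of transcription per second of audio (0 for empty audio).
    chars_per_sec = len(transcription) / seconds if seconds > 0 else 0.0

    tags = []
    if chars_per_sec < 3:
        tags.append("transcription-length")
    if chars_per_sec > 30:
        tags.append("speech-rate")
    if seconds < 2:
        tags.append("short-audio")
    if seconds > 30:
        tags.append("long-audio")
    return "|".join(tags)


# Hypothetical two-row TSV containing only the columns needed here.
sample_tsv = (
    "audio_id\tduration_ms\ttranscription\n"
    "1\t1500\tda\n"
    "2\t25000\t" + "a" * 100 + "\n"
)

for row in csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"):
    tags = quality_tags(int(row["duration_ms"]), row["transcription"])
    print(row["audio_id"], tags or "(no tags)")
```

A clip with no tags yields an empty string, which matches the convention of a |-separated field with zero entries.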
Belu Ticona <mticonao@gmu.edu>
Paola Cúneo
Antonios Anastasopoulos
B. Ticona, P. Cúneo, A. Anastasopoulos. “Datasheet of Spontaneous Speech Corpus for Qom - Mozilla Common Voice”. Revised on Aug 29th, 2025. [Publication Date].
The speaker collaborators were funded by Mozilla Common Voice. The project coordinator was partially funded by the US NSF grants 2346334 and 2439202.
This dataset was partially funded by the Open Multilingual Speech Fund managed by Mozilla Common Voice.
This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to attempt to determine the identity of speakers in the dataset.
1. Messineo, Cristina. 1991. Variantes dialectales del complejo lingüístico toba. Hacia una nueva carta étnica del Gran Chaco II: 12-22. Las Lomitas: Centro del Hombre Antiguo Chaqueño.
2. Buckwalter, Alberto. 2001 [1980]. Vocabulario toba. Formosa / Indiana: Equipo Menonita / Mennonite Board of Missions. Ed. revisada.
3. Messineo, Cristina. 2003. Lengua Toba (guaycurú). Aspectos gramaticales y discursivos. Lincom Studies in Native American Linguistics 48. Munich: Lincom Europa.
4. For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.