Govtube - Kuña Rembiasa

License:

CC0-1.0

Steward:

Mozilla Foundation

Task: ASR

Release Date: 2/13/2026

Format: TSV, MP3

Size: 52.52 MB


Description

This dataset was created by extracting the audio and subtitles from videos of the radionovela "Kuña Rembiasa", created by the Fundación Kuña Aty and published by the US Embassy in Paraguay. It contains about 1 hour of audio, mostly in Spanish, with around 10 minutes of mixed colloquial Paraguayan Spanish and Guarani.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

None

Forbidden Usage

None

Processes

Intended Use

- Evaluating ASR for Paraguayan Spanish and Guarani-Spanish code-mixing.

Metadata

Source

The dataset comprises the following radionovelas, downloaded from the YouTube channel of the US Embassy in Paraguay.

  • Kuña Rembiasa - Radionovela A la vuelta de la esquina [8:00]

  • Kuña Rembiasa - Radionovela Crimen sin fronteras [8:44]

  • Kuña Rembiasa - Radionovela La cruda realidad [8:05]

  • Kuña Rembiasa - Radionovela Malas Conexiones [8:09]

  • Kuña Rembiasa - Radionovela Negocio Familiar [9:39]

  • Kuña Rembiasa - Radionovela No vayas [8:55]

  • Kuña Rembiasa - Radionovela Tengo derechos, respetalos [7:35]

  • Kuña Rembiasa - Radionovela Todo queda en familia [8:07]

Together these give a total of [1:07:14] of transcribed audio.
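The total can be checked by summing the listed episode durations. This is a minimal sketch; the helper names `to_seconds` and `format_hms` are illustrative and not part of the dataset tooling.

```python
# Episode durations (mm:ss) as listed above.
durations = ["8:00", "8:44", "8:05", "8:09", "9:39", "8:55", "7:35", "8:07"]


def to_seconds(mmss: str) -> int:
    """Convert an mm:ss string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def format_hms(total_seconds: int) -> str:
    """Format a number of seconds as h:mm:ss."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total = sum(to_seconds(d) for d in durations)
print(format_hms(total))  # 1:07:14
```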

Original description

In a joint effort with the Fundación Kuña Aty, we launched the Kuña Rembiasa campaign, which seeks to raise public awareness of the problem of human trafficking.

The initiative consists of showing previews of stories about trafficking situations on the television screens installed in long-distance buses that depart daily from different points across the country. These stories continue as audio dramas, which we present in this space.

Processing

The data was first extracted from YouTube as audio files (MP3) and gold-standard transcriptions (VTT). pydub was then used to clip the audio according to the transcription timestamps and to build a TSV file pairing each audio clip with its transcription. Finally, each transcript was tagged with language tags, either es (Spanish only) or es,gn (Guarani and Spanish).
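The clipping step could look roughly like the sketch below. It is not the script used to build this dataset: the function names, the simplified VTT parsing, and the output naming scheme are all assumptions made for illustration. Only `AudioSegment.from_mp3`, millisecond slicing, and `export` are taken from pydub itself, which also needs ffmpeg to decode MP3.

```python
import re
from pathlib import Path

# Matches a WebVTT cue timing line such as "00:00:01.500 --> 00:00:04.250".
# The hours field is optional in WebVTT.
CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _to_ms(hours, minutes, seconds, millis):
    """Convert the captured timestamp fields to milliseconds."""
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def parse_vtt_cues(vtt_text):
    """Return (start_ms, end_ms, text) tuples for each cue in WebVTT text."""
    cues = []
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIMING.search(line)
            if match:
                fields = match.groups()
                # Assumption: subtitle lines are joined with a space here;
                # the released transcripts mark these line breaks with '#'.
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append((_to_ms(*fields[:4]), _to_ms(*fields[4:]), text))
                break
    return cues


def clip_audio(mp3_path, cues, out_dir):
    """Write one MP3 clip per cue and return (file_name, text) rows."""
    from pydub import AudioSegment  # third-party; imported lazily

    audio = AudioSegment.from_mp3(mp3_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for index, (start_ms, end_ms, text) in enumerate(cues):
        # Hypothetical naming scheme: <source stem>.<six-digit cue index>.mp3
        name = f"{Path(mp3_path).stem}.{index:06d}.mp3"
        audio[start_ms:end_ms].export(str(out_dir / name), format="mp3")
        rows.append((name, text))
    return rows
```

The rows returned by `clip_audio` can then be written out as tab-separated lines, with the language tag added as a third column afterwards.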

Format

The directory contains one subdirectory with the audio files and one TSV with the transcriptions.

audio/*.mp3
transcripts.tsv

The transcripts.tsv file has three columns:

  • audio_file

  • transcript

  • locales: es (Spanish) or es,gn (Mixed Spanish and Guarani)
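The file can be read with the standard `csv` module. The sketch below assumes the file is UTF-8, may or may not start with a header row, and may carry empty trailing fields on some lines; `read_transcripts` is an illustrative name, not part of the dataset.

```python
import csv
import io


def read_transcripts(handle):
    """Yield one dict per row with audio_file, transcript and locales."""
    # QUOTE_NONE keeps any quotation marks in the transcripts as plain text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        audio_file, transcript, locales = row[:3]  # ignore empty extra fields
        if audio_file == "audio_file":
            continue  # skip a header row, if the file has one
        yield {
            "audio_file": audio_file,
            "transcript": transcript,
            "locales": locales.split(","),  # "es,gn" -> ["es", "gn"]
        }


# Small in-memory demonstration with made-up rows.
demo = io.StringIO("a.000001.mp3\tHola\tes\t\t\nb.000002.mp3\tNde piko\tes,gn\n")
rows = list(read_transcripts(demo))
print(rows[1]["locales"])  # ['es', 'gn']
```

For the real file, pass a handle opened with `open("transcripts.tsv", encoding="utf-8", newline="")`.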

Sample

a99433b20f1089be507296edbb96c7d1.000011.mp3	(Animosa) - ¡Dale Naty, ojalá que puedas irte con nohótra!! Avisame cuanto ante, porque hesa’ima ñande tiempo hina…	es,gn									
a99433b20f1089be507296edbb96c7d1.000012.mp3	Y Nathalia consiguió la aprobación de su familia para viajar con Jackie y así lo hicieron. Esta etapa de la Trata de Personas es la que se conoce como La Captación.	es			
a99433b20f1089be507296edbb96c7d1.000013.mp3	Nde Jacinta y cuando llegamo alla piko donde vamo vivir? Ay mi querida! No te vas a ite que preocupar por eso! Arreglado pa ite pa ko pea amo.	es,gn					

Recommended postprocessing

The transcripts contain a large number of special symbols and numerals. Depending on the task, these should be stripped before training or evaluation. The symbol # in a transcript marks a newline in the original subtitle. Affect is indicated in several places by an expression in parentheses, for example (Animosa), and music is indicated the same way: (Música). Parentheses are not used for anything else, so these annotations can be stripped with a simple regular expression.
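A minimal sketch of this cleanup follows. The function name `clean_transcript` and the choice to replace # with a space are assumptions. Punctuation and numerals are left untouched because their handling depends on the task.

```python
import re

# Any parenthesised annotation, e.g. (Animosa) or (Música).
PARENTHETICAL = re.compile(r"\([^)]*\)")


def clean_transcript(text):
    """Remove parenthesised annotations and '#' line-break markers."""
    text = PARENTHETICAL.sub(" ", text)
    text = text.replace("#", " ")  # '#' marks a newline in the subtitle
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


print(clean_transcript("(Animosa) - ¡Dale Naty!#Avisame"))
# - ¡Dale Naty! Avisame
```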