Bangor Talk Siarad Welsh-English corpus

License:

GPL-3.0

Steward:

Community

Task: ASR

Release Date: 2/20/2026

Format: MP3, CHA, TSV

Size: 2.13 GB


Description

The Siarad Welsh-English corpus contains around 450,000 words: 84% Welsh, 4% English, and 13% indeterminate (words that appear in the dictionaries of both languages). The dataset includes the audio recordings, the transcriptions and glosses in CHAT format, and word-level analyses of the transcriptions in .tsv files.

Specifics

Licensing

GNU General Public License v3.0 or later (GPL-3.0-or-later)

https://spdx.org/licenses/GPL-3.0-or-later.html

Considerations

Restrictions/Special Constraints

In line with the GPLv3 licence, note that permission is NOT granted to use any of the material on this website to train an AI large language model UNLESS all the training data for that LLM is made publicly available.

Forbidden Usage

N/A

Processes

Intended Use

Research on bilingualism and language contact, code-switching ASR

Metadata

The Siarad Welsh-English corpus contains around 450,000 words: 84% Welsh, 4% English, and 13% indeterminate (words that appear in the dictionaries of both languages).
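As a rough sketch, proportions like these could be recomputed from the word-level .tsv files. The column layout of those files is not described here, so the sketch assumes each file has a header row and leaves the name of the language column as a parameter that the caller supplies. The function names `tally_column` and `percentages` are illustrative only.

```python
import csv
from collections import Counter


def tally_column(tsv_path, column):
    """Count the values found in one column of a tab-separated file.

    Assumes the file has a header row; `column` is whatever name the
    language-tag column carries in the actual files (not confirmed here).
    """
    counts = Counter()
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters inside words from being
        # interpreted as CSV quoting.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            counts[row.get(column) or ""] += 1
    return counts


def percentages(counts):
    """Convert raw counts to whole-number percentages."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: round(100 * value / total) for key, value in counts.items()}
```

Each share is rounded independently, so the resulting figures need not add up to exactly 100.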

Overview

This corpus, along with the Bangor and Miami corpora, was assembled at the former ESRC Centre for Research on Bilingualism in Theory and Practice at Bangor University by the following researchers: Prof Margaret Deuchar, Dr Diana Carter, Dr Peredur Davies, Dr Kevin Donnelly, Dr Jon Herring, Dr María del Carmen Parafita Couto, Dr Jonathan Stammers, Fraibet Aveledo, Marika Fusser, Lowri Jones, Siân Lloyd-Williams, Myfyr Prys, Elen Robert.

The Siarad corpus (version 1) was originally distributed on CD in 2009, with manual glosses only, and is also available on TalkBank. The Siarad data on this website (version 1.5) includes both manual and automatically generated glosses. Version 2 of the Siarad texts, which includes some corrections and emendations and has manual glosses only, is available in a GitHub repository.

Citation

Please refer to the corpus as the ‘Bangor Siarad’ corpus, and provide a link to this MDC dataset if you download it from here.

Please also cite:

Deuchar, M., P. Davies, J. Herring, M. Parafita Couto, and D. Carter (2014). Building bilingual corpora. In: E. M. Thomas and I. Mennen (Eds.), Advances in the Study of Bilingualism, pp. 93–111. Bristol: Multilingual Matters.

Processing

The audio files were transcribed using CHAT conventions. The Siarad corpus was initially glossed manually. Subsequently, all three corpora were glossed automatically.
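A minimal sketch of reading such a file, relying only on general CHAT conventions: header lines start with `@`, speaker (main) tiers with `*`, dependent tiers such as glosses with `%`, and continuation lines with a tab. The function name `parse_chat` and the sample text are made up for illustration; the tier names actually used in the Siarad files are documented in Siarad_doc.pdf, and real files may carry extra markup (for example timing marks) that this sketch does not strip.

```python
def parse_chat(text):
    """Group CHAT lines into utterances with their dependent tiers."""
    utterances = []
    current = None   # utterance currently being built
    last_key = None  # field that a continuation line should extend
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("@"):
            # Header line: not part of any utterance.
            current, last_key = None, None
        elif line.startswith("*"):
            # Main tier: "*SPK:<tab>words ..."
            speaker, _, content = line[1:].partition(":")
            current = {"speaker": speaker, "text": content.strip(), "tiers": {}}
            utterances.append(current)
            last_key = "text"
        elif line.startswith("%") and current is not None:
            # Dependent tier attached to the preceding main tier.
            name, _, content = line[1:].partition(":")
            current["tiers"][name] = content.strip()
            last_key = name
        elif line[0] in "\t " and current is not None and last_key is not None:
            # Continuation of the previous main or dependent tier.
            extra = line.strip()
            if last_key == "text":
                current["text"] += " " + extra
            else:
                current["tiers"][last_key] += " " + extra
    return utterances


# Invented sample, not taken from the corpus.
SAMPLE = (
    "@Begin\n"
    "*ABC:\tfirst utterance words\n"
    "\tcontinued here .\n"
    "%xyz:\tgloss for first utterance\n"
    "*DEF:\tsecond utterance .\n"
    "@End\n"
)

for utterance in parse_chat(SAMPLE):
    print(utterance["speaker"], utterance["text"], utterance["tiers"])
```

Keeping dependent tiers in a dictionary keyed by tier name means the manual and automatic glosses can sit side by side on the same utterance, whatever names the files use for them.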

Format

The dataset contains the following directories:

  • audio/ contains the mp3 files

  • chat/ contains CHAT files with transcriptions and glosses of the audio

  • word_level_tsvs/ contains the transcriptions and glosses corresponding to each audio/CHAT file, one word per line.

  • metadata/ contains speaker metadata and the speaker questionnaires.

The Siarad_doc.pdf file contains the complete documentation for the corpus.
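The layout above can be walked with a short script. This is a sketch under one assumption that is not confirmed here: that the audio, CHAT and .tsv files for a recording share the same base file name. The helper names `index_recordings` and `missing_parts` are illustrative.

```python
from pathlib import Path

# Directory names as listed above, mapped to short labels.
FOLDERS = {"audio": "audio", "chat": "chat", "tsv": "word_level_tsvs"}


def index_recordings(root):
    """Map each base file name to the files found for it in each directory."""
    root = Path(root)
    index = {}
    for kind, folder in FOLDERS.items():
        directory = root / folder
        if not directory.is_dir():
            continue
        for path in sorted(directory.iterdir()):
            if path.is_file():
                # Assumption: matching files share the same stem.
                index.setdefault(path.stem, {})[kind] = path
    return index


def missing_parts(index):
    """List, per base name, which of the three file kinds were not found."""
    expected = set(FOLDERS)
    return {
        stem: sorted(expected - set(parts))
        for stem, parts in index.items()
        if expected - set(parts)
    }
```

Calling `missing_parts(index_recordings("path/to/dataset"))` on the unpacked download gives a quick check that every recording has all three files before any processing starts; the path shown is a placeholder.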

Funding

The researchers gratefully acknowledge the support of the Arts and Humanities Research Council (AHRC), the Economic and Social Research Council (ESRC), the Higher Education Funding Council for Wales (HEFCW) and the Welsh Government.

Further information

For further information, please contact Peredur Webb-Davies.

More details about the corpora can be found in:

  • The monograph Building and Using the Siarad Corpus: Bilingual conversations in Welsh and English (Margaret Deuchar, Peredur Webb-Davies, and Kevin Donnelly), published by John Benjamins (2018). The first part of the book describes the methods used to build the Siarad corpus, while the second part describes various linguistic analyses of the corpus data.

  • The book chapter "Building bilingual corpora" (Margaret Deuchar, Peredur Davies, Jon Russell Herring, M. Carmen Parafita Couto and Diana Carter). In: E Môn Thomas and I Mennen (Eds.), Advances in the Study of Bilingualism (2014) pp. 93–110. Multilingual Matters.

  • The paper "Using constraint grammar in the Bangor Autoglosser to disambiguate multilingual spoken text" (Kevin Donnelly and Margaret Deuchar). In: Constraint Grammar Applications: Proceedings of the NODALIDA 2011 Workshop, Riga, Latvia. NEALT Proceedings Series, Tartu.