Common Voice Scripted Speech 23.0 - Cornish

Locale: kw

Size: 100.89 MB

Task: ASR

Format: MP3

License: CC-0


Kernowek — Cornish (kw)

This datasheet is for version 23.0 of the Mozilla Common Voice Scripted Speech dataset for Cornish (kw). The dataset contains 13 hours of recorded speech (13 hours validated) from 10 speakers.

Language

Cornish, or Kernewek, is a Brythonic language, alongside Breton and Welsh, and part of the Celtic branch of the Indo-European language family. It is an indigenous language of the United Kingdom, with most speakers located in Cornwall. In the 2021 UK Census, 567 people identified Cornish as their main language. UNESCO has classified its status as "severely endangered".

Variants

There are currently no variants defined for Cornish.

Predefined

There are currently no pre-defined accents.

User defined

There are currently no user-defined accents.

Demographic information

The dataset includes the following distribution of age and gender.

Gender

Self-declared gender information. Frequency refers to the number of clips annotated with this gender.

Gender              Frequency
male, masculine     0
female, feminine    3,550
undeclared          5,807

Age

Self-declared age information. Frequency refers to the number of clips annotated with this age band.

Age band      Frequency
teens         0
twenties      0
thirties      0
fourties      3,540
fifties       3,778
sixties       241
seventies     496
undeclared    1,302

Text corpus

The dataset contains 10.8 validated hours of speech from 10 unique contributors.

Type                 Count    Hours
Validated Clips      9,357    10.8
Invalidated Clips    0        0.00
Total Clips          9,357    10.8
  • Average sentence length (tokens): 6.4

  • Average sentence length (characters): 31

Writing system

Cornish has several writing systems in place. The majority of this dataset uses the Standard Written Form, established in 2008.

Symbol table

The dataset uses the following characters: ' - ! , . ? a b c d e f g h i j k l m n o p r s t u v w x y z
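
The symbol table can be used to flag unexpected characters in a transcript. The following is a minimal sketch, not part of the dataset tooling: the function name `unexpected_chars` is hypothetical, and the sketch assumes that text is lowercased before comparison and that spaces are permitted, neither of which is stated in this datasheet.

```python
# Characters from the symbol table above.
# Assumption: the space character is also allowed.
ALLOWED = set("'-!,.?abcdefghijklmnoprstuvwxyz ")


def unexpected_chars(sentence):
    """Return the set of characters not covered by the symbol table.

    Assumption: comparison is case-insensitive, so the sentence is
    lowercased first (the sample sentences contain capital letters).
    """
    return set(sentence.lower()) - ALLOWED


if __name__ == "__main__":
    # A sentence from the sample section, then one with a curly apostrophe.
    print(unexpected_chars("Esos. Yth esos ta ena y'n kornel."))
    print(unexpected_chars("Yth esos ta ena y\u2019n kornel."))
```

An empty result means every character in the sentence appears in the symbol table; anything else is a candidate for manual review.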

Sample

There follows a randomly selected sample of five sentences from the corpus.

  • A yllyn ni redya hemma?

  • Marthys ens i.

  • Esos. Yth esos ta ena y'n kornel.

  • A wrussyn ni diwrosa yn uskis?

  • Dha leveryans yw nebes da.

Sources

The text for this dataset comes from the following sources:

  • IndyLan Cornish course. Author: Cornish Language Office. Standard Written Form.

  • Individual sentences submitted by users through the Mozilla Common Voice interface (public domain)

Text domains

  • General — The majority of this dataset focuses on conversational phrases with the intention to cover a broad range of grammatical points.

Recommended post-processing

  • Check the Cornish text for Unicode errors. Any mis-encoded characters should be the apostrophe character (').
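
One possible reading of this recommendation is that apostrophe look-alikes should be mapped to the plain apostrophe. The sketch below illustrates that reading only: the function name `normalise_apostrophes` is hypothetical, and the list of variant characters is an assumption rather than something taken from the dataset.

```python
# Assumption: these look-alike characters stand in for the plain apostrophe.
# The list is illustrative and may need extending after inspecting the data.
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # right single quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
    "\u0060": "'",  # grave accent
    "\u00b4": "'",  # acute accent
}
_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def normalise_apostrophes(text):
    """Replace apostrophe look-alikes with the plain apostrophe (')."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(normalise_apostrophes("Yth esos ta ena y\u2019n kornel."))
```

Text that already uses the plain apostrophe passes through unchanged, so the function can be applied to every sentence.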

Contribute

Datasheet authors

Sam Rogerson cornishlanguage@cornwall.gov.uk

Funding

This dataset was partially funded by the Open Multilingual Speech Fund.

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.