Common Voice Scripted Speech 23.0 - Cornish
Locale: kw
Size: 100.89 MB
Task: ASR
Format: MP3
License: CC-0
Kernowek - Cornish (kw)
This datasheet is for version 23.0 of the Mozilla Common Voice Scripted Speech dataset
for Cornish (kw). The dataset contains 13 hours of recorded
speech (13 hours validated) from 10 speakers.
Language
Cornish, or Kernewek, is a Brythonic language, alongside Breton and Welsh, and belongs to the Celtic branch of the Indo-European language family. It is an indigenous language of the United Kingdom, with most speakers located in Cornwall. In the 2021 UK Census, 567 people identified Cornish as their main language. UNESCO has classified its status as "severely endangered".
Variants
There are currently no variants defined for Cornish.
Predefined
There are currently no pre-defined accents.
User defined
There are currently no user-defined accents.
Demographic information
The dataset includes the following distribution of age and gender.
Gender
Self-declared gender information. Frequency refers to the number of clips annotated with this gender.
Gender | Frequency |
---|---|
male, masculine | 0 |
female, feminine | 3,550 |
undeclared | 5,807 |
Age
Self-declared age information. Frequency refers to the number of clips annotated with this age band.
Age band | Frequency |
---|---|
teens | 0 |
twenties | 0 |
thirties | 0 |
fourties | 3,540 |
fifties | 3,778 |
sixties | 241 |
seventies | 496 |
undeclared | 1,302 |
Text corpus
The dataset contains 10.8 validated hours of speech from 10 unique contributors.
Type | Count | Hours |
---|---|---|
Validated Clips | 9,357 | 10.8 |
Invalidated Clips | 0 | 0.00 |
Total Clips | 9,357 | 10.8 |
Average sentence length (tokens): 6.4
Average sentence length (characters): 31
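As a quick consistency check, the figures reported in this datasheet can be cross-checked against each other. The sketch below uses only numbers copied from the tables above. The variable names are illustrative, and the average clip duration is approximate because the published hours figure is itself rounded.

```python
# Figures copied from the tables in this datasheet.
gender_counts = {"male, masculine": 0, "female, feminine": 3550, "undeclared": 5807}
age_counts = {
    "teens": 0, "twenties": 0, "thirties": 0, "fourties": 3540,
    "fifties": 3778, "sixties": 241, "seventies": 496, "undeclared": 1302,
}
validated_clips = 9357
validated_hours = 10.8  # rounded in the source table

# Both demographic breakdowns should account for every validated clip.
gender_total = sum(gender_counts.values())
age_total = sum(age_counts.values())

# Approximate mean clip duration in seconds.
avg_clip_seconds = validated_hours * 3600 / validated_clips

print(gender_total, age_total, round(avg_clip_seconds, 2))
```

Both totals match the 9,357 validated clips, and the mean clip duration comes out at roughly four seconds, which fits an average sentence length of about six tokens.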
Writing system
Cornish has several orthographies in use. The majority of this dataset uses the Standard Written Form, established in 2008.
Symbol table
The dataset uses the following characters: ' - ! , . ? a b c d e f g h i j k l m n o p r s t u v w x y z
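The symbol table can be used to flag sentences that contain unexpected characters. The following is a minimal sketch, not part of the Common Voice tooling. The function name `unexpected_chars` is hypothetical, and treating the comparison as case-insensitive and allowing the space character are assumptions, made because the table lists only lowercase letters while the sample sentences begin with capitals.

```python
# Characters copied from the symbol table, plus the space (an assumption).
ALLOWED = set("'-!,.?abcdefghijklmnoprstuvwxyz ")


def unexpected_chars(sentence: str) -> set:
    """Return the characters in `sentence` that are not in the symbol table.

    The sentence is lowercased first, so capital letters are accepted.
    """
    return {ch for ch in sentence.lower() if ch not in ALLOWED}


# A sentence from the sample passes; a typographic apostrophe is flagged.
print(unexpected_chars("A yllyn ni redya hemma?"))
print(unexpected_chars("Yth esos ta ena y\u2019n kornel."))
```

An empty result means that every character of the sentence appears in the table.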
Sample
There follows a randomly selected sample of five sentences from the corpus.
A yllyn ni redya hemma?
Marthys ens i.
Esos. Yth esos ta ena y'n kornel.
A wrussyn ni diwrosa yn uskis?
Dha leveryans yw nebes da.
Sources
The text for this dataset comes from the following sources:
IndyLan Cornish course. Author: Cornish Language Office. Standard Written Form.
Individual sentences submitted by users through the Mozilla Common Voice interface (public domain)
Text domains
General — The majority of this dataset focuses on conversational phrases with the intention to cover a broad range of grammatical points.
Recommended post-processing
Check the data for Unicode errors in the Cornish text, in particular apostrophe variants. Apostrophes should be the character ' .
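One way to apply this recommendation is to fold apostrophe look-alikes into the plain apostrophe (U+0027) before training or evaluation. The sketch below is illustrative only. The list of variant characters and the function name `normalise_sentence` are assumptions, and the NFC step is an extra precaution that the datasheet does not prescribe.

```python
import unicodedata

# Apostrophe look-alikes to fold into U+0027. This list is an assumption:
# right/left single quotes, modifier letter apostrophe, grave, acute.
APOSTROPHE_VARIANTS = "\u2019\u2018\u02bc\u0060\u00b4"
_APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})


def normalise_sentence(text: str) -> str:
    """Apply NFC normalisation, then map apostrophe variants to U+0027."""
    return unicodedata.normalize("NFC", text).translate(_APOSTROPHE_TABLE)


print(normalise_sentence("Yth esos ta ena y\u2019n kornel."))
```

Sentences that already use the plain apostrophe pass through unchanged, so the function can be applied to the whole corpus.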
Contribute
Datasheet authors
Sam Rogerson cornishlanguage@cornwall.gov.uk
Funding
This dataset was partially funded by the Open Multilingual Speech Fund.
Licence
This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.