Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. Access more than 470 high-quality global datasets, built by and for the community in a transparent and ethical way.

Datasets

Community

Bangor Talk Siarad Welsh-English corpus

Welsh-English bilingual speech corpus with 40 hours of recorded audio and 450,000 words

License: GPL-3.0

Locale: cym

Task: ASR

Format: MP3, CHA, TSV

Size: 2.13 GB

Mozilla

Test

Test

License: EUPL-1.2

Locale: Test

Task: LID

Format: Test

Size: 50.79 MB

Mozilla

swdad

awdad

License: Apache-2.0

Locale: dadawd

Task: NLP

Format: adwad

Size: 15.36 MB

Mozilla

awdadad

wadadwwadawd

License: Apache-2.0

Locale: en-US

Task: NLP

Format: wadada

Size: 249.04 MB

Mozilla Foundation

Govtube - Kuña Rembiasa

Audio consisting of 1 hour of Spanish and Guarani (approx. 10%) collected from the US Embassy in Paraguay on the subject of human trafficking.

License: CC0-1.0

Locale: es-PY, gn-PY

Task: ASR

Format: TSV, MP3

Size: 52.52 MB

Mozilla Foundation

A new test with the new flow v2

A basic description would go here

License: BSD-3-Clause

Locale: en-US

Task: MT

Format: Unknown

Size: 249.04 MB

Common Voice

versions

test description

License: BSD-3-Clause

Locale: mul

Task: ML

Format: MP3

Size: 15.36 MB

Rotimi very very very very very long org name

Wonderful new dataset

Short description of the dataset

License: BSD-3-Clause

Locale: en-US

Task: LID

Format: mp3

Size: 1.82 MB

My Cool Organization Changed Again

Long Other Information Description Dataset

Long Other Information Description Dataset

License: Apache-2.0

Locale: en

Task: NLP

Format: WAV

Size: 4.20 MB

Rotimi very very very very very long org name

Dataset with long & short desc

I added a short dec

License: CC-SA-1.0

Locale: en-US

Task: CV

Format: mp3

Size: 15.36 MB

Mozilla

dawdawd

adwada

License: CC-BY-ND-4.0

Locale: en-us

Task: NLP

Format: adwad

Size: 231.89 MB

Rotimi very very very very very long org name

Dataset with long desc

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Curabitur pretium tincidunt lacus. Nulla gravida orci a odio. Nullam varius, turpis et commodo pharetra, est eros bibendum elit.

License: CC-BY-SA-4.0

Locale: en-US

Task: ASR

Format: mp3

Size: 15.36 MB

IT'S EASY TO UPLOAD & CONTROL YOUR DATA

Upload your dataset

Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it. You can share openly, using existing licenses, or you can build your own.

JOIN THE MOVEMENT

Join Mozilla Data Collective

Mozilla Data Collective wants to radically reimagine our data as power. We are anti-extractivism, anti-monopoly and deeply, profoundly pro-people. We are a collective of linguists, technologists, activists, researchers and creatives who want AI to be all it promises to be - not all it threatens to be. Here, you can share your datasets on your own terms.

FAQs

Find answers quickly

What is Mozilla Data Collective?

Mozilla Data Collective is a platform in the truest sense. It’s yours to stand on, and make of it what you will. We have roots in two Mozilla projects: Common Voice, a CC0 public dataset to help tech speak your language, and the Data Futures Lab, an experimental space for instigating new approaches to data stewardship challenges. Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it.


How does it work?

We partner with organizations and individuals to make their data available through Mozilla Data Collective. You can share openly, using existing licenses like Creative Commons, or you can build your own. You can open up your data for everyone or just for some types of downloaders, set custom constraints, and ask for exchange, compensation or recognition. You can govern it as an individual, a co-operative, a trust or something else. After all, it’s your data. The people who access your datasets are authenticated and held to legally binding contracts, and we have a number of dataset protection features. If you are interested in hosting data on Mozilla Data Collective, please reach out to us at mozilladatacollective@mozillafoundation.org.


Who is behind Mozilla Data Collective?

We are backed and stewarded by Mozilla Foundation - the non-profit, movement-building, and philanthropy arm of Mozilla.