SYSPIN Telugu Male TTS Dataset

Description

As part of the SYSPIN project, we are releasing 40 hours of studio-recorded Telugu male TTS data. Validated audio and text files are publicly available. This opens up opportunities for academic researchers, students, small and large-scale industries, and research labs to innovate and develop algorithms and text-to-speech synthesizers in all nine Indian languages covered by the project. It should also encourage competition among research groups to develop high-quality synthesized speech that improves voice-based services. Open voice data is the foundation on which local AI innovators can build applications geared towards the specific capabilities and requirements of users in India.

Specifics

Licensing

Creative Commons Attribution 4.0 International (CC-BY-4.0)

https://spdx.org/licenses/CC-BY-4.0.html

Considerations

SYSPIN Telugu Male TTS Dataset — Technical Datasheet

Version: S1.0
Released by: SPIRE Lab, Indian Institute of Science (IISc), Bengaluru, India
Dataset URL: https://syspin.iisc.ac.in/datasets/telugu%20male%20tts%20data
Download Portal: https://spiredatasets.ee.iisc.ac.in/syspincorpus
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Citation: Abhayjeet et al., "SYSPIN_S1.0 Corpus — A TTS Corpus of 900+ hours in nine Indian Languages", 2025

Overview

The SYSPIN Telugu Male TTS Dataset is a single-speaker, studio-recorded, read-speech corpus for Text-to-Speech (TTS) synthesis in Telugu, recorded by a professional male voice artist. It is one of eighteen speaker-language datasets produced under the SYSPIN (SYnthesizing SPeech in INdian languages) initiative, a large-scale, open-source effort by SPIRE Lab at IISc Bengaluru to build high-quality TTS corpora across nine Indian languages: Bengali, Bhojpuri, Chhattisgarhi, Hindi, Kannada, Magahi, Maithili, Marathi, and Telugu.

Project Context

Field	Detail
Project name	SYSPIN (SYnthesizing SPeech in INdian languages)
Research lab	SPIRE Lab, Dept. of Electrical Engineering, IISc Bengaluru
Project lead	Prof. Prasanta Kumar Ghosh, IISc Bengaluru
Funding	German Development Cooperation (GIZ) / ARTPARK
Sister project	RESPIN (ASR corpus, same language set)
Total corpus size	920 hours across 9 languages, 18 speakers, 462,311 sentences

Dataset Summary

Property	Value
Language	Telugu (ISO 639-1: `te`)
Script	Telugu script (Unicode block U+0C00–U+0C7F)
Speaker gender	Male
Number of speakers	1 (single speaker)
Speaker type	Professional native voice artist
Duration	~40 hours
Domains	agriculture, books, education, finance, general, health, others, politics, running text from book, weather
Data type	Read speech (scripted utterances)
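The script range listed in the table can serve as a quick sanity check on transcripts. The following is a minimal sketch, not part of the official release: the function name `telugu_ratio` is hypothetical, and the choice to count only letters and combining marks (so that spaces, digits, and punctuation are ignored) is an assumption made for illustration.

```python
import unicodedata

# Telugu Unicode block as listed in the summary table: U+0C00 to U+0C7F.
TELUGU_START, TELUGU_END = 0x0C00, 0x0C7F


def telugu_ratio(text: str) -> float:
    """Return the fraction of letters and combining marks inside the Telugu block.

    Spaces, digits, and punctuation are ignored (an illustrative choice),
    so a transcript with quotation marks can still score 1.0.
    Returns 0.0 when the text contains no letters or marks at all.
    """
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    in_block = sum(TELUGU_START <= ord(ch) <= TELUGU_END for ch in relevant)
    return in_block / len(relevant)
```

A low ratio does not necessarily indicate an error, since it may simply flag utterances that contain Latin-script words.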

Audio Specifications

Property	Value
File format	WAV (PCM)
Sample rate	48,000 Hz
Bit depth	24-bit
Channels	Mono
Encoding	Linear PCM
Background noise	Absent (studio-controlled environment)
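The audio properties above can be checked against a downloaded file using only the Python standard library. This is a hedged sketch: the helper names (`read_wav_info`, `matches_spec`, `total_hours`) are hypothetical, and it assumes the files are plain PCM WAV that the `wave` module can open and that the audio sits in a flat directory of `*.wav` files.

```python
import wave
from pathlib import Path

# Values taken from the Audio Specifications table (24-bit = 3 bytes per sample).
EXPECTED = {"sample_rate": 48_000, "sample_width_bytes": 3, "channels": 1}


def read_wav_info(path):
    """Read header properties and duration (seconds) of one WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        return {
            "sample_rate": rate,
            "sample_width_bytes": wf.getsampwidth(),
            "channels": wf.getnchannels(),
            "duration_s": wf.getnframes() / rate,
        }


def matches_spec(info):
    """True if the header values agree with the documented specification."""
    return all(info[key] == value for key, value in EXPECTED.items())


def total_hours(wav_dir):
    """Sum the duration of every *.wav directly inside wav_dir, in hours.

    Assumes a flat directory layout; adjust the glob if the layout differs.
    """
    seconds = sum(read_wav_info(p)["duration_s"] for p in sorted(Path(wav_dir).glob("*.wav")))
    return seconds / 3600
```

Running `total_hours` on the audio directory gives a rough way to compare the download against the stated duration of about 40 hours.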

Corpus Structure

telugu_male_tts/
├── wav/                    # Audio files (WAV, 48 kHz, 24-bit, mono)
│   ├── ....wav
│   └── ...
└── ..._Transcripts.json    # Transcripts (JSON)

The ..._Transcripts.json file contains a "Transcripts" key that maps each utterance ID to its transcript and domain, as in the following excerpt:

    "Transcripts": {
        "IISc_SYSPINProject_te_m_GENE_02044": {
            "Transcript": "ఇంపెడెన్స్ విశ్లేషణ ఉపయోగించి బయోమెట్రిక్ సమాచారాన్ని కొలిచే స్మార్ట్ స్కేల్స్ అడుగుల ద్వారా తేలికపాటి విద్యుత్ ప్రేరణలను పంపుతాయి. ",
            "Domain": "GENERAL"
        },
        "IISc_SYSPINProject_te_m_EDUC_02060": {
            "Transcript": "\"అనుష్కతో కలిసి మరీన్ పరేడ్ బీచ్లో బెంచ్పై కూర్చొని ఆస్వాదించిన క్షణాలు జీవితాంతం గుర్తుండి పోతాయి\" అని బిసిసిఐ టివికి ఇచ్చిన ఇంటర్వ్యూలో కోహ్లి తెలిపాడు.",
            "Domain": "EDUCATION"
        },
        ...
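The excerpt above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the pairing step assumes that each audio file is named `<utterance ID>.wav`, which the excerpt suggests but the datasheet does not state explicitly.

```python
import json
from collections import Counter
from pathlib import Path


def load_transcripts(json_path):
    """Return the mapping stored under the "Transcripts" key."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return data["Transcripts"]


def domain_counts(transcripts):
    """Count utterances per value of the "Domain" field."""
    return Counter(entry["Domain"] for entry in transcripts.values())


def pair_with_audio(transcripts, wav_dir):
    """Pair each transcript with its audio file, skipping missing files.

    ASSUMPTION: audio files are named "<utterance ID>.wav"; this naming
    is not confirmed by the datasheet and may need adjusting.
    """
    wav_dir = Path(wav_dir)
    pairs = []
    for utt_id, entry in transcripts.items():
        wav_path = wav_dir / f"{utt_id}.wav"
        if wav_path.exists():
            # Some transcripts carry trailing whitespace, so strip it.
            pairs.append((wav_path, entry["Transcript"].strip()))
    return pairs
```

The resulting list of `(path, text)` pairs is a common starting point for building a TTS training manifest, and `domain_counts` shows how utterances are distributed across the listed domains.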

License & Attribution

This dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Users are free to share and adapt the material for any purpose, including commercial use, provided appropriate credit is given.

Required Citation

Abhayjeet et al., "SYSPIN_S1.0 Corpus — A TTS Corpus of 900+ hours in nine Indian Languages", 2025.