Text-to-speech synthesis model tailored to match a given speaker's voice sample.
We present a speaker-adaptive text-to-speech (TTS) system for generating high-quality, natural speech across multiple low-resource Indian languages: Bengali, Gujarati, Hindi, Malayalam, Marathi, Punjabi, Tamil, and Telugu. Built on a diffusion-based framework with approximately 150 million parameters, the model integrates a speaker encoder and classifier-free guidance to capture speaker-specific characteristics, enabling zero-shot adaptation for both seen and unseen speakers. The core architecture extends Grad-TTS, replacing discrete speaker tags with embeddings derived from a 10-second reference audio sample; these embeddings condition the denoising diffusion probabilistic model (DDPM) decoder for multi-speaker synthesis. To enhance prosody, we introduce an attention-based duration predictor that attends over a reference mel spectrogram alongside the text embeddings, extracting speaker-dependent prosodic features and improving the naturalness of speech timing.
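The conditioning pipeline described above can be sketched as follows. This is a minimal, framework-agnostic NumPy illustration, not the released implementation: the speaker encoder is stood in for by mean-pooling plus a projection, and `score_net` is a placeholder for the trained DDPM decoder's noise estimate. All names, shapes, and the guidance scale `w` are illustrative assumptions; only the classifier-free guidance combination (unconditional estimate pushed toward the conditional one) reflects the stated design.

```python
import numpy as np

# Illustrative shapes (assumptions, not the model's real configuration):
# 80 mel bins, ~10 s reference audio, 200 output frames, 128-d speaker embedding.
N_MEL, T_REF, T_OUT, D_SPK = 80, 400, 200, 128

rng = np.random.default_rng(0)

def speaker_embedding(ref_mel, proj):
    """Toy stand-in for the speaker encoder: mean-pool the reference mel
    over time, project, and L2-normalize. The real encoder is learned."""
    pooled = ref_mel.mean(axis=1)            # (N_MEL,)
    e = proj @ pooled                        # (D_SPK,)
    return e / np.linalg.norm(e)

def score_net(x_t, t, spk=None):
    """Placeholder for the DDPM decoder's noise estimate eps_theta(x_t, t, spk).
    Passing spk=None corresponds to the unconditional branch."""
    cond = 0.0 if spk is None else spk.mean()
    return 0.1 * x_t + cond                  # not a trained network

def guided_eps(x_t, t, spk, w=3.0):
    """Classifier-free guidance: extrapolate from the unconditional estimate
    toward the speaker-conditional one by guidance scale w."""
    eps_c = score_net(x_t, t, spk)           # conditioned on speaker embedding
    eps_u = score_net(x_t, t, None)          # unconditional
    return eps_u + w * (eps_c - eps_u)

# One guided denoising evaluation at an arbitrary diffusion time t.
ref_mel = rng.standard_normal((N_MEL, T_REF))
proj = rng.standard_normal((D_SPK, N_MEL))
spk = speaker_embedding(ref_mel, proj)
x_t = rng.standard_normal((N_MEL, T_OUT))    # noisy mel at step t
eps = guided_eps(x_t, t=0.5, spk=spk)
print(eps.shape)                             # (80, 200)
```

With `w = 1` the guided estimate reduces to the plain conditional one; `w > 1` strengthens speaker conditioning at sampling time, which is the usual trade-off classifier-free guidance exposes.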
For any queries, please visit https://bharatgen.discourse.group/invites/BcouFsKk4g
MIT
Ayush Singh Bhadoriya, Abhishek Nikunj Shinde, Pranav Gaikwad, Prof. Ganesh Ramakrishnan
Text-to-Speech Model
PyTorch
Open
Sector Agnostic
22/05/25 11:31:41
1.62 GB