This 150M-parameter non-autoregressive speech generative model, developed by BharatGen, is designed for speaker-conditioned text-to-speech in Telugu.
Our speaker-conditioned text-to-speech model is a non-autoregressive speech generative model designed for Indian languages, consisting of two key components: an audio model and an enhanced duration predictor. Together, these components comprise approximately 150 million parameters. The audio model is based on continuous normalizing flows (CNFs) and transforms a simple distribution into a complex conditional audio distribution, p(missing audio | speaker audio, text), using a neural network trained with flow matching via vector-field regression.

To better handle the prosodic richness of Indian languages, we extend the standard duration predictor architecture. Unlike Voicebox, which conditions only on text and durations, our duration predictor also incorporates a 3-second speaker prompt along with the text. This enables it to extract speaker-specific prosodic cues from the reference audio, resulting in more accurate and natural duration estimates.

The model is trained from scratch on publicly available Indian language datasets and is optimized for speech infilling tasks such as continuous sentence completion and cross-sentence completion. Architectural modifications were made throughout to adapt the system to the diverse phonetic, rhythmic, and intonational patterns of Indian languages. For any queries, please visit https://bharatgen.discourse.group/invites/BcouFsKk4g
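As a rough illustration of the flow-matching objective described above, the sketch below regresses a network onto the target vector field of a straight (optimal-transport) probability path from Gaussian noise to mel-spectrogram frames, conditioned on speaker-prompt and text features. The `VectorFieldNet` module, the feature dimensions, and the conditioning layout are hypothetical placeholders for illustration only, not the actual BharatGen architecture.

```python
import torch
import torch.nn as nn


class VectorFieldNet(nn.Module):
    """Hypothetical stand-in for the audio model: predicts the vector field
    v(x_t, t | speaker/text conditioning) over mel-spectrogram frames.
    The real model is a much larger network; this is illustrative only."""

    def __init__(self, mel_dim: int = 80, cond_dim: int = 512, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x_t: (B, T, mel_dim), t: (B,), cond: (B, T, cond_dim)
        t_frames = t[:, None, None].expand(-1, x_t.size(1), 1)  # broadcast flow time to every frame
        return self.net(torch.cat([x_t, cond, t_frames], dim=-1))


def flow_matching_loss(model: nn.Module, x1: torch.Tensor, cond: torch.Tensor,
                       sigma_min: float = 1e-4) -> torch.Tensor:
    """Conditional flow matching: sample a point on the straight path from
    noise x0 to data x1 and regress the model output onto the path's
    target vector field u_t = x1 - (1 - sigma_min) * x0."""
    x0 = torch.randn_like(x1)                      # sample from the simple source distribution
    t = torch.rand(x1.size(0), device=x1.device)   # flow time t ~ U(0, 1)
    t_ = t[:, None, None]
    x_t = (1 - (1 - sigma_min) * t_) * x0 + t_ * x1
    target = x1 - (1 - sigma_min) * x0
    pred = model(x_t, t, cond)
    return ((pred - target) ** 2).mean()


# Toy usage with random tensors standing in for real features.
model = VectorFieldNet()
mels = torch.randn(4, 200, 80)     # target mel frames (the "missing audio")
cond = torch.randn(4, 200, 512)    # frame-aligned speaker-prompt + text conditioning
loss = flow_matching_loss(model, mels, cond)
loss.backward()
```

At inference, the learned vector field would be integrated with an ODE solver from noise to the generated frames. Under the same set of assumptions, the enhanced duration predictor can be pictured analogously, with an embedding of the 3-second speaker prompt concatenated to the text features so that predicted durations reflect the reference speaker's rhythm.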
MIT
BharatGen
Text-to-Speech Model
PyTorch
Restricted
Sector Agnostic
29/04/25 06:59:38
N.A.
3.38 GB
MIT