ORGANISATION

IISc SYSPIN_S1.0 Corpus

SYSPIN_S1.0 is a large TTS corpus for nine Indian languages, including low-resourced languages such as Bhojpuri, Chhattisgarhi, and Magahi. The corpus includes more than 47 hours of single-speaker speech per voice artist (one male and one female per language) in each of the nine languages: Hindi, Bengali, Marathi, Telugu, Bhojpuri, Kannada, Magahi, Chhattisgarhi, and Maithili. Unlike existing TTS corpora in these languages, this corpus is unique in its duration per speaker and in the variety of domains covered when preparing the text. Validated audio and text files are made available to the public. This potentially opens up opportunities for academic researchers, students, small and large-scale industries, and research labs to innovate and develop algorithms and text-to-speech synthesizers in all the Indian languages included in the SYSPIN project.

About Dataset

SYSPIN_S1.0 is the TTS corpus built as part of the SYSPIN project at SPIRE lab, Indian Institute of Science (IISc) Bangalore, India. It is currently the largest TTS corpus, comprising more than 47 hours of single-speaker speech per voice artist (one male and one female per language) in each of nine Indian languages: Hindi, Bengali, Marathi, Telugu, Bhojpuri, Kannada, Magahi, Chhattisgarhi, and Maithili. Unlike existing TTS corpora in these languages, this corpus is unique in its duration per speaker and in the variety of domains covered when preparing the text. The major domains considered are agriculture, finance, education, food, politics, social, Indic, local, health-care, technology, book continuous, sports, books, and websites. The books and websites domains include sentences mined from available printed textbooks as well as online sources.

This corpus allows research and development in TTS, including multi-lingual learning on studio-quality audio from multiple speakers and different scripts for various languages, with part of the corpus containing parallel sentences for the speakers of a language. The SYSPIN dataset, along with baseline TTS models, is now available for download, ready to support voice technology innovations in industries such as agriculture, healthcare, education, and finance. As part of our mission to advance multilingual, multi-speaker TTS systems, we organized three challenges under SYSPIN: LIMMITS 23, LIMMITS 24, and LIMMITS 25. SYSPIN is more than just data: it is a foundation for inclusive and accessible voice technologies, shaping the future of digital communication in India.

The corpus contains two categories of data: Human Checked (HC) and Non Human Checked (NHC). Here is a summary of the duration of data included in the corpus.
Data Size (HH:MM:SS)

Language       Speaker  HC        NHC
Bhojpuri       Male     47:59:01  11:38:58
Bhojpuri       Female   49:03:22  11:59:11
Bengali        Male     54:29:52  06:17:19
Bengali        Female   50:44:39  09:29:27
Chhattisgarhi  Male     49:45:25  10:58:15
Chhattisgarhi  Female   54:48:50  05:47:18
Hindi          Male     54:57:49  04:07:49
Hindi          Female   54:54:44  05:17:11
Kannada        Male     49:19:17  09:14:45
Kannada        Female   52:21:46  09:10:33
Magahi         Male     54:39:48  06:06:36
Magahi         Female   51:25:22  10:07:30
Maithili       Male     55:50:19  03:27:47
Maithili       Female   59:40:32  00:54:09
Marathi        Male     48:38:20  13:08:25
Marathi        Female   51:03:32  07:53:44
Telugu         Male     48:20:26  08:19:31
Telugu         Female   57:21:09  01:08:31
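
The total audio available for a speaker is the sum of that speaker's HC and NHC durations. The following is a minimal Python sketch of that arithmetic, assuming the durations are transcribed as HH:MM:SS strings. The helper names (parse_duration, format_duration, speaker_total) and the DURATIONS dictionary are illustrative only and are not part of the SYSPIN release; only three rows of the summary are copied here.

```python
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse an 'HH:MM:SS' string (hours may exceed 24) into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_duration(delta: timedelta) -> str:
    """Format a timedelta back to 'HH:MM:SS' without rolling hours into days."""
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# (language, speaker) -> (HC duration, NHC duration).
# Illustrative subset copied by hand from the duration summary.
DURATIONS = {
    ("Bhojpuri", "Male"): ("47:59:01", "11:38:58"),
    ("Hindi", "Male"): ("54:57:49", "04:07:49"),
    ("Maithili", "Female"): ("59:40:32", "00:54:09"),
}


def speaker_total(language: str, speaker: str) -> timedelta:
    """Return HC + NHC duration for one speaker."""
    hc, nhc = DURATIONS[(language, speaker)]
    return parse_duration(hc) + parse_duration(nhc)


if __name__ == "__main__":
    for (language, speaker) in DURATIONS:
        total = speaker_total(language, speaker)
        print(f"{language} {speaker}: {format_duration(total)}")
```

Keeping durations as timedelta values rather than strings makes it straightforward to extend the same approach to per-language or whole-corpus totals once the remaining rows are added.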