Kashmiri TTS Single Speaker Dataset

This dataset has been primarily developed to facilitate the creation of text-to-speech systems for the Kashmiri language, a digitally underrepresented language predominantly spoken in the Jammu and Kashmir region of India.

About Dataset

The textual data was sourced from various publicly accessible sources and from individual contributors, including scholars and students. The text was filtered using an enrichment algorithm to ensure quality and relevance. A young male speaker, approximately 25 years of age, was selected to record the speech at a sample rate of 48,000 Hz, ensuring high-quality audio suitable for detailed phonetic analysis and machine learning applications. A web application was developed to streamline the process of recording, reviewing, saving, deleting, and flagging inaccurate text entries.

The dataset comprises two folders: 'Recordings' and 'Text Files'. The 'Recordings' folder holds 2984 audio recordings, each saved as a separate WAV file. The 'Text Files' folder contains a single file, 'textcorpus.csv', with two columns: 'id' and 'sentence'. The 'id' column links each entry in the 'sentence' column to its corresponding WAV file in the 'Recordings' folder. Each WAV file is named after the 'id' of its sentence in 'textcorpus.csv', which keeps the file organization systematic and consistent.

Citation: Shafi, Kh Mohmad; Bhat, Asif Ali; Imtiyaz, Kamran; Iqbal, Javaid (2024), “KTTS Single Speaker Dataset”, Mendeley Data, V2, doi: 10.17632/5c4dcvxdmb.2

This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative led by CivicDataLab (CDL), partnering with the Gates Foundation in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).
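The folder layout described above (a 'Recordings' folder of WAV files and a 'Text Files' folder holding 'textcorpus.csv' with 'id' and 'sentence' columns) can be read into audio/text pairs along the following lines. This is only a minimal sketch, not tooling shipped with the dataset. It assumes that the CSV has a header row, that it is UTF-8 encoded, and that each recording is named `<id>.wav`. The function names `load_pairs` and `sample_rate` are hypothetical.

```python
import csv
import wave
from pathlib import Path


def load_pairs(root, csv_name="textcorpus.csv"):
    """Return ((wav_path, sentence) pairs, ids that have no WAV file)."""
    root = Path(root)
    csv_path = root / "Text Files" / csv_name
    rec_dir = root / "Recordings"
    pairs, missing = [], []
    # Assumption: UTF-8 text; "utf-8-sig" also tolerates a byte-order mark.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        # Assumption: the first row is a header naming 'id' and 'sentence'.
        for row in csv.DictReader(f):
            sid = row["id"].strip()
            wav_path = rec_dir / f"{sid}.wav"  # assumed naming: <id>.wav
            if wav_path.is_file():
                pairs.append((wav_path, row["sentence"]))
            else:
                missing.append(sid)
    return pairs, missing


def sample_rate(wav_path):
    """Read the sample rate from a WAV header.

    The standard-library wave module handles PCM WAV only; other
    encodings raise wave.Error.
    """
    with wave.open(str(wav_path), "rb") as w:
        return w.getframerate()
```

Calling `load_pairs("path/to/dataset")` (a placeholder path) returns the matched pairs together with any ids that lack a recording. Comparing `sample_rate(path)` against 48000 for each pair is a quick sanity check before training.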

Purpose of Dataset

The purpose of this dataset is to support research and development in Kashmiri language processing and low-resource language technologies. The dataset can be used to develop and train machine learning models for various natural language processing tasks, including language modeling, sentiment analysis, text classification, and language understanding. It also supports the creation of language learning resources and educational materials for Kashmiri language learners. Additionally, the dataset enables linguistic analysis of Kashmiri syntax, semantics, and vocabulary patterns, helping researchers identify language trends and structures for NLP applications. The dataset can further support the development of chatbots, virtual assistants, and multilingual AI systems capable of understanding and processing Kashmiri language input.

Dataset Metadata

License

Attribution 3.0 Unported (CC BY 3.0)

Geographical coverage

India

Sector

Science, Technology and Research

Author

Kh Mohmad Shafi, Asif Ali Bhat, Kamran Imtiyaz, Javaid Iqbal

Source Organisation

Digital India BHASHINI Division

Uploaded by

Nikil Augustine

Data Quality Score (Beta)

2.75

Dataset type

Structured

Frequency

Time Granularity

Static

Year range

N.A.

Date & Time

07/05/26 07:56:19

Visibility

Open

Primary Key / Indicator

Hosted / Redirected

Hosted

Data Type

Hybrid

Data Collection Method

The textual data was sourced from various publicly accessible sources and from individual contributors, including scholars and students. The text was filtered using an enrichment algorithm to ensure quality and relevance. A young male speaker, approximately 25 years of age, was selected to record the speech. A web application was developed to streamline the process of recording, reviewing, saving, deleting, and flagging inaccurate text entries.