
SpeeD-TB - Meitei

A speech and transcription dataset for the Meitei (Manipuri) language containing raw and transcribed audio collected through multiple elicitation and recording methods to support technology, NLP research and language preservation.

About Dataset

This dataset is a collection of raw and transcribed speech data for the Meitei language, curated under the SpeeD-TB project for Tibeto-Burman Indian languages. The data were collected from approximately 80-100 speakers, mostly aged 20-50 years, and come mainly from the education, agriculture, and science and technology domains. The dataset contains approximately 5.5 hours of audio recordings with corresponding transcriptions. It contributes to digital language preservation and computational resource development for Meitei, an important Tibeto-Burman language spoken in India. This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative led by CivicDataLab (CDL), in partnership with the Gates Foundation and in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).

Purpose of Dataset

The purpose of this dataset is to support the development of speech technologies and natural language processing tools for the Meitei language. It can be used for automatic speech recognition, speech-to-text modeling, multilingual AI systems, linguistic analysis, language modeling, transcription research, low-resource language technology development, and preservation of the linguistic and cultural heritage associated with the Meitei language.

Dataset Metadata

License

Attribution 4.0 International (CC BY 4.0)

Geographical coverage

India

Sector

Science, Technology and Research

Author

Ritesh Kumar, Siddharth Singh, SpeeD-TB Project

Source Organisation

Digital India BHASHINI Division

Uploaded by

Nikil Augustine

Data Quality Score (Beta)

4.25

Dataset type

Structured

Frequency

Static

Time Granularity

Year range

01/04/2022 - 31/03/2024

Date & Time

02/06/26 08:45:16

Visibility

Open

Primary Key / Indicator

Hosted / Redirected

Redirected

Data Type

Hybrid

If Redirection which source

https://github.com/speed-tb/meitei

Data Collection Method

Speech samples in the Meitei language were collected and curated through multiple elicitation and recording methods, including field interviews, narrations, translations, lectures, questionnaires, and app-supported linguistic data collection workflows. The collected audio recordings were transcribed using Praat software and organized into structured speech-text resources to support supervised machine learning.

Activity Overview

0
5
2.21 GB
40

License Control

Attribution 4.0 International (CC BY 4.0)

SpeeDTBMeitei (1 file, 1 directory)

audio

1512 files

transcription.json

823.59 KB
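
The layout listed above (an audio directory plus a single transcription.json) can be checked after download with a short script. The following is a minimal sketch, not an official loader: the function name `inventory` and the local folder name passed to it are assumptions for illustration, and because this page does not describe the audio format or the JSON schema, the code reports only file counts by extension and the top-level shape of the JSON.

```python
import json
from collections import Counter
from pathlib import Path


def inventory(root):
    """Summarise a local copy of the dataset without assuming a JSON schema.

    `root` is the folder that contains `audio/` and `transcription.json`
    (hypothetical local path; adjust to wherever the data was downloaded).
    """
    root = Path(root)
    audio_dir = root / "audio"

    # Count files in audio/ by extension; the audio format is not stated
    # on the dataset page, so nothing is assumed about it here.
    extensions = Counter(
        p.suffix.lower() for p in audio_dir.iterdir() if p.is_file()
    )

    with open(root / "transcription.json", encoding="utf-8") as fh:
        transcriptions = json.load(fh)

    # Report only the top-level container type and its length, since the
    # internal structure of the file is not documented on this page.
    is_container = isinstance(transcriptions, (list, dict))
    return {
        "audio_files": sum(extensions.values()),
        "audio_extensions": dict(extensions),
        "transcription_type": type(transcriptions).__name__,
        "transcription_entries": len(transcriptions) if is_container else None,
    }
```

Calling `inventory("SpeeDTBMeitei")` on a complete download should report 1512 audio files if the copy matches the listing above. Comparing that count with the number of transcription entries is a quick way to spot a partial download before any training or analysis.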


Version Control

Version 1 (2.21 GB)

Nikil Augustine · 1 month(s) ago
- SpeeDTBMeitei
  audio
  transcription.json

Related Datasets

Updated 1 month(s) ago

SpeeD-TB - Kokborok

A speech and transcription dataset for the Kokborok language containing raw and transcribed audio collected through multiple elicitation and recording methods to support technology, NLP research and language preservation.

Multilingual AI

Natural language processing

Language preservation

Language modeling

Kokborok language

Speech dataset

Automatic speech recognition

Speech-to-text

Low-resource languages

Transcribed speech

Audio corpus

Tibeto-Burman languages

Speech technology

Indic languages

Computational linguistics

0
10
2.12 GB
53

DIGITAL INDIA BHASHINI DIVISION


Updated 2 month(s) ago

SpeeD-TB-Toto

A 200-hour transcribed speech dataset of Toto, an endangered Tibeto-Burman language spoken in Totopara village of West Bengal, developed under the SpeeD-TB Project.

speech

Automatic Speech Recognition

ASR

audio

txo

MaTra Lab

Atekho

LiFE app

DIGITAL INDIA BHASHINI DIVISION

