SpeeD-TB - Meitei

A speech and transcription dataset for the Meitei (Manipuri) language containing raw and transcribed audio collected through multiple elicitation and recording methods to support technology, NLP research and language preservation.

About Dataset

This dataset is a collection of raw and transcribed speech data for the Meitei language, curated under the SpeeD-TB project for Tibeto-Burman Indian languages. The data was collected from approximately 80 to 100 speakers, most of them aged 20 to 50. It is drawn mostly from the education, agriculture, and science and technology domains, and contains approximately 5.5 hours of audio recordings with corresponding transcriptions. The dataset contributes to digital language preservation and computational resource development for Meitei, an important Tibeto-Burman language spoken in India. This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative led by CivicDataLab (CDL), partnering with the Gates Foundation in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).

Purpose of Dataset

The purpose of this dataset is to support the development of speech technologies and natural language processing tools for the Meitei language. It can be used for automatic speech recognition, speech-to-text modeling, multilingual AI systems, linguistic analysis, language modeling, transcription research, low-resource language technology development, and preservation of the linguistic and cultural heritage associated with the Meitei language.

Activity Overview

  • Downloads0
  • Downloads 1
  • File Size 2.21 GB
  • Views 12

Tags

  • Manipuri
  • Meitei
  • Manipur
  • Northeast India Languages
  • Low Resource NLP
  • Multilingual NLP
  • Indic NLP
  • Multilingual Data
  • Speech dataset
  • Automatic speech recognition
  • Speech-to-text
  • Low-resource languages
  • Transcribed speech
  • Audio corpus
  • Tibeto-Burman languages
  • Speech technology
  • Indic languages
  • Computational linguistics

License Control

Attribution 4.0 International (CC BY 4.0)

SpeeDTBMeitei (1 file, 1 directory)


  • audio (directory, 1512 files)
  • transcription.json (application/json, 823.59 KB)
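The layout listed above (an audio directory plus a single transcription.json) can be sanity-checked after download with a short script. This is a minimal sketch, not an official loader: the function name summarize_dataset and the local folder path are hypothetical, and because the internal schema of transcription.json is not documented on this page, the code only reports its top-level type and entry count rather than interpreting individual records.

```python
import json
from pathlib import Path


def summarize_dataset(root):
    """Return basic counts for a folder laid out like SpeeDTBMeitei.

    Assumes only what the listing shows: an `audio` directory and a
    `transcription.json` file side by side. The JSON schema is unknown,
    so entries are counted but never interpreted.
    """
    root = Path(root)

    # Count regular files directly inside the audio directory.
    audio_files = sorted(p for p in (root / "audio").iterdir() if p.is_file())

    # Parse the transcription file as UTF-8 JSON.
    with open(root / "transcription.json", encoding="utf-8") as fh:
        transcriptions = json.load(fh)

    # Only lists and dicts have a meaningful entry count.
    if isinstance(transcriptions, (list, dict)):
        entry_count = len(transcriptions)
    else:
        entry_count = None

    return {
        "audio_file_count": len(audio_files),
        "transcription_type": type(transcriptions).__name__,
        "transcription_entries": entry_count,
    }
```

Calling summarize_dataset("SpeeDTBMeitei") on an unpacked copy should report an audio_file_count of 1512 if the download matches the listing; a mismatch between that count and transcription_entries would be worth investigating before training or evaluation.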

Version Control

Version 1 (2.21 GB)
  • Nikil Augustine · 1 day(s) ago
    • SpeeDTBMeitei (folder)
      • audio (folder)
      • transcription.json (application/json)

Related Datasets

Updated 1 day(s) ago
SpeeD-TB - Kokborok
A speech and transcription dataset for the Kokborok language containing raw and transcribed audio collected through multiple elicitation and recording methods to support technology, NLP research and language preservation.
Multilingual AI
Natural language processing
Language preservation
Language modeling
Kokborok language
Speech dataset
Automatic speech recognition
Speech-to-text
Low-resource languages
Transcribed speech
Audio corpus
Tibeto-Burman languages
Speech technology
Indic languages
Computational linguistics
  • Upvoters 0
  • Downloads 1
  • File Size 2.12 GB
  • Views 20

DIGITAL INDIA BHASHINI DIVISION

Updated 1 month(s) ago
SpeeD-TB-Toto
200 hours of transcribed speech dataset of Toto, an endangered Tibeto-Burman language spoken in Totopara village of West Bengal, developed under the SpeeD-TB Project.
speech
Automatic Speech Recognition
ASR
audio
txo
MaTra Lab
Atekho
LiFE app
  • Upvoters 0
  • Downloads 10
  • File Size 0
  • Views 43

DIGITAL INDIA BHASHINI DIVISION