Garo-English Parallel Corpus

A curated set of ~2,500 Garo-English parallel sentence pairs released by MWire Labs to support low-resource translation and experimentation in Northeast Indian languages.

About Dataset

This dataset contains ~2,500 parallel sentence pairs between Garo (ISO 639-3: grt) and English, curated by MWire Labs to support early-stage research, prototyping, and experimentation in low-resource machine translation. Garo is a Tibeto-Burman language spoken primarily in Meghalaya and Assam, and it remains underrepresented in mainstream NLP resources. Each record uses four standardized fields (source, target, src_lang, tgt_lang). The corpus is released under the Creative Commons Attribution 4.0 license (CC BY 4.0), which permits open use with attribution. This release is intended to seed community engagement and tool development for Northeast Indian languages. It is suitable for demos, pipeline testing, and exploratory research; because of its limited size, it is not recommended for benchmarking or production-grade training.
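As a rough illustration of the record layout described above, the following Python sketch checks that each record carries the four documented fields. Only the field names (source, target, src_lang, tgt_lang) and the Garo code grt come from this page. The JSON Lines serialization, the function name parse_pairs, the English code "eng", the English-to-Garo direction, and the placeholder strings are assumptions made for the example, not confirmed details of the released files.

```python
import json

# Field names as listed in the dataset description.
REQUIRED_FIELDS = ("source", "target", "src_lang", "tgt_lang")


def parse_pairs(lines):
    """Parse JSON Lines text into validated sentence-pair records.

    Assumes one JSON object per line (a hypothetical serialization;
    the actual release format may differ).
    """
    pairs = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        # Keep only the four documented fields.
        pairs.append({f: record[f] for f in REQUIRED_FIELDS})
    return pairs


# Placeholder record: the sentence text and the "eng" code are illustrative only.
sample_lines = [
    '{"source": "<English sentence>", "target": "<Garo translation>", '
    '"src_lang": "eng", "tgt_lang": "grt"}',
]

pairs = parse_pairs(sample_lines)
print(f"{len(pairs)} valid pair(s); fields: {sorted(pairs[0])}")
```

A check of this kind fits the pipeline-testing use the description mentions: it fails early on malformed records before they reach a tokenizer or training loop.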

Purpose of Dataset

This dataset provides ~2,500 parallel sentence pairs between Garo and English, curated to support low-resource NLP research and tool development for Northeast Indian languages. Garo is a Tibeto-Burman language spoken primarily in Meghalaya and Assam, and remains underrepresented in mainstream AI efforts. The corpus is designed for experimentation, prototyping, and community engagement, enabling researchers and developers to build translation models, lexicons, and evaluation pipelines for Garo. It contributes to inclusive language technology and aligns with national goals of linguistic diversity in AI. MWire Labs aims to seed civic-first NLP infrastructure and amplify visibility for Northeast Indian languages within broader AI ecosystems. The dataset supports policy goals around digital inclusion, regional language preservation, and equitable AI development.

Dataset Metadata

License

Attribution 4.0 International (CC BY 4.0)

Geographical coverage

Meghalaya

Sector

Social

Author

MWirelabs

Source Organisation

MWire Labs

Uploaded by

Badal Nyalang

Dataset type

Structured

Frequency

Static

Time Granularity

Static

Year range

N.A.

Date & Time

19/11/25 11:59:55

Visibility

Open

Hosted / Redirected

Redirected

Data Type

Primary

If redirected, which source

Hugging Face

Data Collection Method

Synthetic English sentences were generated using templated prompts and curated domain-specific sources. These were then translated into Garo by paid native speakers from Meghalaya, ensuring linguistic authenticity and cultural relevance. Translations were manually reviewed for quality and consistency. The process prioritized semantic fidelity over literal alignment, with translators encouraged to use natural Garo phrasing. No machine translation was used.