
A curated set of ~2,500 Garo-English parallel sentence pairs released by MWire Labs to support low-resource translation and experimentation in Northeast Indian languages.
This dataset contains ~2,500 parallel sentence pairs between Garo (ISO 639-3: grt) and English, curated by MWire Labs to support early-stage research, prototyping, and experimentation in low-resource machine translation. Garo is a Tibeto-Burman language spoken primarily in Meghalaya and Assam, and it remains underrepresented in mainstream NLP resources. Each record uses four standardized fields (source, target, src_lang, tgt_lang), and the corpus is released under the Creative Commons Attribution 4.0 license (CC BY 4.0), which permits open use with attribution. This release is intended to seed community engagement and tool development for Northeast Indian languages. It is suitable for demos, pipeline testing, and exploratory research, but because of its limited size it is not recommended for benchmarking or production-grade training.
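As a rough illustration of how the four fields might be consumed in a pipeline test, the sketch below filters records that carry non-empty source, target, src_lang, and tgt_lang values. The on-disk format is not stated in this listing, so JSON Lines is an assumption here, as is the "eng" code for English; the helper name `iter_valid_pairs` and the placeholder sentences are hypothetical and are not taken from the dataset.

```python
import json
from typing import Iterable, Iterator

# Field names as listed in the dataset description.
REQUIRED_FIELDS = ("source", "target", "src_lang", "tgt_lang")


def iter_valid_pairs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield records whose four required fields are non-empty strings.

    Assumes one JSON object per line (JSON Lines); the actual release
    format may differ.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if all(
            isinstance(record.get(field), str) and record[field].strip()
            for field in REQUIRED_FIELDS
        ):
            yield record


# Hypothetical placeholder records, not real corpus content.
sample = [
    json.dumps({"source": "<Garo sentence>", "target": "<English sentence>",
                "src_lang": "grt", "tgt_lang": "eng"}),
    json.dumps({"source": "", "target": "record with empty source",
                "src_lang": "grt", "tgt_lang": "eng"}),
    "",
]

pairs = list(iter_valid_pairs(sample))
print(len(pairs))  # 1: only the first record has all four fields filled
```

With a real file, the same function could be fed an open file handle, for example `iter_valid_pairs(open(path, encoding="utf-8"))`, where `path` points to wherever the download was saved.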
This dataset provides ~2,500 parallel sentence pairs between Garo and English, curated to support low-resource NLP research and tool development for Northeast Indian languages. Garo is a Tibeto-Burman language spoken primarily in Meghalaya and Assam, and remains underrepresented in mainstream AI efforts. The corpus is designed for experimentation, prototyping, and community engagement, enabling researchers and developers to build translation models, lexicons, and evaluation pipelines for Garo. It contributes to inclusive language technology and aligns with national goals of linguistic diversity in AI. MWire Labs aims to seed civic-first NLP infrastructure and amplify visibility for Northeast Indian languages within broader AI ecosystems. The dataset supports policy goals around digital inclusion, regional language preservation, and equitable AI development.
Attribution 4.0 International (CC BY 4.0)
© 2026 AIKosh. All rights reserved. This portal is developed by the National e-Governance Division for the AIKosh mission.