English Malayalam Parallel Corpus

A bilingual parallel corpus containing paired English and Malayalam sentences designed for machine translation, multilingual NLP research, and language model training.

About Dataset

This dataset is a collection of parallel text in English and Malayalam that can be used for applications such as machine translation, language learning, natural language processing, and language preservation. It contains a sample of text from various domains, including transportation and travel. Its primary objective is to facilitate the development of machine translation models for the Malayalam language and to contribute to NLP research and applications, particularly for Indian languages.

This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative led by CivicDataLab (CDL), partnering with the Gates Foundation in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation, and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).

Purpose of Dataset

This dataset is designed to support the development of Malayalam machine translation systems and to advance natural language processing research for Indian languages. It can be used for machine translation, language learning applications, multilingual NLP tasks such as text classification and sentiment analysis, language modeling, and preservation of the Malayalam language and cultural heritage.

Dataset Metadata

License

Database Contents License (DbCL) v1.0

Geographical coverage

India

Sector

Education and Skill Development

Author

Subin Erattakulangara

Source Organisation

Digital India BHASHINI Division

Uploaded by

Nikil Augustine

Data Quality Score (Beta)

4.75

Dataset type

Structured

Frequency

Static

Time Granularity

Static

Year range

N.A.

Date & Time

21/05/26 08:46:58

Visibility

Open

Hosted / Redirected

Hosted

Data Type

Hybrid

Data Collection Method

This dataset contains approximately 400,000 parallel sentences, with English sentences sourced from the COCO dataset (https://github.com/narvidhai/coco-english-malayalam-translation-corpus/) and translated into Malayalam using the Google API.
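
Because the Malayalam side was produced by machine translation, users may want to run a basic sanity check on the sentence pairs before training. The following is a minimal sketch, not part of the dataset or its tooling. It assumes the pairs have already been loaded as (English, Malayalam) tuples; the function names, the sample sentences, and the 0.5 threshold are all illustrative assumptions. It relies only on the fact that the Malayalam Unicode block spans U+0D00 to U+0D7F.

```python
from typing import Iterable, List, Tuple

# Malayalam Unicode block boundaries (U+0D00 to U+0D7F).
ML_START, ML_END = "\u0d00", "\u0d7f"


def malayalam_ratio(text: str) -> float:
    """Share of alphabetic characters (per str.isalpha) in the Malayalam block.

    Combining vowel signs are not reported as alphabetic by str.isalpha,
    so they are simply ignored by this count.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(1 for ch in letters if ML_START <= ch <= ML_END)
    return in_block / len(letters)


def filter_pairs(
    pairs: Iterable[Tuple[str, str]], min_ratio: float = 0.5
) -> List[Tuple[str, str]]:
    """Keep pairs where both sides are non-empty and the target looks Malayalam.

    min_ratio is an assumed, illustrative threshold; tune it for real data.
    """
    kept = []
    for english, malayalam in pairs:
        english, malayalam = english.strip(), malayalam.strip()
        if not english or not malayalam:
            continue  # drop pairs with a missing side
        if malayalam_ratio(malayalam) < min_ratio:
            continue  # drop pairs whose target was left untranslated
        kept.append((english, malayalam))
    return kept


# Hypothetical sample rows, not taken from the dataset.
sample = [
    ("A man riding a bike.", "ഒരാൾ ബൈക്ക് ഓടിക്കുന്നു."),
    ("A dog.", "A dog."),  # target left in English
    ("", "നായ"),           # missing source sentence
]
kept = filter_pairs(sample)
```

A character-range check of this kind only catches untranslated or empty rows. It says nothing about translation accuracy, which would need human review or a separate quality-estimation step.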