
Ganga-2-1B

The first pre-trained Hindi model by any academic research lab in India 🇮🇳!

About Model

Project Unity is an initiative to address India's linguistic diversity and richness by creating a comprehensive resource covering the country's major languages. We strive for state-of-the-art performance in understanding and generating text in Indian languages, and to that end we train models on monolingual data from India's regional languages. Our first release was the Ganga-1B model, trained on a large dataset of public-domain, web-crawled Hindi text, including news articles, web documents, books, government publications, educational materials, and social media conversations (filtered for quality). The dataset was further curated by native speakers to ensure high quality. Notably, the Ganga-2-1B model outperforms existing open-source models that support Indian languages, even those with up to 7 billion parameters.

Developed by: Lingo Research Group at IIT Gandhinagar

Model type: Autoregressive Language Model

Language(s): Bilingual (Primary: Hindi [hi], Secondary: English [en])

The Ganga-2-1b model was trained on a monolingual Hindi-language dataset as part of Project Unity. We chose the name Ganga 🌊 to honor the longest river flowing through the Hindi-speaking region of India 🇮🇳.

Disclaimer: This is a text-completion model designed for fine-tuning on downstream tasks. It is not intended for direct use as a chat or instruction-following model.

Technical Specifications 🤖
    • Precision: BFloat16
    • Context Length: 2,048
    • Learning Rate: 4e-4
    • Optimizer: AdamW
    • LR Scheduler: Cosine
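The schedule above (cosine decay from a 4e-4 peak) can be sketched as a standard linear-warmup-plus-cosine-decay rule. A minimal sketch follows; the warmup-step count and minimum learning rate are illustrative assumptions, not values from this card:

```python
import math

PEAK_LR = 4e-4  # peak learning rate listed above


def cosine_lr(step, total_steps, peak_lr=PEAK_LR, warmup_steps=2_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    warmup_steps and min_lr are illustrative; the card does not specify them.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(cosine_lr(2_000, 100_000))    # at the end of warmup: peak, 0.0004
print(cosine_lr(100_000, 100_000))  # fully decayed: 0.0
```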

    Model Architecture and Objective: 

    Ganga-2-1b is a decoder-only transformer model featuring the following specifications:

      • Layers: 16
      • Attention heads: 32
      • Embedding dimension: 2,048
      • Vocabulary size: 32,768
      • Sliding window: 1,024
      • Intermediate dimension: 7,168
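As a sanity check, the listed dimensions account for roughly one billion parameters. A back-of-the-envelope estimate, assuming full multi-head attention, a gated (SwiGLU-style) feed-forward block, and tied input/output embeddings (none of which are stated explicitly in this card):

```python
# Back-of-the-envelope parameter count from the listed architecture.
# Assumptions (not stated in the card): full multi-head attention (no GQA),
# a gated SwiGLU-style MLP, tied input/output embeddings, norms ignored.
LAYERS, D_MODEL, VOCAB, D_FF = 16, 2048, 32_768, 7_168

embed = VOCAB * D_MODEL                 # token embeddings (tied with the LM head)
attn_per_layer = 4 * D_MODEL * D_MODEL  # Q, K, V, O projections
mlp_per_layer = 3 * D_MODEL * D_FF      # gate, up, down projections
total = embed + LAYERS * (attn_per_layer + mlp_per_layer)
print(f"~{total / 1e9:.2f}B parameters")  # ~1.04B, consistent with the "1b" name
```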

      Results:

      Tokenizer Results (fertility, i.e. average tokens per word; lower is better):

      Model         Fertility
      Ganga-2-1b    1.12
      Pragna-1b     1.58
      Bloom-1b1     1.27
      Bloom-1b7     1.27
      Gemma-2b      1.89
      Bloom-3b      1.27
      Airavata-7b   1.69
      Sarvam-2b     1.38
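Fertility here is the average number of subword tokens a tokenizer emits per whitespace-delimited word; a lower value means Hindi text is represented more compactly. A minimal sketch of the computation, using a toy chunking tokenizer as a stand-in for the real ones:

```python
def fertility(tokenize, texts):
    """Average subword tokens per whitespace-delimited word (lower is better)."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words


def toy_tokenize(text):
    # Illustrative stand-in: split each word into fixed 4-character chunks.
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


print(fertility(toy_tokenize, ["भारत एक विशाल देश है"]))  # 6 tokens / 5 words = 1.2
```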

      Metrics (perplexity on the Sangraha dataset; lower is better):

      Model         PPL
      Ganga-2-1b    8.09
      Ganga-1b      15.82
      Pragna-1b     9.37
      Bloom-1b1     17.49
      Bloom-1b7     14.28
      Gemma-2b      31.01
      Bloom-3b      12.82
      OpenHathi-7B  25.73
      Airavata-7b   38.24
      Sarvam-2b     10.31
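Perplexity (PPL) is the exponential of the mean per-token negative log-likelihood the model assigns to held-out text, so Ganga-2-1b's 8.09 corresponds to a mean loss of about ln(8.09) ≈ 2.09 nats per token. A minimal sketch:

```python
import math


def perplexity(token_nlls):
    """PPL = exp(mean negative log-likelihood), natural-log base."""
    return math.exp(sum(token_nlls) / len(token_nlls))


print(perplexity([math.log(8.09)] * 4))  # recovers ~8.09
print(perplexity([0.0, 0.0]))            # a perfect model has PPL 1.0
```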

      Recommendations ‼️

      This model is a research preview under ongoing iterative updates and provides only limited safety measures. It may generate offensive content. Use of the model for any illegal, harmful, violent, racist, or sexual purposes is strictly prohibited.

      Metadata

      • License: Apache 2.0
      • Contributors: Aamod Thakur, Mayank Singh
      • Categories: Large Language Models · Transformers · Open · IITGN · Science, Technology and Research
      • Uploaded: 20/08/25 05:43:32
      • File Size: 1.88 GB


      Tags

      • Text Generation


      Version Control

      Version 1 (1.88 GB), uploaded by admin 8 month(s) ago:
      • .DS_Store
      • .gitattributes
      • config.json
      • generation_config.json
      • model.safetensors
      • special_tokens_map.json
      • tokenizer_config.json
      • tokenizer.json

      More Models from IITGN

      COMI-LINGUA-POS
      This is a fine-tuned version of aya-expanse-8b for Part-of-Speech (POS) Tagging on Hinglish (Hindi-English code-mixed) text. It assigns a grammatical category to each token using a language-agnostic Universal POS tagset suitable for code-mixed content in Roman and Devanagari scripts.
      Hinglish

      COMI-LINGUA-MT
      This is a fine-tuned version of Llama-3.1-8B-Instruct for Machine Translation (MT) on Hinglish (Hindi-English code-mixed) text. It translates code-mixed input in Roman/Devanagari scripts to three target formats: (i) Standard English, (ii) Romanized Hindi, and (iii) Devanagari Hindi.
      Code-Mixing
      Hinglish

      COMI-LINGUA-MLI
      This is a fine-tuned version of aya-expanse-8b for Matrix Language Identification (MLI) on Hinglish (Hindi-English code-mixed) text. It classifies each sentence into the dominant matrix language governing its grammatical structure: hi (Hindi) or en (English).
      Hinglish
      Code-Mixing

      COMI-LINGUA-LID
      This is a fine-tuned version of aya-expanse-8b for Token-level Language Identification (LID) on Hinglish (Hindi-English code-mixed) text. It performs token-wise classification into three categories: en (English), hi (Hindi), or ot (Other).
      Code-Mixing
      Hinglish

      COMI-LINGUA-NER
      This is a fine-tuned version of aya-expanse-8b for Named Entity Recognition (NER) on Hinglish (Hindi-English code-mixed) text. It helps with token-level entity tagging (PERSON, ORGANISATION, LOCATION, DATE, TIME, GPE, HASHTAG, EMOJI, MENTION, X/Other) in Roman/Devanagari scripts. Achieves 94.90 F1 on COMI-LINGUA test set (5K instances), outperforming the zero-shot inference (59.88 F1).
      Code-Mixing
      Hinglish
