
Ganga-2-1B

The first pre-trained Hindi model by any academic research lab in India 🇮🇳!

About Model

Project Unity is an initiative to address India's linguistic diversity and richness by creating a comprehensive resource covering the country's major languages. We strive for state-of-the-art performance in understanding and generating text in Indian languages, and to that end we train models on monolingual data from India's regional languages. Our first release was the Ganga-1B model, trained on a large dataset of public-domain, web-crawled Hindi text, including news articles, web documents, books, government publications, educational materials, and social media conversations (filtered for quality). The dataset was further curated by native speakers to ensure high quality. Notably, the Ganga-2-1B model outperforms existing open-source models that support Indian languages, even those with up to 7 billion parameters.

Developed by: Lingo Research Group at IIT Gandhinagar

Model type: Autoregressive Language Model

Language(s): Bilingual (Primary: Hindi [hi], Secondary: English [en])

The Ganga-2-1b model was trained on a monolingual Hindi-language dataset as part of Project Unity. We chose the name Ganga 🌊 to honor the longest river flowing through the Hindi-speaking region of India 🇮🇳.

Disclaimer: This is a text-completion model designed for fine-tuning on downstream tasks. It is not intended for direct use as a chat or instruction-following model.

Technical Specifications 🤖
    • Precision: BFloat16
    • Context Length: 2,048
    • Learning Rate: 4e-4
    • Optimizer: AdamW
    • LR Scheduler: Cosine
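The schedule above (cosine decay from a 4e-4 peak) can be sketched as a standard linear-warmup-plus-cosine-decay rule. A minimal sketch follows; the warmup-step count and minimum learning rate are illustrative assumptions, not values from this card:

```python
import math

PEAK_LR = 4e-4  # peak learning rate listed above


def cosine_lr(step, total_steps, peak_lr=PEAK_LR, warmup_steps=2_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    warmup_steps and min_lr are illustrative; the card does not specify them.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(cosine_lr(2_000, 100_000))    # at the end of warmup: peak, 0.0004
print(cosine_lr(100_000, 100_000))  # fully decayed: 0.0
```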

    Model Architecture and Objective: 

    Ganga-2-1b is a decoder-only transformer model featuring the following specifications:

      • Layers: 16
      • Attention heads: 32
      • Embedding dimension: 2,048
      • Vocabulary size: 32,768
      • Sliding window: 1,024
      • Intermediate dimension: 7,168
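As a sanity check, the listed dimensions account for roughly one billion parameters. A back-of-the-envelope estimate, assuming full multi-head attention, a gated (SwiGLU-style) feed-forward block, and tied input/output embeddings (none of which are stated explicitly in this card):

```python
# Back-of-the-envelope parameter count from the listed architecture.
# Assumptions (not stated in the card): full multi-head attention (no GQA),
# a gated SwiGLU-style MLP, tied input/output embeddings, norms ignored.
LAYERS, D_MODEL, VOCAB, D_FF = 16, 2048, 32_768, 7_168

embed = VOCAB * D_MODEL                 # token embeddings (tied with the LM head)
attn_per_layer = 4 * D_MODEL * D_MODEL  # Q, K, V, O projections
mlp_per_layer = 3 * D_MODEL * D_FF      # gate, up, down projections
total = embed + LAYERS * (attn_per_layer + mlp_per_layer)
print(f"~{total / 1e9:.2f}B parameters")  # ~1.04B, consistent with the "1b" name
```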

      Results:

      Tokenizer Results (fertility, i.e. average tokens per word; lower is better):

      Model         Fertility
      Ganga-2-1b    1.12
      Pragna-1b     1.58
      Bloom-1b1     1.27
      Bloom-1b7     1.27
      Gemma-2b      1.89
      Bloom-3b      1.27
      Airavata-7b   1.69
      Sarvam-2b     1.38
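Fertility here is the average number of subword tokens a tokenizer emits per whitespace-delimited word; a lower value means Hindi text is represented more compactly. A minimal sketch of the computation, using a toy chunking tokenizer as a stand-in for the real ones:

```python
def fertility(tokenize, texts):
    """Average subword tokens per whitespace-delimited word (lower is better)."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words


def toy_tokenize(text):
    # Illustrative stand-in: split each word into fixed 4-character chunks.
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


print(fertility(toy_tokenize, ["भारत एक विशाल देश है"]))  # 6 tokens / 5 words = 1.2
```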

      Metrics (perplexity on the Sangraha dataset; lower is better):

      Model         PPL
      Ganga-2-1b    8.09
      Ganga-1b      15.82
      Pragna-1b     9.37
      Bloom-1b1     17.49
      Bloom-1b7     14.28
      Gemma-2b      31.01
      Bloom-3b      12.82
      OpenHathi-7B  25.73
      Airavata-7b   38.24
      Sarvam-2b     10.31
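Perplexity (PPL) is the exponential of the mean per-token negative log-likelihood the model assigns to held-out text, so Ganga-2-1b's 8.09 corresponds to a mean loss of about ln(8.09) ≈ 2.09 nats per token. A minimal sketch:

```python
import math


def perplexity(token_nlls):
    """PPL = exp(mean negative log-likelihood), natural-log base."""
    return math.exp(sum(token_nlls) / len(token_nlls))


print(perplexity([math.log(8.09)] * 4))  # recovers ~8.09
print(perplexity([0.0, 0.0]))            # a perfect model has PPL 1.0
```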

      Recommendations ‼️

      This model is a research preview under ongoing iterative updates and provides only limited safety measures. It may generate offensive content. Use of the model for any illegal, harmful, violent, racist, or sexual purposes is strictly prohibited.

      Metadata

      • License: Apache 2.0
      • Contributors: Aamod Thakur, Mayank Singh
      • Categories: Large Language Models · Transformers · Open · IITGN · Science, Technology and Research
      • Uploaded: 20/08/25 05:43:32
      • File Size: 1.88 GB


      Tags

      • Text Generation


      Version Control

      Version 1 (1.88 GB), uploaded by admin 8 month(s) ago:
      • .DS_Store
      • .gitattributes
      • config.json
      • generation_config.json
      • model.safetensors
      • special_tokens_map.json
      • tokenizer_config.json
      • tokenizer.json

      More Models from IITGN

      COMI-LINGUA-POS
      This is a fine-tuned version of aya-expanse-8b for Part-of-Speech (POS) Tagging on Hinglish (Hindi-English code-mixed) text. It assigns a grammatical category to each token using a language-agnostic Universal POS tagset suitable for code-mixed content in Roman and Devanagari scripts.
      Hinglish

      COMI-LINGUA-MT
      This is a fine-tuned version of Llama-3.1-8B-Instruct for Machine Translation (MT) on Hinglish (Hindi-English code-mixed) text. It translates code-mixed input in Roman/Devanagari scripts to three target formats: (i) Standard English, (ii) Romanized Hindi, and (iii) Devanagari Hindi.
      Code-Mixing
      Hinglish

      COMI-LINGUA-MLI
      This is a fine-tuned version of aya-expanse-8b for Matrix Language Identification (MLI) on Hinglish (Hindi-English code-mixed) text. It classifies each sentence into the dominant matrix language governing its grammatical structure: hi (Hindi) or en (English).
      Hinglish
      Code-Mixing

      COMI-LINGUA-LID
      This is a fine-tuned version of aya-expanse-8b for Token-level Language Identification (LID) on Hinglish (Hindi-English code-mixed) text. It performs token-wise classification into three categories: en (English), hi (Hindi), or ot (Other).
      Code-Mixing
      Hinglish

      COMI-LINGUA-NER
      This is a fine-tuned version of aya-expanse-8b for Named Entity Recognition (NER) on Hinglish (Hindi-English code-mixed) text. It helps with token-level entity tagging (PERSON, ORGANISATION, LOCATION, DATE, TIME, GPE, HASHTAG, EMOJI, MENTION, X/Other) in Roman/Devanagari scripts. Achieves 94.90 F1 on COMI-LINGUA test set (5K instances), outperforming the zero-shot inference (59.88 F1).
      Code-Mixing
      Hinglish
