Thore Bhasha-Setu

Tokenization: Using a custom SentencePiece tokenizer trained on the entire Indic corpus to efficiently handle the morphology of Indian languages.

About Model

Dataset Strategy


The model's performance hinges on a high-quality, diverse dataset. The "Thore Bhasha-Setu 1B" dataset would be a curated collection of approximately 200 billion tokens.

  • Data Composition:

  • Massive Monolingual Corpus (60%): Sourced from web crawls of regional news sites (e.g., Dainik Jagran, Anandabazar Patrika), digital libraries, Indic Wikipedia, and publicly available books. This builds the model's core understanding of each language's grammar and vocabulary.

  • Parallel Corpus (20%): High-quality translated text. This includes government documents (press releases and legal texts available in multiple languages), professionally translated news articles, and movie subtitles. This is crucial for translation tasks. The Samanantar corpus, released by AI4Bharat, would be a key resource here.

  • Code-Mixed & Conversational Data (15%): Sourced from social media (Twitter, Reddit), chat logs (anonymized), and movie scripts. This dataset is essential for the model to understand how Indians naturally communicate.

  • Instructional & QA Corpus (5%): A curated set of question-answer pairs, summarization tasks, and classification examples to teach the model how to follow instructions and perform specific downstream tasks.
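
The percentages above can be turned into a rough per-source token budget. The following is a minimal sketch, assuming the stated total of roughly 200 billion tokens and exact integer shares; the dictionary keys and the function name are hypothetical labels chosen for illustration and do not come from train_bhasha_setu.py.

```python
# Illustrative token budget for the mixture described above.
# Assumption: the total is exactly 200 billion tokens (the text says "approximately").
TOTAL_TOKENS = 200_000_000_000

# Shares come from the list above; the key names are hypothetical labels.
MIXTURE_PERCENT = {
    "monolingual": 60,
    "parallel": 20,
    "code_mixed_conversational": 15,
    "instruction_qa": 5,
}


def token_budget(total_tokens: int, shares_percent: dict) -> dict:
    """Split a total token count according to integer percentage shares."""
    if sum(shares_percent.values()) != 100:
        raise ValueError("mixture shares must sum to 100 percent")
    return {
        name: total_tokens * pct // 100
        for name, pct in shares_percent.items()
    }


BUDGET = token_budget(TOTAL_TOKENS, MIXTURE_PERCENT)

for name, tokens in BUDGET.items():
    # Report each share in billions of tokens.
    print(f"{name}: {tokens // 1_000_000_000}B tokens")
```

Under these assumptions the monolingual share comes to about 120 billion tokens, the parallel share to 40 billion, the code-mixed and conversational share to 30 billion, and the instructional share to 10 billion.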

  • Data Preprocessing:

  1. Cleaning: Removal of HTML tags, boilerplate text, and duplicate entries.

  2. Language Identification: Rigorous classification to ensure data is correctly labeled.

  3. Normalization: Standardizing text, especially handling variations in script and transliteration.

  4. Tokenization: Using a custom SentencePiece tokenizer trained on the entire Indic corpus to efficiently handle the morphology of Indian languages.
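
Steps 1 to 3 above could be sketched with the Python standard library alone, as shown below. This is an illustration under stated assumptions, not the pipeline in train_bhasha_setu.py: all function names are hypothetical, deduplication is exact-match only, and the "language identification" step is reduced to detecting the dominant Unicode script, which cannot separate languages that share a script (for example Hindi and Marathi in Devanagari). Step 4 is left out because it depends on the external SentencePiece library.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")   # naive HTML tag matcher, enough for a sketch
WS_RE = re.compile(r"\s+")        # any run of whitespace

# Unicode block ranges for a few Indic scripts (illustrative subset only).
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
}


def clean(text: str) -> str:
    """Step 1 (partial): strip HTML tags, decode entities, collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return WS_RE.sub(" ", text).strip()


def normalize(text: str) -> str:
    """Step 3 (partial): map canonically equivalent sequences to one form (NFC)."""
    return unicodedata.normalize("NFC", text)


def dominant_script(text: str) -> str:
    """Step 2 (stand-in): return the script with the most characters in the text."""
    counts = {}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] = counts.get(name, 0) + 1
                break
    return max(counts, key=counts.get) if counts else "Unknown"


def preprocess(docs):
    """Clean and normalize each document, drop empties and exact duplicates."""
    seen = set()
    out = []
    for raw in docs:
        text = normalize(clean(raw))
        if not text or text in seen:
            continue
        seen.add(text)
        out.append((dominant_script(text), text))
    return out


SAMPLE = ["<p>नमस्ते  दुनिया</p>", "नमस्ते दुनिया", "<div>வணக்கம்</div>", ""]
RESULT = preprocess(SAMPLE)
```

In this sample the first two inputs collapse to a single Devanagari entry after cleaning, the Tamil input is kept, and the empty string is dropped. A production pipeline would more likely use a trained language identifier and near-duplicate detection; both are beyond what this sketch shows.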

Metadata

Attribution 4.0 International (CC BY 4.0)

Alok Kumar

Translation Model

ONNX

Open

Thore Network PVT LTD

Sector Agnostic

04/10/25 05:21:22

Alok Kumar

170.07 KB

train_bhasha_setu.py (170.07 KB)

Activity Overview

  • Downloads 12
  • File Size 170.07 KB
  • Views 337

Tags

  • Bhashini

License Control

Attribution 4.0 International (CC BY 4.0)

Version Control

Version 1 (170.07 KB)
  • admin · 7 months ago
    • train_bhasha_setu.py
