Government Of India

Thore Bhasha-Setu


About Model

Dataset Strategy


The model's performance hinges on a high-quality, diverse dataset. The "Thore Bhasha-Setu 1B" dataset would be a curated collection of approximately 200 billion tokens.

  • Data Composition:

  • Massive Monolingual Corpus (60%): Sourced from web crawls of regional news sites (e.g., Dainik Jagran, Anandabazar Patrika), digital libraries, Indic Wikipedia, and publicly available books. This builds the model's core understanding of each language's grammar and vocabulary.

  • Parallel Corpus (20%): High-quality translated text. This includes government documents (press releases, legal texts available in multiple languages), professionally translated news articles, and movie subtitles. This is crucial for translation tasks. The Bhashini project's Samanantar corpus would be a key resource here.

  • Code-Mixed & Conversational Data (15%): Sourced from social media (Twitter, Reddit), chat logs (anonymized), and movie scripts. This dataset is essential for the model to understand how Indians naturally communicate.

  • Instructional & QA Corpus (5%): A curated set of question-answer pairs, summarization tasks, and classification examples to teach the model how to follow instructions and perform specific downstream tasks.
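The composition above can be read as sampling weights for assembling training batches. A minimal sketch, assuming a simple weighted sampler (the corpus names and the sampler itself are illustrative, not the published Bhasha-Setu training code):

```python
import random

# Sampling weights mirroring the stated data composition; the corpus
# names and this sampler are illustrative assumptions.
MIXTURE = {
    "monolingual": 0.60,    # web crawls, Indic Wikipedia, books
    "parallel": 0.20,       # translated text, e.g. Samanantar-style pairs
    "code_mixed": 0.15,     # social media, chat logs, movie scripts
    "instructional": 0.05,  # QA pairs, summarization, classification
}

def sample_source(rng: random.Random) -> str:
    """Pick which corpus the next training example is drawn from."""
    names = list(MIXTURE)
    return rng.choices(names, weights=[MIXTURE[n] for n in names], k=1)[0]

# Draw 10,000 examples and tally how often each corpus is chosen;
# the tallies converge on the 60/20/15/5 split.
rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
```

In practice the mixture would be applied per training step over pre-shuffled shards rather than per example, but the proportions behave the same way.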

  • Data Preprocessing:

  1. Cleaning: Removal of HTML tags, boilerplate text, and duplicate entries.

  2. Language Identification: Rigorous classification to ensure data is correctly labeled.

  3. Normalization: Standardizing text, especially handling variations in script and transliteration.

  4. Tokenization: Using a custom SentencePiece tokenizer trained on the entire Indic corpus to efficiently handle the morphology of Indian languages.
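Steps 1 and 3 can be sketched with the Python standard library alone. The regexes, helper names, and sample lines below are illustrative assumptions; language identification (step 2) and SentencePiece training (step 4) would run on the output of this stage:

```python
import hashlib
import re
import unicodedata

# Illustrative sketch of cleaning, normalization, and deduplication;
# the exact rules in the real pipeline are assumptions here.
TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag stripper
WS_RE = re.compile(r"\s+")

def clean(text: str) -> str:
    """Step 1: drop HTML tags/boilerplate and collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()

def normalize(text: str) -> str:
    """Step 3: Unicode NFC so canonically equivalent Indic strings
    share one byte representation."""
    return unicodedata.normalize("NFC", text)

def dedupe(lines):
    """Step 1 (cont.): keep only the first copy of each line,
    comparing hashes of the already-normalized text."""
    seen = set()
    for line in lines:
        key = hashlib.sha256(line.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield line

raw = [
    "<p>यह एक   उदाहरण है।</p>",
    "यह एक उदाहरण है।",  # duplicate once tags and spacing are cleaned
    "<div>This is code-mixed, yaar!</div>",
]
cleaned = [normalize(clean(line)) for line in raw]
corpus = list(dedupe(cleaned))  # two unique lines survive
```

Normalizing before hashing matters: two visually identical Indic strings can differ at the codepoint level, and without NFC the deduplication pass would miss them.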

Metadata

  • License: Attribution 4.0 International (CC BY 4.0)
  • Contributor: Alok Kumar
  • Model Type: Translation Model
  • Framework: ONNX
  • Access: Open
  • Organization: Thore Network PVT LTD
  • Sector: Sector Agnostic
  • Uploaded: 04/10/25 05:21:22
  • Uploaded By: Alok Kumar
  • File Size: 170.07 KB

Activity Overview

  • Downloads: 9
  • Views: 206
  • File Size: 170.07 KB

Tags

  • Bhashini

License Control

Attribution 4.0 International (CC BY 4.0)

Version Control

  • Version 1 (170.07 KB)
    • admin · 4 months ago
      • train_bhasha_setu.py

Related Models

txgemma-9b-predict (google/txgemma-9b-predict)
Tags: Transformers, safetensors, gemma2, Text Generation, therapeutics, drug-development, en, arxiv:2504.06196, arxiv:2406.06316, license:other, autotrain_compatible, text-generation-inference, endpoints_compatible, region:us
  • Upvoters: 0
  • Downloads: 2
  • File Size: 0
  • Views: 102
Updated 5 months ago

GOOGLE LLC

More Models from Thore Network PVT LTD


Project MAARG: AI-Powered Road Safety for India
An AI model that predicts real-time road accident risk by analyzing live traffic, weather, and historical data to enhance public safety.
Tags: Smart Mobility, Citizen Engagement, Traffic Safety, AI monitoring
  • Upvoters: 1
  • Downloads: 38
  • File Size: 7.99 KB
  • Views: 539
Updated 5 months ago

THORE NETWORK PVT LTD