Thore Bhasha-Setu

Thore Network PVT LTD
Thorenetwork

About Model

Dataset Strategy

The model's performance hinges on a high-quality, diverse dataset. The "Thore Bhasha-Setu 1B" dataset would be a curated collection of approximately 200 billion tokens.

Data Composition:

Massive Monolingual Corpus (60%): Sourced from web crawls of regional news sites (e.g., Dainik Jagran, Anandabazar Patrika), digital libraries, Indic Wikipedia, and publicly available books. This builds the model's core understanding of each language's grammar and vocabulary.
Parallel Corpus (20%): High-quality translated text, including government documents (press releases and legal texts available in multiple languages), professionally translated news articles, and movie subtitles. This is crucial for translation tasks. AI4Bharat's Samanantar corpus would be a key resource here.
Code-Mixed & Conversational Data (15%): Sourced from social media (Twitter, Reddit), anonymized chat logs, and movie scripts. This data is essential for the model to understand how Indians naturally communicate, which often means switching between languages within a single sentence (for example, Hindi and English).
Instructional & QA Corpus (5%): A curated set of question-answer pairs, summarization tasks, and classification examples to teach the model how to follow instructions and perform specific downstream tasks.
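
To make the composition concrete, the shares above can be turned into per-component token counts. The following is a minimal sketch that assumes the approximate 200 billion token total stated earlier; the function and key names are illustrative and not part of any released tooling.

```python
# Illustrative token budget. The percentages and the ~200B total come from
# the composition described above; all names here are hypothetical.
TOTAL_TOKENS = 200_000_000_000

SHARES_PERCENT = {
    "monolingual": 60,
    "parallel": 20,
    "code_mixed_conversational": 15,
    "instructional_qa": 5,
}


def token_budget(total, shares_percent):
    """Return the token count per component using integer arithmetic."""
    if sum(shares_percent.values()) != 100:
        raise ValueError("shares must sum to 100 percent")
    return {name: total * pct // 100 for name, pct in shares_percent.items()}


BUDGET = token_budget(TOTAL_TOKENS, SHARES_PERCENT)
for name, tokens in BUDGET.items():
    print(f"{name}: {tokens / 1e9:.0f}B tokens")
```

Under these assumptions the monolingual corpus accounts for roughly 120 billion tokens and the instructional corpus for roughly 10 billion.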

Data Preprocessing:

Cleaning: Removal of HTML tags, boilerplate text, and duplicate entries.
Language Identification: Classifying each document by language so that data is correctly labeled and misclassified or mixed-language text can be routed to the right part of the corpus.
Normalization: Standardizing text, especially handling variations in script and transliteration.
Tokenization: Using a custom SentencePiece tokenizer trained on the entire Indic corpus to efficiently handle the morphology of Indian languages.
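
The first three steps above could be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not the project's actual pipeline: the function names are hypothetical, deduplication is exact-match only, and script detection by Unicode block is a coarse stand-in for real language identification (Hindi and Marathi, for instance, share the Devanagari script and would need a trained classifier to tell apart).

```python
import hashlib
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")

# Unicode block ranges for a few scripts; extend as needed (assumption: only
# these scripts matter for the example).
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Tamil": (0x0B80, 0x0BFF),
}


def clean(text):
    """Strip HTML tags and entities, normalize to NFC, collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    text = unicodedata.normalize("NFC", text)
    return WS_RE.sub(" ", text).strip()


def dominant_script(text):
    """Return the script with the most characters in text, or None."""
    counts = {}
    for ch in text:
        cp = ord(ch)
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[script] = counts.get(script, 0) + 1
                break
        else:
            if ch.isascii() and ch.isalpha():
                counts["Latin"] = counts.get("Latin", 0) + 1
    return max(counts, key=counts.get) if counts else None


def deduplicate(docs):
    """Drop exact duplicates while preserving first-seen order."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def preprocess(raw_docs):
    """Clean, drop empty documents, deduplicate, and tag each with a script."""
    cleaned = [clean(d) for d in raw_docs]
    cleaned = [d for d in cleaned if d]
    return [(dominant_script(d), d) for d in deduplicate(cleaned)]
```

Cleaning runs before deduplication so that two copies of the same sentence wrapped in different markup collapse to one entry. NFC normalization likewise ensures that a precomposed letter and its base-plus-nukta spelling end up as the same code point sequence.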