RomanSetu efficiently unlocks the multilingual (Indian languages) capabilities of Large Language Models via romanization.
RomanSetu presents an approach that continually pretrains an English-centric LLM, such as Llama 2, on romanized text of non-English languages written in non-Roman scripts, followed by instruction tuning on romanized data. For continual pretraining, approximately 200 million words per language were collected and transliterated into the Roman script using IndicXlit.
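The transliteration step above is performed by IndicXlit, a learned model. As a purely illustrative sketch of what script-to-Roman transliteration does, the following uses a tiny hypothetical Devanagari-to-Roman mapping (a handful of characters only; this is not IndicXlit's actual scheme or API):

```python
# Hypothetical, simplified Devanagari-to-Roman transliteration sketch.
# Real systems like IndicXlit use trained models and full character coverage.

CONSONANTS = {"न": "na", "म": "ma", "स": "sa", "त": "ta"}  # consonant + inherent 'a'
MATRAS = {"े": "e"}  # vowel signs replace the inherent 'a'
VIRAMA = "\u094d"    # ् suppresses the inherent 'a'


def romanize(text: str) -> str:
    out = []
    for ch in text:
        if ch == VIRAMA:
            # Drop the inherent 'a' of the preceding consonant.
            if out and out[-1].endswith("a"):
                out[-1] = out[-1][:-1]
        elif ch in MATRAS:
            # Vowel sign replaces the inherent 'a'.
            if out and out[-1].endswith("a"):
                out[-1] = out[-1][:-1]
            out.append(MATRAS[ch])
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        else:
            out.append(ch)  # pass through anything unmapped
    return "".join(out)


print(romanize("नमस्ते"))  # namaste
```

The romanized output keeps the language's content in a script the English LLM has seen extensively during pretraining, which is the intuition behind the approach.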
Llama 2 Community License Agreement
Jaavid Aktar Husain, Raj Dabre, Aswanth Kumar, Jay Gala, Thanmay Jayakumar, Ratish Puduppully, Anoop Kunchukuttan
Multilingual Model
N.A.
Open
Sector Agnostic
02/05/25 11:00:58