RomanSetu: Efficiently unlocking multilingual (Indian languages) capabilities of Large Language Models via Romanization.
RomanSetu presents an approach in which an English LLM, such as Llama 2, is continually pretrained on romanized text of non-English languages written in non-Roman scripts, and is then instruction-tuned on romanized data. For continual pretraining, approximately 100 million words per language were collected from native-script datasets and transliterated into Roman script using IndicXlit.
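The data preparation step described above (romanizing a native-script corpus up to a per-language word budget) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `toy_transliterate` and its lookup table are hypothetical stand-ins for a real transliteration model such as IndicXlit, and `romanize_corpus` and `word_budget` are names introduced here for the example only.

```python
from typing import Callable, Iterable, Iterator

# Toy word-level lookup standing in for a real transliteration model.
# These entries are illustrative only and are NOT IndicXlit output.
TOY_HINDI_TO_ROMAN = {
    "नमस्ते": "namaste",
    "दुनिया": "duniya",
}


def toy_transliterate(word: str) -> str:
    """Return a romanized form if the word is known, else the word unchanged."""
    return TOY_HINDI_TO_ROMAN.get(word, word)


def romanize_corpus(
    lines: Iterable[str],
    transliterate_word: Callable[[str], str],
    word_budget: int,
) -> Iterator[str]:
    """Yield romanized lines, stopping before the word budget is exceeded."""
    emitted = 0
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip blank lines
        if emitted + len(words) > word_budget:
            break  # the next line would exceed the per-language budget
        emitted += len(words)
        yield " ".join(transliterate_word(w) for w in words)


native_lines = ["नमस्ते दुनिया", "", "नमस्ते"]
romanized = list(romanize_corpus(native_lines, toy_transliterate, word_budget=3))
print(romanized)
```

In a real setting, `transliterate_word` would wrap an actual transliteration model, and `word_budget` would be set to the per-language target (roughly 100 million words in the description above). The romanized lines would then feed continual pretraining.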
Llama 2 Community License Agreement
Jaavid Aktar Husain, Raj Dabre, Aswanth Kumar, Jay Gala, Thanmay Jayakumar, Ratish Puduppully, Anoop Kunchukuttan
Multilingual Model
N.A.
Open
Sector Agnostic
02/05/25 11:01:00