RomanSetu efficiently unlocks the multilingual capabilities of large language models (LLMs) for Indian languages via romanization.
RomanSetu continually pretrains an English LLM, such as Llama 2, on romanized text from non-English languages written in non-Roman scripts, then instruction-tunes it on romanized data. For continual pretraining, approximately 300 million words per language were collected and transliterated with IndicXlit to produce the romanized datasets.
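The data-preparation step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `romanize_corpus`, `toy_romanize_word`, and `TOY_LEXICON` are hypothetical names, and the tiny lookup table merely stands in for a real transliteration model such as IndicXlit. The 300-million-word budget is the approximate per-language figure quoted in the description.

```python
from typing import Callable, Iterable, List

# Approximate per-language word budget quoted in the description.
WORD_BUDGET_PER_LANGUAGE = 300_000_000

# Hypothetical stand-in for a transliteration model (NOT IndicXlit):
# a two-entry Hindi lexicon used purely for illustration.
TOY_LEXICON = {"नमस्ते": "namaste", "दुनिया": "duniya"}


def toy_romanize_word(word: str) -> str:
    """Return the romanized form if known, else leave the word unchanged."""
    return TOY_LEXICON.get(word, word)


def romanize_corpus(
    lines: Iterable[str],
    romanize_word: Callable[[str], str],
    word_budget: int = WORD_BUDGET_PER_LANGUAGE,
) -> List[str]:
    """Romanize native-script lines word by word until the budget is reached."""
    romanized: List[str] = []
    words_used = 0
    for line in lines:
        words = line.split()
        # Stop before a line that would exceed the word budget.
        if words_used + len(words) > word_budget:
            break
        romanized.append(" ".join(romanize_word(w) for w in words))
        words_used += len(words)
    return romanized


if __name__ == "__main__":
    print(romanize_corpus(["नमस्ते दुनिया"], toy_romanize_word))
```

In practice the `romanize_word` callable would wrap an actual transliteration model, and the resulting romanized lines would feed continual pretraining before instruction tuning on romanized data.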
Llama 2 Community License Agreement
Jaavid Aktar Husain, Raj Dabre, Aswanth Kumar, Jay Gala, Thanmay Jayakumar, Ratish Puduppully, Anoop Kunchukuttan
Multilingual Model
N.A.
Open
Sector Agnostic
02/05/25 11:00:59
0