Sangraha is the largest high-quality, cleaned Indic language pretraining dataset, containing 251B tokens across 22 languages, extracted from curated sources, existing multilingual corpora, and large-scale translations.
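Sangraha contains three broad components: Verified, Synthetic, and Unverified. The table below lists token counts (in millions) per language for each component.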
| Lang Code | Verified (M tokens) | Synthetic (M tokens) | Unverified (M tokens) | Total (M tokens) |
|---|---|---|---|---|
| asm | 292.1 | 11,696.4 | 17.5 | 12,006.0 |
| ben | 10,604.4 | 13,814.1 | 5,608.8 | 30,027.5 |
| brx | 1.5 | - | - | 1.5 |
| doi | 0.06 | - | - | 0.06 |
| eng | 12,759.9 | - | - | 12,759.9 |
| gom | 10.1 | - | - | 10.1 |
| guj | 3,647.9 | 12,934.5 | 597.0 | 17,179.4 |
| hin | 12,617.3 | 9,578.7 | 12,348.3 | 34,544.3 |
| kan | 1,778.3 | 12,087.4 | 388.8 | 14,254.5 |
| kas | 0.5 | - | - | 0.5 |
| mai | 14.6 | - | - | 14.6 |
| mal | 2,730.8 | 13,130.0 | 547.8 | 16,408.6 |
| mar | 2,827.0 | 10,816.7 | 652.1 | 14,295.8 |
| mni | 7.4 | - | - | 7.4 |
| npi | 1,822.5 | 10,588.7 | 485.5 | 12,896.7 |
| ori | 1,177.1 | 11,338.0 | 23.7 | 12,538.8 |
| pan | 1,075.3 | 9,969.6 | 136.9 | 11,181.8 |
| san | 1,329.0 | 13,553.5 | 9.8 | 14,892.3 |
| sat | 0.3 | - | - | 0.3 |
| snd | 258.2 | - | - | 258.2 |
| tam | 3,985.1 | 11,859.3 | 1,515.9 | 17,360.3 |
| urd | 3,658.1 | 9,415.8 | 1,328.2 | 14,402.1 |
| tel | 3,706.8 | 11,924.5 | 647.4 | 16,278.7 |
| Total | 64,306.1 | 162,707.9 | 24,307.7 | 251,321.0 |
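To make the relative size of the three components easier to see, the following minimal sketch computes each component's share of the corpus from the "Total" row of the table. The names `component_totals` and `component_shares` are illustrative only and are not part of any Sangraha tooling. The three component figures sum to 251,321.7, which differs slightly from the listed 251,321.0, presumably because of rounding, so the sketch divides by the computed sum.

```python
# Component totals in millions of tokens, copied from the "Total" row of the table.
component_totals = {
    "Verified": 64_306.1,
    "Synthetic": 162_707.9,
    "Unverified": 24_307.7,
}


def component_shares(totals):
    """Return each component's share of the summed total, as a percentage."""
    grand_total = sum(totals.values())
    return {name: 100.0 * value / grand_total for name, value in totals.items()}


shares = component_shares(component_totals)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

On these figures, synthetic data accounts for roughly two thirds of the tokens, verified data for about a quarter, and unverified data for just under a tenth.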
License: Creative Commons Attribution 4.0 International (CC BY 4.0)