
Sanmati engages women to create “gold-standard” datasets that reflect their lived experiences, helping to identify and mitigate gender bias in Indic language models.
Large Language Models (LLMs) are increasingly used in translation systems, digital assistants, search engines, and public digital services. However, these models are typically trained on massive datasets collected from the internet, which often contain historical and cultural biases. As a result, LLMs may reproduce or amplify stereotypes, particularly around gender roles, occupations, and social norms. This challenge becomes even more complex in multilingual societies like India, where linguistic nuances, cultural contexts, and regional variations influence how bias appears in language. Sanmati addresses this issue by developing a participatory framework that actively involves women in identifying and mitigating gender bias in Indic language models.
The project focuses on building high-quality “gold-standard” datasets that reflect women’s lived experiences across different linguistic and cultural contexts. Instead of relying solely on automated data collection, Sanmati employs women contributors to manually analyze, code, and benchmark language data in their native languages. These contributors, known as “Karya Champions,” participate in structured workshops where they discuss how gender bias manifests in everyday language. Through these discussions, participants help design culturally relevant coding frameworks that capture subtle forms of bias that automated systems may overlook.
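In practice, a coding framework like this typically yields one structured annotation record per reviewed item. The sketch below is a minimal, hypothetical illustration of such a record in Python; the field names, category labels, and JSON Lines output are assumptions made for clarity, not Sanmati’s documented schema.

```python
# Minimal sketch of a hypothetical "bias-coded" annotation record.
# Field names and category labels are illustrative assumptions,
# not Sanmati's actual coding framework.
from dataclasses import dataclass, asdict
import json

@dataclass
class BiasAnnotation:
    sentence: str             # text reviewed by a contributor
    language: str             # ISO 639-1 code, e.g. "hi" for Hindi
    bias_category: str        # e.g. "occupation", "domestic-role"
    is_biased: bool           # contributor's judgment
    rationale: str            # short note explaining the label
    counterfactual: str = ""  # balanced rewrite, if one was supplied

record = BiasAnnotation(
    sentence="The nurse said she would arrive soon.",
    language="en",
    bias_category="occupation",
    is_biased=True,
    rationale="Assumes nurses are women.",
    counterfactual="The nurse said they would arrive soon.",
)

# Serialize to JSON Lines so records from many contributors
# can be pooled into an evaluation dataset.
print(json.dumps(asdict(record), ensure_ascii=False))
```

Keeping the contributor’s rationale alongside the label preserves the culturally specific reasoning that automated pipelines tend to lose.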
Once the framework is established, participants review sentences, dialogues, and prompts to identify gendered assumptions, stereotypes, and exclusionary patterns. For example, language models often associate certain professions or behaviors with specific genders. By labeling such patterns and generating alternative, balanced examples, the project produces “bias-coded” datasets that can be used to evaluate and improve AI systems. These datasets enable researchers and developers to test language models for gender bias and implement corrective strategies during model training or evaluation.
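To make the evaluation step concrete: one common way to test a masked language model with such paired examples is to compare pseudo-log-likelihood scores for a stereotyped sentence and its counterfactual, in the spirit of CrowS-Pairs-style probing. The sketch below assumes a multilingual BERT checkpoint and a simple scoring heuristic; it illustrates the general technique, not the project’s published methodology.

```python
# Hedged sketch: scoring a stereotyped/counterfactual sentence pair with a
# masked LM. The checkpoint and scoring heuristic are assumptions, not
# Sanmati's documented pipeline.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "bert-base-multilingual-cased"  # assumed stand-in for an Indic LM
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum each token's log-probability when it is masked in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

stereotyped = "The engineer explained his design to the team."
counterfactual = "The engineer explained her design to the team."

# If a model consistently scores stereotyped variants higher across many
# pairs, that asymmetry is evidence of gender bias.
print(pseudo_log_likelihood(stereotyped))
print(pseudo_log_likelihood(counterfactual))
```

Aggregated over an entire bias-coded dataset, score asymmetries like this give developers a measurable target for corrective strategies during training or evaluation.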
Beyond the technical outcomes, Sanmati also aims to foster inclusive participation in AI development. Women who contribute to the project are compensated for their work, recognizing the value of their linguistic expertise and lived experiences. This approach transforms dataset creation into a form of dignified digital labor while ensuring that communities historically underrepresented in technology development can actively shape the systems that affect them.
The initiative also emphasizes representation across multiple Indic languages, acknowledging that gender bias does not manifest uniformly across regions or cultures. By capturing these linguistic nuances, the resulting datasets serve as more accurate benchmarks for evaluating and improving AI fairness in multilingual contexts.
Ultimately, Sanmati demonstrates that addressing bias in AI requires more than technical fixes. By combining participatory design, community engagement, and structured data creation, the project ensures that women’s voices and experiences are directly embedded into the datasets used to train and evaluate language models. In doing so, it helps build AI systems that are more equitable, culturally aware, and reflective of the diverse societies they serve.
For additional context and detailed documentation of this use case, please refer to pages 64-67 in the attached Casebook.