Indian Flag
Government Of India
A-
A
A+

Sanmati: Participatory Bias-mitigation for Indic Language Models

Sanmati engages women to create "gold-standard" datasets that reflect their lived experiences, helping to identify and mitigate gender bias in Indic language models.

About Use Case

Large Language Models (LLMs) are increasingly used in translation systems, digital assistants, search engines, and public digital services. However, these models are typically trained on massive datasets collected from the internet, which often contain historical and cultural biases. As a result, LLMs may reproduce or amplify stereotypes, particularly around gender roles, occupations, and social norms. This challenge becomes even more complex in multilingual societies like India, where linguistic nuances, cultural contexts, and regional variations influence how bias appears in language. Sanmati addresses this issue by developing a participatory framework that actively involves women in identifying and mitigating gender bias in Indic language models.

The project focuses on building high-quality “gold-standard” datasets that reflect women’s lived experiences across different linguistic and cultural contexts. Instead of relying solely on automated data collection, Sanmati employs women contributors to manually analyze, code, and benchmark language data in their native languages. These contributors, known as “Karya Champions,” participate in structured workshops where they discuss how gender bias manifests in everyday language. Through these discussions, participants help design culturally relevant coding frameworks that capture subtle forms of bias that automated systems may overlook.

Once the framework is established, participants review sentences, dialogues, and prompts to identify gendered assumptions, stereotypes, and exclusionary patterns. For example, language models often associate certain professions or behaviors with specific genders. By labeling such patterns and generating alternative, balanced examples, the project produces “bias-coded” datasets that can be used to evaluate and improve AI systems. These datasets enable researchers and developers to test language models for gender bias and implement corrective strategies during model training or evaluation.

Beyond the technical outcomes, Sanmati also aims to create inclusive participation in AI development. Women who contribute to the project are compensated for their work, recognizing the value of their linguistic expertise and lived experiences. This approach transforms dataset creation into a form of dignified digital labor while ensuring that communities historically underrepresented in technology development can actively shape the systems that affect them.

The initiative also emphasizes representation across multiple Indic languages, acknowledging that gender bias does not manifest uniformly across regions or cultures. By capturing these linguistic nuances, the resulting datasets provide more accurate tools for improving AI fairness in multilingual contexts.

Ultimately, Sanmati demonstrates that addressing bias in AI requires more than technical fixes. By combining participatory design, community engagement, and structured data creation, the project ensures that women’s voices and experiences are directly embedded into the datasets used to train and evaluate language models. In doing so, it helps build AI systems that are more equitable, culturally aware, and reflective of the diverse societies they serve.

For additional context and detailed documentation of this use case, please refer to pages 64-67 in the attached Casebook.

Source Organization Source Organization

IndiaAI

Tags Tags

  • Gender Equality
  • Gender Empowerment

Tags Sector

Science, Technology and Research

Resources Resources

Related Datasets Related Datasets

Updated 4 month(s) ago
Garo-English Parallel Corpus
Garo-English Parallel Corpus
Information-
A curated set of ~2,500 Garo-English parallel sentence pairs released by MWire Labs to support low-resource translation and experimentation in Northeast Indian languages.
English
Parallel Corpus
low-resource
Garo
northeast-india
tibeto-burman
generated_from_other_dataset
A'chik
  • See Upvoters0
  • Downloads22
  • File Size0
  • Views203

MWIRE LABS