
LecaDis

Dictionary Constrained Disambiguation for Improved NMT

About Model

Domain-specific neural machine translation (NMT) systems (e.g., in educational applications) are socially significant, with the potential to help make information accessible to a diverse set of users in multilingual societies. Such NMT systems should be lexically constrained and draw from domain-specific dictionaries. Dictionaries could present multiple candidate translations for a source word or phrase due to the polysemous nature of words. The onus is then on the NMT model to choose the contextually most appropriate candidate. Prior work has largely ignored this problem and focused on the single-candidate constraint setting, wherein the target word or phrase is replaced by a single constraint. In this work, we present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries. We achieve this by augmenting training data with multiple dictionary candidates to actively encourage disambiguation during training by implicitly aligning multiple candidate constraints. We demonstrate the utility of DictDis via extensive experiments on English-Hindi, English-German, and English-French datasets across a variety of domains, including regulatory, finance, engineering, health, and standard benchmark test datasets. In comparison with existing approaches for lexically constrained and unconstrained NMT, we demonstrate superior performance for the copy constraint and disambiguation-related measures on all domains, while also obtaining improved fluency of up to 2-3 BLEU points on some domains. We also release our test set consisting of 4K English-Hindi sentences in multiple domains.
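
The augmentation idea described above can be illustrated with a minimal sketch. It is not the released preprocessing code: the separator tokens `<sep>` and `<alt>`, the function name `augment_source`, and the toy dictionary entry are all assumptions made for illustration. The sketch only shows the general shape of appending every dictionary candidate for a matched source word to the source sentence, so that a model trained on such inputs has to pick the candidate that fits the context.

```python
from typing import Dict, List

SEP = "<sep>"  # hypothetical token separating the sentence from each constraint group
ALT = "<alt>"  # hypothetical token separating candidates for the same source word


def augment_source(sentence: str, dictionary: Dict[str, List[str]]) -> str:
    """Append all dictionary candidates for each matched source word (illustrative only)."""
    parts = [sentence]
    seen = set()
    for token in sentence.split():
        # Naive normalisation: lowercase and strip trailing/leading punctuation.
        key = token.lower().strip(".,;:!?")
        candidates = dictionary.get(key)
        if candidates and key not in seen:
            seen.add(key)
            parts.append(f" {ALT} ".join(candidates))
    return f" {SEP} ".join(parts)


# Toy English-Hindi entry: "bank" as a financial institution vs. a river bank.
toy_dictionary = {"bank": ["बैंक", "किनारा"]}
augmented = augment_source("She sat on the bank of the river.", toy_dictionary)
print(augmented)
```

In this toy input both candidates for "bank" are attached to the sentence; choosing between them is left to the model, which is the disambiguation step the description refers to.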


Metadata

CC0 1.0 Public Domain

IIT Bombay

Multilingual Model

PyTorch

Open

IIT Bombay

Education and Skill Development

14/05/25 11:54:12

4.70 GB

Activity Overview

  • Downloads 25
  • Views 696
  • File Size 4.70 GB

Tags

  • Multilingual Dataset
  • Indian Language
  • natural language processing (NLP)

License Control

CC0 1.0 Public Domain

Version Control

Version 1 (4.70 GB)
  • admin · 9 month(s) ago
    • lecaDis1_model
      • final_bin
      • models
      • vocab
