A model that integrates the Llama3.2-1B language model with the EVA02-L-14-336 visual encoder to improve cross-modal understanding and image-text retrieval.
The LLM2CLIP-Llama3.2-1B-EVA02-L-14-336 model is part of the LLM2CLIP series, which extends the capabilities of CLIP models by combining Large Language Models (LLMs) with advanced visual encoders. This integration allows the model to process longer and more detailed textual descriptions, overcoming the context window limitations of traditional CLIP text encoders. By fine-tuning the LLM in the caption space with contrastive learning, the model improves the textual discriminability of its output embeddings. This yields substantial gains on cross-modal tasks such as image-text retrieval and zero-shot image classification. Experiments show that the method significantly boosts performance, turning a CLIP model trained solely on English data into a state-of-the-art cross-lingual model. Moreover, when used in multimodal training with models such as LLaVA 1.5, it consistently outperforms traditional CLIP models across nearly all benchmarks.
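Since the description refers to contrastive fine-tuning in caption space and image-text retrieval, the sketch below illustrates the general CLIP-style contrastive objective and retrieval scoring in PyTorch. It is a minimal illustration, not the repository's actual training or inference API; the embedding dimension, temperature value, and random stand-in tensors are assumptions for demonstration only.

```python
import torch
import torch.nn.functional as F

def contrastive_retrieval_scores(image_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Cosine-similarity scores between batches of image and caption embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Rows index images, columns index captions; higher score = better match.
    return image_emb @ text_emb.t() / temperature

def symmetric_contrastive_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style InfoNCE loss: matched (image, caption) pairs lie on the diagonal."""
    logits = contrastive_retrieval_scores(image_emb, text_emb, temperature)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> caption direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # caption -> image direction
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Random tensors stand in for the two encoders; in practice image_emb would
    # come from the EVA02 visual tower and text_emb from the Llama3.2-1B-based
    # text tower, both projected into a shared embedding space (512-dim assumed here).
    torch.manual_seed(0)
    image_emb = torch.randn(8, 512)
    text_emb = torch.randn(8, 512)
    scores = contrastive_retrieval_scores(image_emb, text_emb)
    print("best caption per image:", scores.argmax(dim=1).tolist())
    print("contrastive loss:", symmetric_contrastive_loss(image_emb, text_emb).item())
```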
Apache 2.0
Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu
vision foundation model, feature backbone
Other
Open
Sector Agnostic
20/08/25 05:43:55
0
Apache 2.0