A model integrating Large Language Models (LLMs) with the EVA02 visual encoder to enhance cross-modal understanding.
LLM2CLIP-EVA02-L-14-336 is designed to extend the capabilities of the original CLIP model by pairing the EVA02 visual encoder with an LLM-based text encoder. This combination enables the model to process complex and lengthy textual descriptions, overcoming the context-window limitation of the standard CLIP text encoder, and improves performance on cross-modal tasks such as image-text retrieval and zero-shot image classification by leveraging the strengths of both LLMs and advanced visual encoders.
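The model is typically consumed through the Hugging Face transformers library. The sketch below shows how one might extract image embeddings; the repository id microsoft/LLM2CLIP-EVA02-L-14-336, the companion CLIP image processor, and the get_image_features method are assumptions based on the public LLM2CLIP release and may differ for this listing, so consult the model card for the exact API.

```python
# Minimal sketch: extracting image embeddings with LLM2CLIP-EVA02-L-14-336.
# Assumes the checkpoint is published on the Hugging Face Hub as
# "microsoft/LLM2CLIP-EVA02-L-14-336" and exposes get_image_features()
# via trust_remote_code; adjust to the actual distribution of this model.
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# The EVA02-L/14 backbone operates at 336x336; the matching-resolution
# CLIP image processor handles resizing and normalization.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model = AutoModel.from_pretrained(
    "microsoft/LLM2CLIP-EVA02-L-14-336",  # assumed repo id
    torch_dtype=dtype,
    trust_remote_code=True,
).to(device).eval()

image = Image.open("example.jpg")  # hypothetical input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device=device, dtype=dtype)

with torch.no_grad():
    # Embedding in the shared image-text space learned with the
    # LLM-supervised text encoder; usable for retrieval or zero-shot tasks.
    image_features = model.get_image_features(pixel_values)

print(image_features.shape)
```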
Apache 2.0
Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu
vision foundation model, feature backbone
Other
Open
Sector Agnostic
20/08/25 05:40:51