A model that merges LLMs with the EVA02-B-16 visual encoder to boost cross-modal task performance.
LLM2CLIP-EVA02-B-16 is part of the LLM2CLIP series, which enhances CLIP models by integrating a Large Language Model with the EVA02-B-16 visual encoder. This combination lets the model handle longer, more detailed textual descriptions, improving performance on tasks such as image-text retrieval and zero-shot image classification. The approach leverages the strengths of both LLMs and advanced visual encoders to push the boundaries of cross-modal understanding.
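As a rough illustration of the retrieval task described above, the sketch below scores images against candidate captions with cosine similarity, the standard CLIP-style matching rule. The tensor shapes, the 512-dimensional embedding size, and the `retrieve` helper are placeholders rather than the model's actual API; in practice the image features would come from the EVA02-B-16 visual branch and the text features from the LLM-based text branch.

```python
# Minimal sketch of CLIP-style image-text retrieval scoring (illustrative only).
import torch
import torch.nn.functional as F

def retrieve(image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Return, for each image, the index of its best-matching caption."""
    # L2-normalize so the dot product equals cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # Similarity matrix: rows are images, columns are candidate captions.
    sims = image_feats @ text_feats.T
    return sims.argmax(dim=-1)

# Placeholder embeddings: 4 images and 8 candidate captions in a 512-d space.
images = torch.randn(4, 512)
captions = torch.randn(8, 512)
print(retrieve(images, captions))  # tensor of 4 caption indices
```

Zero-shot classification works the same way: class names are turned into captions (e.g., "a photo of a dog"), encoded by the text branch, and each image is assigned the highest-scoring class.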
Apache 2.0
Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu
vision foundation model, feature backbone
Other
Open
Sector Agnostic
20/08/25 05:42:10