A model that pairs OpenAI's CLIP ViT-L/14 (224px) visual encoder with a large language model to strengthen cross-modal understanding and retrieval.
LLM2CLIP-Openai-L-14-224 integrates OpenAI's L-14-224 visual encoder with a Large Language Model to improve the textual discriminability of the output embeddings. By fine-tuning the LLM in the caption space with contrastive learning, the model enhances performance on cross-modal tasks, including image-text retrieval and zero-shot classification. Because the fine-tuned LLM serves as the text encoder, the approach also accommodates longer and more complex captions, going beyond the context-window limits of traditional CLIP text encoders.
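For orientation, the sketch below shows how such a checkpoint would typically be queried for image-text retrieval or zero-shot classification: the visual encoder embeds the image, the paired contrastively fine-tuned LLM embeds the candidate captions, and candidates are ranked by cosine similarity. This is a minimal sketch under stated assumptions, not confirmed usage for this listing: the Hugging Face repo ids, the get_image_features/get_text_features methods exposed via trust_remote_code, the llm2vec dependency, and the availability of a CUDA GPU are all assumptions.

```python
# Hypothetical retrieval / zero-shot classification sketch for an LLM2CLIP checkpoint.
# Repo ids, the remote-code methods (get_image_features / get_text_features),
# and the llm2vec dependency are assumptions about how the weights are published.
import torch
from PIL import Image
from transformers import AutoConfig, AutoModel, AutoTokenizer, CLIPImageProcessor
from llm2vec import LLM2Vec

# Image preprocessing follows OpenAI's CLIP ViT-L/14 at 224 px.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_model = AutoModel.from_pretrained(
    "microsoft/LLM2CLIP-Openai-L-14-224",  # assumed Hugging Face repo id
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda").eval()

# Text side: the contrastively fine-tuned LLM acts as the caption encoder.
llm_name = "microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned"  # assumed repo id
config = AutoConfig.from_pretrained(llm_name, trust_remote_code=True)
llm = AutoModel.from_pretrained(llm_name, torch_dtype=torch.bfloat16,
                                config=config, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm.config._name_or_path = "meta-llama/Meta-Llama-3-8B-Instruct"  # let LLM2Vec treat it as Llama-3
l2v = LLM2Vec(llm, tokenizer, pooling_mode="mean", max_length=512)

captions = ["a diagram", "a dog", "a cat"]  # candidate captions / class prompts
image = Image.open("example.jpg")           # placeholder input image
pixels = processor(images=image, return_tensors="pt").pixel_values.to("cuda")

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = clip_model.get_image_features(pixels)
    caption_embeddings = l2v.encode(captions, convert_to_tensor=True).to("cuda")
    text_features = clip_model.get_text_features(caption_embeddings)

    # Cosine similarity between L2-normalized embeddings; the softmax turns the
    # scores into zero-shot classification probabilities over the captions.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Caption probabilities:", probs)
```

The same scoring also covers retrieval: instead of the softmax, rank a pool of caption embeddings per image (or image embeddings per caption) by the raw similarity scores.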
Apache 2.0
Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu
vision foundation model, feature backbone
Other
Open
Sector Agnostic
20/08/25 05:45:17
0