A model that integrates OpenAI's CLIP ViT-B/16 visual encoder with large language models to enhance cross-modal understanding and retrieval.
LLM2CLIP-Openai-B-16 pairs OpenAI's CLIP ViT-B/16 visual encoder with a large language model to improve the textual discriminability of the output embeddings. By fine-tuning the LLM in the caption space with contrastive learning, the model addresses the context-window limitations of the traditional CLIP text encoder, allowing it to process longer and more complex textual inputs. The result is stronger performance on cross-modal tasks such as image-text retrieval and zero-shot classification.
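The sketch below illustrates how the visual encoder might be loaded and used to embed an image for retrieval. It is a minimal, hedged example assuming the model is distributed on Hugging Face under the id "microsoft/LLM2CLIP-Openai-B-16" with custom remote code; the exact entry points (notably get_image_features) may differ in the official repository.

```python
# Hypothetical usage sketch. The repository id, the trust_remote_code
# loading path, and the get_image_features method are assumptions based
# on common Hugging Face conventions, not a verified LLM2CLIP API.
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model_name = "microsoft/LLM2CLIP-Openai-B-16"  # assumed repository id

# CLIPImageProcessor handles resizing/normalization for ViT-B/16 inputs.
processor = CLIPImageProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # assumed: model ships custom modeling code
).eval()

image = Image.open("example.jpg")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    # Assumed method name; the custom code may expose a different
    # entry point for the visual encoder.
    image_features = model.get_image_features(pixel_values.half())

# Embeddings are typically L2-normalized before cosine-similarity
# retrieval against text embeddings produced by the LLM side.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
print(image_features.shape)
```

For retrieval, text embeddings would be produced by the fine-tuned LLM text branch and compared against these image embeddings by cosine similarity.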
Apache 2.0
Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu
vision foundation model, feature backbone
Other
Open
Sector Agnostic
20/08/25 05:42:40