Phi-3-vision-128k-instruct is a state-of-the-art multimodal model by Microsoft, designed to process both text and visual inputs with a context length of up to 128,000 tokens.
As part of the Phi-3 model family, it combines text and vision modalities to perform complex reasoning over long contexts. The model is trained on high-quality, reasoning-rich datasets, including synthetic data and filtered publicly available web content, and has undergone supervised fine-tuning and direct preference optimization to improve instruction-following accuracy and safety. It excels at tasks such as image captioning, visual question answering, and document analysis, and is available on platforms such as Hugging Face and Azure AI Foundry, supporting applications that require deep multimodal comprehension.
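Since the model is published on the Hugging Face hub, a minimal visual-question-answering sketch illustrates typical usage. This assumes the transformers library with trust_remote_code=True (the repository ships its own processing code) and a CUDA device; the image URL below is a hypothetical placeholder, not something referenced on this page.

```python
# Minimal VQA sketch for Phi-3-vision-128k-instruct.
# Assumes: a recent transformers release, a CUDA device, and network access;
# the image URL is a placeholder.
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

# trust_remote_code is required: the Hub repository ships custom model/processor code.
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Images are referenced in the prompt via numbered placeholder tokens.
messages = [{"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"}]
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

generate_ids = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=False,
    eos_token_id=processor.tokenizer.eos_token_id,
)
# Strip the prompt tokens so only the model's answer is decoded.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```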
License: MIT
Developer: Microsoft
Model Type: Multimodal Language Model
Access: Open
Sector: Sector Agnostic