A lightweight multimodal AI model that processes text, image, and audio inputs, optimized for multilingual reasoning, speech recognition, vision-language tasks, and generative AI applications.
Phi-4-Multimodal-Instruct is an advanced multimodal foundation model developed by Microsoft, designed to integrate language, vision, and speech for research and commercial applications. It builds on the Phi-3.5 and Phi-4 models, supports a 128K-token context length, and incorporates supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback (RLHF) to improve performance and safety.

Key Features:

Supported modalities:
- Text: 24 languages, including Arabic, Chinese, English, French, Spanish, and more.
- Vision: optimized for English image understanding.
- Audio: speech processing in English, Chinese, German, French, Italian, Japanese, Spanish, and Portuguese.

Enhanced capabilities:
- Speech recognition and speech translation (outperforms WhisperV3 and SeamlessM4T).
- Strong reasoning in math, logic, and general knowledge.
- Vision-language understanding (chart/table comprehension, optical character recognition).
- Multi-image comparison and summarization.
- Speech summarization and question answering.
- Function and tool calling for AI agents.

State-of-the-art performance:
- Ranked #1 on the Hugging Face OpenASR leaderboard for speech recognition (March 2025).
- Surpasses models such as Gemini-1.5-Pro and InternOmni-7B on vision benchmarks.

Optimized for real-world applications:
- Runs in memory-constrained and low-latency environments.
- Trained on 5 trillion text tokens, 2.3 million hours of speech, and 1.1 trillion image-text tokens.

Intended Uses:

Phi-4-Multimodal-Instruct is designed for broad multilingual and multimodal research and commercial applications, including:
1. General AI assistants for reasoning and knowledge retrieval.
2. Speech AI for transcription, translation, and summarization.
3. Computer vision AI for image-text comprehension and optical character recognition (OCR).
4. Medical AI research for language-vision understanding.
5. Education and coding AI for knowledge-based tasks.
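Example Usage:

Since the model targets standard inference stacks, a common way to try it is through Hugging Face transformers. The sketch below shows single-image inference under some assumptions: that the checkpoint is published as microsoft/Phi-4-multimodal-instruct, that the repository ships custom modeling code (hence trust_remote_code=True), and that prompts use <|user|>/<|assistant|> turn tags with an <|image_1|> placeholder. Consult the official model card for the exact prompt template and processor arguments.

```python
# Minimal vision-language inference sketch for Phi-4-Multimodal-Instruct.
# Assumptions (verify against the official model card): the Hugging Face repo id,
# the need for trust_remote_code, and the <|user|>/<|image_1|>/<|assistant|> format.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",   # let transformers pick fp16/bf16 where supported
    device_map="auto",    # place weights on available GPU(s) or fall back to CPU
)

# Any RGB image works; the URL here is a hypothetical example.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

# Assumed chat format: a user turn with an image placeholder, then the assistant turn.
prompt = "<|user|><|image_1|>Summarize the key trend shown in this chart.<|end|><|assistant|>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the echoed prompt.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

Audio inputs follow the same pattern in the published examples, with an <|audio_1|> placeholder in the prompt and the waveform passed to the processor; again, the model card is the authoritative reference for exact argument names.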
License: MIT
Developer: Microsoft
Model Type: Multimodal Language Model
Availability: Open
Sector: Sector Agnostic