A biomedical vision-language foundation model trained on PMC-15M, using PubMedBERT as the text encoder and a Vision Transformer as the image encoder, optimized for cross-modal retrieval, image classification, and visual question answering in medical AI applications.
BiomedCLIP is a state-of-the-art biomedical vision-language model designed for multimodal learning in medical AI. Developed by Microsoft, it is pre-trained on PMC-15M, a dataset of 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central. The model combines:

1. PubMedBERT as the text encoder for domain-specific language understanding.
2. A Vision Transformer (ViT) as the image encoder, with adaptations for medical imaging tasks.

BiomedCLIP significantly outperforms prior vision-language models on a range of medical AI benchmarks and supports the following applications:

1. Cross-modal retrieval (text-to-image and image-to-text search).
2. Zero-shot classification of medical images (see the sketch after this description).
3. Visual question answering (VQA) in radiology and pathology.

Trained on a diverse range of medical imaging modalities, including radiography, microscopy, and histology, BiomedCLIP establishes new performance standards in biomedical vision-language tasks. However, the model is intended for research purposes only and is not suitable for clinical decision-making or commercial deployment. It serves as a valuable tool for AI researchers exploring multimodal medical applications in radiology, pathology, and beyond.
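As a rough illustration of the zero-shot classification and cross-modal retrieval workflow, the sketch below loads the publicly released checkpoint through the open_clip library and scores one image against candidate text labels. The Hugging Face hub identifier matches Microsoft's published release; the image path, label set, and prompt template are illustrative assumptions, not part of this model card.

```python
# Minimal zero-shot classification sketch for BiomedCLIP via open_clip.
# Assumptions: open_clip_torch and torch are installed; 'example_xray.png'
# is a placeholder image path; the labels and prompt template are
# illustrative choices, not prescribed by the model card.
import torch
import open_clip
from PIL import Image

HUB_ID = 'hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224'

# Load the pretrained model together with its matching image preprocessing.
model, preprocess = open_clip.create_model_from_pretrained(HUB_ID)
tokenizer = open_clip.get_tokenizer(HUB_ID)
model.eval()

# Candidate labels are embedded as short natural-language prompts.
labels = ['chest X-ray', 'brain MRI', 'histopathology slide']
texts = tokenizer([f'this is a photo of a {label}' for label in labels])
image = preprocess(Image.open('example_xray.png')).unsqueeze(0)

with torch.no_grad():
    # Embed both modalities into the shared space and L2-normalize.
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Cosine similarities double as retrieval scores; a softmax turns them
    # into zero-shot class probabilities for this image.
    similarity = image_features @ text_features.T
    probs = (100.0 * similarity).softmax(dim=-1).squeeze(0)

for label, p in zip(labels, probs.tolist()):
    print(f'{label}: {p:.3f}')
```

The same normalized embeddings support cross-modal retrieval: embed a pool of images once, then rank them by cosine similarity against a query text embedding (or vice versa for image-to-text search).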
MIT
Microsoft
Zero-Shot Image Classification
N.A.
Open
Healthcare, Wellness and Family Welfare
11/04/25 06:25:10
0