A dataset that connects structured image concepts to language, providing detailed annotations of objects, attributes, and relationships within images to support scene understanding.
Visual Genome is a densely annotated vision–language dataset that connects images with objects, attributes, relationships, region descriptions, and question–answer pairs. Developed by researchers at Stanford University, it organizes these annotations into structured scene-graph representations that go beyond simple object labels, capturing detailed scene understanding by modeling how objects relate to one another within images.
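The annotations are distributed as JSON files. The sketch below shows how one might extract (subject, predicate, object) triples from a Visual Genome-style relationships file. The file name `relationships.json` matches the official release, but the exact field names (`predicate`, `subject`, `object`, `name`/`names`) are assumptions based on common versions of the dataset and may differ by release.

```python
import json

def first_name(node):
    """Return an object's label; some releases use "name" (str),
    others "names" (list). Hedged fallback for either layout."""
    return node.get("name") or (node.get("names") or ["?"])[0]

def load_triples(path):
    """Yield (subject, predicate, object) triples from a
    Visual Genome-style relationships file.

    Assumed record layout (check your release):
    [{"image_id": 1, "relationships": [
        {"predicate": "on",
         "subject": {"names": ["cat"], "x": 10, "y": 20, "w": 50, "h": 40},
         "object":  {"names": ["mat"], "x": 5,  "y": 60, "w": 90, "h": 30}}]}]
    """
    with open(path) as f:
        records = json.load(f)
    for record in records:
        for rel in record.get("relationships", []):
            yield first_name(rel["subject"]), rel["predicate"], first_name(rel["object"])

if __name__ == "__main__":
    for triple in load_triples("relationships.json"):
        print(triple)  # e.g. ('cat', 'on', 'mat')
```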
Visual Genome is used to train and evaluate models on complex visual reasoning, scene graph generation, and multimodal understanding. It supports tasks such as visual question answering, relationship detection, and grounded language learning. The dataset helps models move beyond object recognition toward deeper semantic and relational understanding of visual scenes.
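As a concrete picture of what scene graph generation targets, the sketch below assembles a minimal in-memory scene graph from triples like those above. The adjacency-map representation is illustrative only, not an official Visual Genome API, and the example triples are hypothetical.

```python
from collections import defaultdict

def build_scene_graph(triples):
    """Build a simple adjacency-map scene graph from
    (subject, predicate, object) triples.

    Returns: {subject: [(predicate, object), ...], ...}
    Illustrative structure, not the dataset's own format.
    """
    graph = defaultdict(list)
    for subj, pred, obj in triples:
        graph[subj].append((pred, obj))
    return graph

# Hypothetical triples of the kind a scene graph generator must predict.
triples = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("horse", "on", "grass"),
]
graph = build_scene_graph(triples)
print(graph["man"])  # [('riding', 'horse'), ('wearing', 'hat')]
```

A model evaluated on relationship detection is scored on how many such ground-truth triples it recovers from the raw image, which is why this relational structure, rather than a flat label set, is the dataset's central contribution.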
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)