Bridging Gaps in VLMs with Agentic Systems
Team of AI/ML engineers and data scientists from Sharpestminds, CIBC, and Pythian, with expertise in real-time LLMs and agentic AI, Graph RAG, Android, CV, RL, and finance ML.
Project Description
Presentation:
https://docs.google.com/presentation/d/1G7NQXXtuvWBXCMdWAjQ4qjxmRA2-lIYTq-6MZFMy1fo/edit?usp=sharing
Project Summary: Bridging Gaps in VLMs with Agentic Systems
Note: The project is research-focused.
This project tackles a core limitation of today’s Vision-Language Models (VLMs): their inability to reason reliably about fine-grained visual details such as object counting, orientation, or spatial relations. While SOTA VLMs like Google’s Gemini Robotics-ER 1.5 achieve strong results on general perception tasks, we show that they still falter in precise, grounded visual comprehension and reasoning.
Our approach introduces an agentic visual reasoning framework that augments VLMs with specialized computer vision tools and iterative thinking loops. By integrating a segmentation model that produces high-fidelity segmentation masks (Meta’s SAM) and a custom-built Deep Inspect tool, the system enables the agent to zoom into regions of interest, analyze segmented components, and reason step-by-step with code execution and multimodal context. This transforms zero-shot perception into an active inspection process, allowing for more accurate and interpretable decisions.
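To make the "zoom into a region of interest" step concrete, here is a minimal sketch of how a segmentation mask could be turned into a padded crop that is handed back to the model for closer inspection. It is an illustration only, not the Deep Inspect tool itself: the names `mask_bbox` and `zoom_crop`, the list-of-lists image representation, and the `pad` parameter are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]            # binary mask, 1 = object pixel (assumed format)
Image = List[List[int]]           # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]   # (top, left, bottom, right); bottom/right exclusive


def mask_bbox(mask: Mask, pad: int = 0) -> Optional[Box]:
    """Return the padded bounding box of the mask's nonzero pixels.

    Returns None when the mask is empty. Hypothetical helper for illustration.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    height, width = len(mask), len(mask[0])
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    # Expand by `pad` pixels for context, clamped to the image bounds.
    top = max(rows[0] - pad, 0)
    left = max(cols[0] - pad, 0)
    bottom = min(rows[-1] + 1 + pad, height)
    right = min(cols[-1] + 1 + pad, width)
    return top, left, bottom, right


def zoom_crop(image: Image, box: Box) -> Image:
    """Crop the image to the given box so the region can be inspected on its own."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]
```

In an agent loop, each crop produced this way would typically be upscaled and passed back to the VLM as additional multimodal context, so the model reasons over one segmented component at a time instead of the full scene.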
Deployed using Google’s Agent Development Kit (ADK) on Vertex AI Agent Engine, the system demonstrates how combining agentic reasoning with specialized vision models bridges key gaps in current VLMs — paving the way for robust, tool-enhanced multimodal AI that sees and reasons more like humans.
We believe such a framework is the way forward for VLM advancement. Capabilities first provided by specialized models as tools can later be reinforced into the VLMs through verifiable rewards or HITL feedback and incorporated as native capabilities. Moreover, specialized CV models allow for true probabilistic estimation, grounding the model and paving the way for safe use.
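As one illustration of what probabilistic estimation could look like for object counting, the sketch below turns per-mask confidence scores into an expected count with an uncertainty. It assumes each score is a calibrated, independent probability that the mask is a real object; raw segmentation scores are not guaranteed to satisfy this, and the function name `count_estimate` is hypothetical.

```python
from math import sqrt
from typing import Sequence, Tuple


def count_estimate(scores: Sequence[float]) -> Tuple[float, float]:
    """Expected object count and its standard deviation.

    Treats each detection as an independent Bernoulli trial with success
    probability equal to its score (an assumption for this sketch), so the
    count follows a Poisson binomial distribution:
    mean = sum(p_i), variance = sum(p_i * (1 - p_i)).
    """
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"score {p} is not a probability in [0, 1]")
    mean = sum(scores)
    variance = sum(p * (1.0 - p) for p in scores)
    return mean, sqrt(variance)
```

A wide standard deviation relative to the mean is a signal that the agent should inspect the ambiguous masks more closely before committing to an answer.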
Accomplishments:
- Demonstrated the advantage of the proposed approach over current SOTA VLMs on several real-life examples
- Deployed Segment Anything Model on Vertex AI
- Deployed the Advanced Visual Agent built with ADK on Vertex AI Agent Engine
- Based on the latest Gemini Robotics-ER 1.5
- Note: Some deployment issues remain (responses are occasionally too large)
Prior Work
We only did research prior to the hackathon. The ADK code was written during the hackathon.
Team
Products & Tools
Additional Links
Presentation