Bridging Gaps in VLMs with Agentic Systems - AI Tinkerers & Google Cloud: Agents Hackathon Toronto

Bridging Gaps in VLMs with Agentic Systems

A team of AI/ML engineers and data scientists from Sharpestminds, CIBC, and Pythian, with expertise in real-time LLMs and agentic AI, Graph RAG, Android, CV, RL, and finance ML.

4 members

Presentation:
https://docs.google.com/presentation/d/1G7NQXXtuvWBXCMdWAjQ4qjxmRA2-lIYTq-6MZFMy1fo/edit?usp=sharing

Project Summary: Bridging Gaps in VLMs with Agentic Systems

Note: This project is research-focused.

This project tackles a core limitation of today’s Vision-Language Models (VLMs): their inability to reason reliably about fine-grained visual details such as object counting, orientation, or spatial relations. While SOTA VLMs like Google’s Gemini Robotics-ER 1.5 achieve strong results on general perception tasks, we show that they still falter in precise, grounded visual comprehension and reasoning.

Our approach introduces an agentic visual reasoning framework that augments VLMs with specialized computer vision tools and iterative thinking loops. By integrating a segmentation model that produces high-fidelity segmentation masks (Meta’s SAM) with a custom-built Deep Inspect tool, the system enables the agent to zoom into regions of interest, analyze segmented components, and reason step by step with code execution and multimodal context. This transforms zero-shot perception into an active inspection process, allowing for more accurate and interpretable decisions.
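The "zoom into regions of interest" step can be illustrated with a minimal sketch. This is not the team's Deep Inspect tool. It assumes the segmentation mask and the image are plain nested lists of the same shape, and the helper names `mask_bbox` and `zoom_to_mask` are hypothetical. The idea is simply to take a mask, find its bounding box, and crop a padded region that the agent could then inspect at higher effective resolution.

```python
from typing import List, Tuple

# Assumed representations for illustration only:
Mask = List[List[int]]    # 1 where the segmented object is, 0 elsewhere
Image = List[List[int]]   # grayscale pixel values, same shape as the mask


def mask_bbox(mask: Mask) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right), inclusive, of the nonzero mask pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask is empty")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def zoom_to_mask(image: Image, mask: Mask, pad: int = 1) -> Image:
    """Crop the image to the mask's bounding box, padded and clamped to the image."""
    top, left, bottom, right = mask_bbox(mask)
    height, width = len(image), len(image[0])
    top, left = max(0, top - pad), max(0, left - pad)
    bottom, right = min(height - 1, bottom + pad), min(width - 1, right + pad)
    return [row[left:right + 1] for row in image[top:bottom + 1]]


if __name__ == "__main__":
    # A 5x6 toy image whose pixel value encodes its (row, column) position.
    example_image = [[r * 10 + c for c in range(6)] for r in range(5)]
    example_mask = [[0] * 6 for _ in range(5)]
    example_mask[2][3] = example_mask[3][3] = example_mask[3][4] = 1
    print(zoom_to_mask(example_image, example_mask, pad=0))  # [[23, 24], [33, 34]]
```

In a real pipeline the mask would come from the segmentation model and the crop would be passed back to the VLM as additional multimodal context. The padding keeps some surrounding pixels so that spatial relations to neighboring objects remain visible.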

Deployed using Google’s Agent Development Kit (ADK) on Vertex AI Agent Engine, the system demonstrates how combining agentic reasoning with specialized vision models bridges key gaps in current VLMs — paving the way for robust, tool-enhanced multimodal AI that sees and reasons more like humans.

We believe such a framework is the way forward for advancing VLMs. Specialized models can be used first as tools, and the capabilities they provide can then be reinforced into the VLMs through verifiable rewards or human-in-the-loop (HITL) feedback and incorporated as native capabilities. Moreover, specialized CV models allow for true probabilistic estimation, grounding the model and paving the way for safe use.
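To make the idea of a verifiable reward concrete, here is a minimal sketch for the object-counting case. It assumes a reference count is available for each image (for example from a human label or a trusted tool), and the function names `count_reward` and `mean_reward` are hypothetical. It is an illustration of the concept, not a training setup used in this project.

```python
from typing import Sequence


def count_reward(predicted_count: int, reference_count: int) -> float:
    """Return 1.0 when the predicted count matches the reference exactly, else 0.0."""
    return 1.0 if predicted_count == reference_count else 0.0


def mean_reward(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Average the exact-match reward over a batch of counting questions."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    total = sum(count_reward(p, r) for p, r in zip(predictions, references))
    return total / len(predictions)
```

Because the reward is computed by comparison against a checkable answer rather than by another model's judgment, it can be verified independently, which is what makes it suitable as a reinforcement signal.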

Accomplishments:

  • Demonstrated the advantage of the proposed approach over current SOTA VLMs on several real-life examples
  • Deployed the Segment Anything Model (SAM) on Vertex AI
  • Deployed the Advanced Visual Agent built with ADK on Vertex AI Agent Engine
    • Based on the latest Gemini Robotics-ER 1.5
    • Note: Some deployment issues remain (responses are occasionally too large)

Prior to the hackathon, we only did research. The ADK code was written during the hackathon.