AI & ML

Google Adds ‘Agentic Vision’ to Gemini Flash 3 for Smarter Image Understanding

by Suraj Malik

Google has introduced a new capability called “Agentic Vision” to its Gemini Flash 3 model, aiming to significantly improve how the AI system understands and responds to images. The update, announced on January 28, 2026, is designed to reduce errors and hallucinations by allowing the model to actively inspect images in multiple steps instead of relying on a single visual pass.

According to Google, Agentic Vision changes image understanding from a static process into an iterative, step-by-step workflow, enabling the model to think, act, and observe before producing a final response. The company says this approach helps the model verify visual details more carefully, particularly in complex scenarios that involve counting, measurement, or visual arithmetic.

Moving Beyond Single-Glance Image Analysis

Traditional AI models typically analyse an image once and then generate a response based on that initial interpretation. If important details are missed during this first scan, the model may guess or hallucinate when answering follow-up questions.

Agentic Vision is designed to address this limitation. Google explained that Gemini Flash 3 can now plan a sequence of actions when responding to an image-based query. Instead of immediately replying, the model determines which areas of an image require closer inspection and then performs targeted actions—such as zooming into specific regions or annotating visual elements—before reaching a conclusion.

This “Think, Act, Observe” loop allows the model to re-check visual information and adjust its reasoning based on what it observes during each step.
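
Google describes the loop only at a conceptual level and has not published its internal implementation. As a rough illustration, the Python sketch below mirrors that Think, Act, Observe control flow; the planner is a hard-coded stand-in (in the real system, the model itself decides each step), and all function and field names here are hypothetical:

```python
from dataclasses import dataclass
from PIL import Image

@dataclass
class Action:
    kind: str                 # "zoom" or "answer"
    box: tuple | None = None  # region of interest for "zoom"
    text: str = ""            # final reply for "answer"

def plan_next_action(views: list, question: str, step: int) -> Action:
    """Stand-in planner: in the real system the model decides each step.
    This hard-coded two-step script exists only to show the control flow."""
    if step == 0:
        w, h = views[0].size
        return Action("zoom", box=(w // 4, h // 4, 3 * w // 4, 3 * h // 4))
    return Action("answer", text="(grounded answer goes here)")

def think_act_observe(image: Image.Image, question: str, max_steps: int = 5) -> str:
    views = [image]  # observations accumulated across steps
    for step in range(max_steps):
        action = plan_next_action(views, question, step)  # Think
        if action.kind == "answer":
            return action.text
        if action.kind == "zoom":                         # Act
            views.append(views[0].crop(action.box))
        # Observe: the new view joins the context for the next iteration.
    return "(no confident answer within the step budget)"
```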

Python Code Execution for Verification

A key feature of Agentic Vision is the ability to generate and execute Python code as part of the image analysis process. Google said this allows Gemini Flash 3 to replace probabilistic guessing with deterministic computation.

For example, when asked to count objects in an image, the model can generate code to draw bounding boxes, label each detected object, and then confirm the total count programmatically. The annotated image is then reintroduced into the model’s context, allowing it to visually confirm the result before responding.
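
Google has not published the code the model generates, but the idea is straightforward to illustrate. The sketch below assumes object detections are already available (the coordinates are invented for the example) and uses Pillow to draw numbered boxes, so the final count comes from the data rather than a guess:

```python
from PIL import Image, ImageDraw

# Invented detections standing in for the model's own analysis;
# each box is (left, top, right, bottom) in pixels.
detections = [
    (34, 120, 88, 210),
    (95, 115, 150, 205),
    (160, 118, 214, 208),
]

image = Image.open("objects.png").convert("RGB")
draw = ImageDraw.Draw(image)

for i, box in enumerate(detections, start=1):
    draw.rectangle(box, outline="red", width=3)           # bounding box
    draw.text((box[0], box[1] - 14), str(i), fill="red")  # numeric label

count = len(detections)  # deterministic: derived from data, not estimated
image.save("objects_annotated.png")
print(f"Detected {count} objects")
```

Saving the annotated file and feeding it back to the model as a new image closes the verification loop the article describes.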

Google said this capability is particularly useful for tasks involving visual math, measurements, or detailed inspections, where traditional models often struggle.

Reported Quality Improvements

According to Google, enabling code execution through Agentic Vision results in a 5–10% improvement in quality across vision-related tasks. The company said the approach helps reduce hallucinations that occur when models attempt to infer details without sufficient visual grounding.

Google's announcement notes that Agentic Vision is especially effective in situations where accuracy depends on examining small details or performing multiple reasoning steps, such as reading dense diagrams, counting objects, or validating spatial relationships.

Early Use Cases Highlighted by Google

Google cited several early use cases to demonstrate how Agentic Vision performs in real-world scenarios.

One example involved PlanCheckSolver, an AI-powered platform that validates building plans against regulatory requirements. According to Google, the platform uses Gemini Flash 3 to inspect high-resolution architectural drawings and verify parameters such as dimensions and layout details. By enabling code-based inspection and iterative image analysis, PlanCheckSolver reportedly achieved a 5% improvement in accuracy.

Another example showed Gemini Flash 3 being asked to count fingers in an image of a hand. Instead of guessing, the model used Python to visually mark each finger with bounding boxes and numbers, confirming the count before answering.

These examples illustrate how Agentic Vision allows the model to visually verify its own reasoning rather than relying on assumptions.

Availability Across Google Platforms

Agentic Vision is available immediately across multiple Google platforms. Developers can access the feature through the Gemini API, Google AI Studio, and Vertex AI. The capability is also available to users through the Gemini app, where it can be selected via the “Thinking” model option.
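
For developers, the announcement suggests this works like other tool-enabled requests. The sketch below uses the google-genai Python SDK with its code-execution tool; note that the model identifier "gemini-3-flash" is an assumption based on the article's naming and should be checked against the current model list:

```python
from google import genai
from google.genai import types

# Assumes a GEMINI_API_KEY environment variable is set.
client = genai.Client()

with open("hand.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",  # assumed id; verify against the published model list
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "How many fingers are visible? Annotate the image to verify before answering.",
    ],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)
print(response.text)
```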

Google said Gemini Flash 3 is designed to offer faster performance at lower cost compared to larger Gemini models, making it suitable for applications that require both speed and accuracy.

Planned Enhancements

Google also outlined several planned improvements to Agentic Vision. According to the announcement, the company intends to make image manipulation actions—such as zooming, rotating, or performing visual math—fully automatic, allowing the model to decide when these steps are necessary without explicit user instructions.
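
Until those actions are fully automatic, the manipulations Google describes are easy to express directly. A minimal sketch of the zoom and rotate steps with Pillow, using illustrative coordinates and an illustrative angle:

```python
from PIL import Image

image = Image.open("diagram.png")

# Zoom: crop a region of interest and upscale it for closer inspection.
region = (400, 250, 800, 550)  # illustrative (left, top, right, bottom)
zoomed = image.crop(region).resize(
    (2 * (region[2] - region[0]), 2 * (region[3] - region[1]))
)

# Rotate: straighten a skewed drawing before reading measurements from it.
straightened = image.rotate(-3.5, expand=True, fillcolor="white")

zoomed.save("diagram_zoomed.png")
straightened.save("diagram_straightened.png")
```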

In addition, Google plans to expand the set of tools available to Gemini models, including features like web search and reverse image search, to further enhance reasoning capabilities. Agentic Vision is also expected to be rolled out to other Gemini models beyond Flash in future updates.

Why Google Says It Matters

Google positioned Agentic Vision as a response to a core weakness in current AI systems: unreliable image reasoning in multi-step tasks. By combining visual inspection with code execution, the company says Gemini Flash 3 can deliver more grounded and verifiable responses.

The update reflects a broader shift toward agentic AI systems—models that can plan actions, use tools, and validate results rather than generating answers in a single step. Google said this approach is particularly important for professional and enterprise use cases where errors can have real-world consequences.

With Agentic Vision, Google is aiming to make image-based AI interactions more reliable by ensuring that models check their work before responding.