Google has announced new Agentic Vision capabilities for its Gemini 3 Flash model, introducing a system that combines visual reasoning with code execution to improve image understanding accuracy.
Unlike traditional vision models that process images in a single static pass, Agentic Vision enables Gemini 3 Flash to analyze visuals through a multi-step, agent-driven process. The model can zoom into specific regions of an image, examine details, and execute code to validate its observations.
According to Google, enabling code execution within Gemini 3 Flash leads to a 5% to 10% improvement across multiple computer vision benchmarks.
Agentic Vision operates through a structured “Think, Act, Observe” loop. In the thinking phase, the model analyzes the user query and image to generate a multi-step plan. During the action phase, it generates and executes Python code to process or analyze the image. In the observation phase, the transformed visual data is reintroduced into the model’s context window before producing a final response.
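To make the loop concrete, the control flow can be sketched in a few lines of Python. Everything here is hypothetical: the VisionAgent class and its method names are invented for illustration and are not part of any Google API; the sketch only mirrors the three phases described above.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Think-Act-Observe loop; none of these names
# come from Google's implementation.

@dataclass
class VisionAgent:
    context: list = field(default_factory=list)  # working context window

    def think(self, query: str, image: bytes) -> list[str]:
        # Thinking phase: derive a multi-step plan from the query and image.
        # In the real system the model itself writes this plan.
        return [f"zoom into region relevant to: {query}", "read values", "compute"]

    def act(self, step: str, image: bytes) -> bytes:
        # Action phase: generate and execute Python that transforms the image
        # (cropping, enhancement, parsing). Here it is a stand-in no-op.
        return image

    def observe(self, result: bytes) -> None:
        # Observation phase: reintroduce the transformed visual data
        # into the context before the next step.
        self.context.append(result)

    def answer(self, query: str, image: bytes) -> str:
        for step in self.think(query, image):
            self.observe(self.act(step, image))
        return f"response grounded in {len(self.context)} intermediate observations"
```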
Google stated that Gemini 3 Flash can now go beyond basic image description by directly drawing on canvases, zooming into fine details, parsing dense tables, and performing arithmetic operations based on visual inputs.
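As an illustration of the kind of generated code this implies, the snippet below zooms into a table region of an image and sums values read from it. The crop coordinates and numbers are fabricated for the example, and Pillow stands in for whatever image tooling actually runs in the model's sandbox.

```python
from PIL import Image

# Stand-in for the input image; a real run would load the user's picture.
image = Image.new("RGB", (1024, 768), "white")

# "Zoom" into the region holding a dense table, then enlarge it so that
# fine details become legible.
table_region = image.crop((100, 200, 600, 500))  # (left, top, right, bottom)
zoomed = table_region.resize((table_region.width * 2, table_region.height * 2))

# Values the model would have read from the zoomed region (fabricated here),
# followed by arithmetic grounded in that visual input.
row_totals = [1250.0, 980.5, 1410.25]
print(sum(row_totals))  # 3640.75
```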
The new capability is rolling out in the Gemini app via the Thinking model, with developer access available through the Gemini API in Google AI Studio and on Vertex AI.
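For developers, a request of roughly this shape through the google-genai Python SDK enables the code-execution tool alongside an image input. The model ID string is an assumption, since the article does not give the exact identifier; verify it against the current documentation.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("dense_table.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",  # assumed model ID; check the docs for the released name
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Sum the rightmost column of this table.",
    ],
    config=types.GenerateContentConfig(
        # Enables the sandbox the model uses to write and run analysis code.
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)
print(response.text)
```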
Source: Webrazzi