AI开发进阶⑤：多模态Agent实战——让AI能看见和操作

发布时间：2026/5/24 11:21:24

AI 开发进阶第5篇多模态 Agent 实战——让 AI 能看见和操作适合读者已读完基础9篇前④篇想让 Agent 不仅能对话还能看图、操作界面预计阅读时间40分钟作者AI小渔村前言文字不是一切前四篇讨论的都是基于文本的 Agent。但现实世界中大量的信息是图片、表格、界面用户发来一张截图问这个功能在哪收到一张发票图片需要提取信息想让 Agent 自动操作浏览器/桌面多模态 Agent让 AI 从能说会道进化到能看会做。一、多模态 Agent 的三种类型┌─────────────────────────────────────────────────────┐ │ 多模态 Agent 能力矩阵 │ ├─────────────────────────────────────────────────────┤ │ 1. 看懂图片 │ │ → 视觉理解截图、照片、图表 OCR │ │ │ │ 2. 结合视觉工具 │ │ → 视觉 Tool Use自动化操作界面 │ │ │ │ 3. 主动感知行动 │ │ → 视觉 Agent像人一样操作电脑 │ └─────────────────────────────────────────────────────┘二、第一层视觉理解Vision Understanding2.1 基础图片描述from openai import AsyncOpenAI from pathlib import Path import base64 class ImageDescriber: 图片描述器 def __init__(self, client: AsyncOpenAI): self.client client async def describe(self, image_path: str) - str: 描述图片内容 # 读取图片并转 base64 with open(image_path, rb) as f: image_data base64.b64encode(f.read()).decode() response await self.client.chat.completions.create( modelgpt-4o, messages[ { role: user, content: [ { type: image_url, image_url: { url: fdata:image/jpeg;base64,{image_data}, detail: high } }, { type: text, text: 请详细描述这张图片的内容 } ] } ], max_tokens500 ) return response.choices[0].message.content # 使用示例 client AsyncOpenAI() describer ImageDescriber(client) description await describer.describe(invoice.jpg) # 输出这是一张发票图片显示...2.2 进阶OCR 信息提取from dataclasses import dataclass from typing import Optional, List import re dataclass class ExtractedField: name: str value: str confidence: float class InvoiceExtractor: 发票信息提取器 def __init__(self, client: AsyncOpenAI): self.client client async def extract(self, image_path: str) - List[ExtractedField]: 从发票图片提取信息 with open(image_path, rb) as f: image_data base64.b64encode(f.read()).decode() # 设计 Schema 让输出更结构化 response await self.client.chat.completions.create( modelgpt-4o, messages[ { role: user, content: [ { type: image_url, image_url: { url: fdata:image/jpeg;base64,{image_data} } }, { type: text, text: 请从这张发票图片中提取以下信息并以 JSON 格式返回 { invoice_no: 发票号码, date: 开票日期, seller: 销售方, buyer: 购买方, amount: 金额, tax: 税额 } 如果某项信息无法识别请返回 N/A. } ] } ], max_tokens300, response_format{type: json_object} ) import json data json.loads(response.choices[0].message.content) return [ ExtractedField(namek, valuev, confidence1.0) for k, v in data.items() ] # 使用示例 extractor InvoiceExtractor(client) fields await extractor.extract(invoice.jpg) for field in fields: print(f{field.name}: {field.value}) # 输出 # invoice_no: 0112023001 # date: 2026-05-24 # seller: XXX 公司 # buyer: YYY 公司 # amount: 1000.00 # tax: 130.002.3 应用场景场景技能示例发票处理OCR 结构化自动提发票信息客服截图界面理解理解用户发的截图数据图表图表解读从截图提取数据手写文字OCR识别手写信件三、第二层视觉 Tool Use3.1 基础看懂界面后操作把视觉理解和工具调用结合起来from dataclasses import dataclass from typing import List, Optional from enum import Enum class UIAction(Enum): CLICK click TYPE type SCROLL scroll WAIT wait SELECT select dataclass class UIElement: type: str # button, input, link, etc. text: Optional[str] None selector: Optional[str] None xpath: Optional[str] None coordinates: Optional[tuple] None # (x, y) dataclass class UIAutomationResult: action: UIAction element: UIElement success: bool output: Optional[str] None class VisualAutomationAgent: 基于视觉的自动化 Agent def __init__(self, vision_client, automation_tools): self.client vision_client self.tools automation_tools # Playwright/Selenium 等 async def understand_and_act(self, task: str, screenshot_path: str) - str: 理解截图执行任务 # 1. 理解当前界面 interface_desc await self._describe_interface(screenshot_path) # 2. 结合任务决定下一步 action_plan await self._plan_action(task, interface_desc) # 3. 执行 result await self._execute_actions(action_plan) return result async def _describe_interface(self, screenshot_path: str) - str: 描述界面元素 with open(screenshot_path, rb) as f: image_data base64.b64encode(f.read()).decode() response await self.client.chat.completions.create( modelgpt-4o, messages[ { role: user, content: [ { type: image_url, image_url: { url: fdata:image/jpeg;base64,{image_data} } }, { type: text, text: 请描述这个界面包含的元素 1. 有哪些按钮请说出按钮上的文字 2. 有哪些输入框 3. 当前页面显示了什么内容 4. 哪里可能有我要找的内容 } ] } ], max_tokens500 ) return response.choices[0].message.content async def _plan_action(self, task: str, interface_desc: str) - dict: 决定下一步动作 prompt f当前界面描述 {interface_desc} 用户任务{task} 请决定需要做什么操作返回 JSON {{ action: click/type/scroll/wait, target_element: 元素的文字或描述, value: 如果是输入要输入什么 }} response await self.client.chat.completions.create( modelgpt-4o, messages[{role: user, content: prompt}], response_format{type: json_object} ) import json return json.loads(response.choices[0].message.content) async def _execute_actions(self, action_plan: dict) - str: 执行动作 action action_plan[action] if action click: # 点击元素 await self.tools.click(action_plan[target_element]) elif action type: # 输入文本 await self.tools.type(action_plan[target_element], action_plan[value]) elif action scroll: await self.tools.scroll(action_plan.get(direction, down)) # 等待页面加载 await self.tools.wait(1) return 操作完成3.2 进阶自动遍历查找class VisualFinder: 视觉元素查找器 def __init__(self, automation_agent: VisualAutomationAgent): self.agent automation_agent async def find_and_click(self, target_text: str, max_scrolls: int 10) - bool: 滚动查找并点击目标 for i in range(max_scrolls): # 截取当前界面 screenshot await self.agent.tools.screenshot() # 让模型判断有没有目标 found await self._check_if_found(target_text, screenshot) if found: # 点击 await self.agent.tools.click(target_text) return True # 没找到向下滚动 await self.agent.tools.scroll(down) await self.agent.tools.wait(0.5) return False async def _check_if_found(self, target: str, screenshot_path: str) - bool: 检查目标是否在当前界面 with open(screenshot_path, rb) as f: image_data base64.b64encode(f.read()).decode() response await self.agent.client.chat.completions.create( modelgpt-4o, messages[ { role: user, content: [ { type: image_url, image_url: { url: fdata:image/jpeg;base64,{image_data} } }, { type: text, text: f请判断这个界面是否包含 {target} 这个内容只回答 YES 或 NO。 } ] } ] ) return YES in response.choices[0].message.content.upper()四、第三层Desktop AgentmacOS/Windows4.1 基础屏幕读取控制class DesktopAgent: 桌面 Agent def __init__(self): self.screenshot_tool None # pyautogui / pygetwindow self.control_tool None async def execute_task(self, task: str) - str: 执行桌面任务 # 1. 截取屏幕 screenshot self._capture_screen() # 2. 理解当前状态 state await self._understand_state(screenshot) # 3. 规划动作 actions await self._plan_actions(task, state) # 4. 执行 for action in actions: await self._execute(action) return 任务完成 def _capture_screen(self) - bytes: 截取屏幕 import pyautogui return pyautogui.screenshot() async def _understand_state(self, screenshot) - str: 理解屏幕状态调用视觉模型 # 同前文的界面描述 pass async def _plan_actions(self, task: str, state: str) - list: 规划动作 pass async def _execute(self, action: dict): 执行动作 import pyautogui action_type action[type] if action_type click: pyautogui.click(action[x], action[y]) elif action_type type: pyautogui.typewrite(action[text]) elif action_type hotkey: pyautogui.hotkey(*action[keys]) elif action_type scroll: pyautogui.scroll(action[amount])4.2 高级根据描述操作class NaturalDesktopControl: 自然语言桌面控制 def __init__(self, llm, desktop_agent: DesktopAgent): self.llm llm self.agent desktop_agent async def natural_control(self, instruction: str) - str: 自然语言控制桌面 # 1. 截取当前屏幕 screenshot self.agent._capture_screen() # 2. 用 LLM 理解指令当前界面 actions await self._interpret_instruction(instruction, screenshot) # 3. 执行 for action in actions: await self.agent._execute(action) return f已执行{instruction} async def _interpret_instruction(self, instruction: str, screenshot) - list: 解析自然语言指令 prompt f你需要根据用户的自然语言指令控制电脑完成操作。当前屏幕见附件图片用户指令{instruction} 请确定需要做什么操作返回 JSON 数组 [ {{type: click, x: 100, y: 200}}, {{type: type, text: hello}}, {{type: hotkey, keys: [cmd, c]}} ] 可能的动作类型 - click: 点击 (x, y) 坐标 - type: 输入文本 - hotkey: 快捷键 - scroll: 滚动 - wait: 等待请只返回 JSON不要其他内容。 response await self.llm.chat([ {role: user, content: prompt} ], imagescreenshot) import json return json.loads(response.content)4.3 实用场景场景应用自动化测试自动操作 GUI 应用数据录入读图片/表格自动填表单批量处理批量处理文件会议纪要自动截屏 OCR 记录五、实战GPT-4o/Gemini 视觉模型对比5.1 支持的模型模型视觉输入支持场景价格GPT-4o✅ 128k截图、图表、照片$5/1M inpGemini 2.0 Flash✅ 10M视频理解免费Claude 3✅ 200k超长图像分析$3/1M inp通义千眼⏳国内待定5.2 代码对比# GPT-4o 视觉 response await openai.chat.completions.create( modelgpt-4o, messages[{ role: user, content: [ {type: image_url, image_url: {url: https://...}}, {type: text, text: 描述这个图片} ] }] ) # Gemini 2.0 Flash 视觉 response await genai.chat.completions.create( modelgemini-2.0-flash-exp, contents[ {role: user, parts: [ {inline_data: {mime_type: image/jpeg, data: ...}}, {text: 描述这个图片} ]} ] )5.3 选择建议快速原型GPT-4o稳定、效果好超高分辨率Claude 3支持最大免费/视频Gemini 2.0 Flash国内场景等通义千眼六、总结多模态 Agent 发展路线阶段1视觉理解现在 ↓ 学会看图片理解内容阶段2视觉工具本月 ↓ 看到 → 分析 → 操作阶段3桌面 Agent下个月 ↓ 像人一样操作电脑阶段4持续感知3个月后 ← 主动感知环境变化做出反应核心思想多模态是 Agent 从对话到行动的关键一步。文字 Agent 再强不会看图、不会操作界面应用场景就受限。多模态 Agent 让 AI 从能说会道变成能看会做。踩坑经验汇总图片质量很重要——模糊的图片严重影响识别效果坐标需要校准——不同分辨率下坐标不一样视觉理解有延迟——处理图片需要额外 1-3 秒模型幻觉——图片识别偶尔会脑补需要验证权限问题——桌面控制需要无障碍权限授权本篇代码https://github.com/dazhuang-zs/run_little_donkey/blob/master/docs/articles/ai-dev-advanced-05-multimodal-agent.mdAI 开发进阶系列10篇全文完系列回顾与下一步已完成的 10 篇序号标题状态基础 1-9API、KV Cache、Agent Loop、Tool Use、Reasoning、MCP、Memory/RAG、Multi-Agent、Prompt/ContextEngineering✅进阶 ①生产级 Agent 评估体系✅进阶 ②AI 系统可观测性✅进阶 ③推理加速与成本控制✅进阶 ④Context Engineering 深入✅进阶 ⑤多模态 Agent 实战✅进阶方向推荐项目驱动挑一个小项目用这 10 篇的内容做一个完整 Agent垂直领域选择一个领域客服、代码、数据深入研究该领域的 Agent 设计面试准备这 10 篇内容足够覆盖大多数 AI 开发工程师的面试题祝你在 AI 开发的路上越走越远作者AI小渔村许可署名-非商业性使用-相同方式共享

你的数字记忆正在消失？三步永久保存微信聊天记录

你的数字记忆正在消失？三步永久保存微信聊天记录【免费下载链接】WeChatMsg 提取微信聊天记录，将其导出成HTML、Word、CSV文档永久保存，对聊天记录进行分析生成年度聊天报告项目地址: https://gitcode.com/GitHub_Trending/we/WeChatMsg …

2026/5/24 11:21:04 阅读更多

如何快速掌握开源笔记工具：Xournal++ 终极使用指南

如何快速掌握开源笔记工具：Xournal 终极使用指南【免费下载链接】xournalpp Xournal is a handwriting notetaking software with PDF annotation support. Written in C with GTK3, supporting Linux (e.g. Ubuntu, Debian, Arch, SUSE), macOS and Windows 10. S…

2026/5/24 11:20:23 阅读更多

CompressO：免费开源视频压缩工具，让大文件轻松变小

CompressO：免费开源视频压缩工具，让大文件轻松变小【免费下载链接】compressO Convert any video/image into a tiny size. 100% free & open-source. Available for Mac, Windows & Linux. 项目地址: https://gitcode.com/gh_mirrors/co/com…

2026/5/24 11:20:03 阅读更多

中小学电子课本下载工具：3步解决教师备课资源获取难题

中小学电子课本下载工具：3步解决教师备课资源获取难题【免费下载链接】tchMaterial-parser 国家中小学智慧教育平台电子课本下载工具，帮助您从智慧教育平台中获取电子课本的 PDF 文件网址并进行下载，让您更方便地获取课本内容。项目地址…

2026/5/24 12:08:50 阅读更多

动力系统与机器学习融合：破解Sabra壳模型自相似爆破的非唯一性

1. 项目概述：当湍流奇点遇上动力系统与机器学习在流体动力学的世界里，有限时间奇点（Blowup）的形成一直是个迷人的谜题。想象一下，一个初始光滑的流体运动，在有限时间内，其速度或涡量等物理量突然…

2026/5/24 12:08:09 阅读更多

论文查重居然能白嫖？书匠策AI这个隐藏功能，99%的学生还不知道！

嗨，各位正在跟论文"死磕"的同学们，我是你们的论文写作科普博主。今天这期内容，专门讲一个大多数人都在花冤枉钱的事——论文查重。先问大家一个问题：你查一次重，花了多少钱？ 知网少说两三百…

2026/5/24 12:07:49 阅读更多

ZeroOmega代理管理工具：浏览器代理切换的终极解决方案

ZeroOmega代理管理工具：浏览器代理切换的终极解决方案【免费下载链接】ZeroOmega Manage and switch between multiple proxies quickly & easily. 项目地址: https://gitcode.com/gh_mirrors/ze/ZeroOmega 还在为繁琐的代理设置而烦恼吗？Ze…

2026/5/24 12:07:49 阅读更多

d2dx开源项目深度揭秘：如何用现代图形技术复活经典游戏的视觉体验

d2dx开源项目深度揭秘：如何用现代图形技术复活经典游戏的视觉体验【免费下载链接】d2dx D2DX is a complete solution to make Diablo II run well on modern PCs, with high fps and better resolutions. 项目地址: https://gitcode.com/gh_mirrors/d2/d2dx …

2026/5/24 12:07:07 阅读更多

如何在3分钟内完成Windows与Office批量激活：开源KMS工具完整指南

如何在3分钟内完成Windows与Office批量激活：开源KMS工具完整指南【免费下载链接】KMS_VL_ALL_AIO Smart Activation Script 项目地址: https://gitcode.com/gh_mirrors/km/KMS_VL_ALL_AIO 如果您正在寻找一个简单高效的Windows与Office批量激活解决方案&…

2026/5/24 12:06:47 阅读更多

施工现场安全事故预警准确率达94.6%？——解密某央企AI Agent边缘计算部署架构与3个月落地实录

更多请点击： https://codechina.net 第一章：施工现场安全事故预警准确率达94.6%？——解密某央企AI Agent边缘计算部署架构与3个月落地实录在华北某大型地铁盾构施工现场，一套轻量化AI Agent系统于2024年Q2完成全栈部署&#xff…

2026/5/24 0:01:12 阅读更多

附录 B：术语表

本术语表面向“从 MM 到 HMM”专栏阅读过程中的快速查阅。它不是内核 API 手册，而是把文章中反复出现的概念放到同一张地图上：先给出直观含义，再说明它在 Linux MM/HMM 语境里的作用。建议阅读方式： 初读专栏时，把它当…

2026/5/24 0:01:32 阅读更多

Midjourney渐变美学的神经渲染原理（附RGB-HSV-LCH三空间渐变映射对照表·行业首曝）

更多请点击： https://kaifayun.com 第一章：Midjourney渐变美学的神经渲染原理（附RGB-HSV-LCH三空间渐变映射对照表行业首曝） Midjourney 的渐变美学并非传统插值实现，而是由其隐式神经渲染器（Implicit Neu…

2026/5/24 0:02:33 阅读更多

施工现场安全事故预警准确率达94.6%？——解密某央企AI Agent边缘计算部署架构与3个月落地实录

2026/5/24 0:01:12 阅读更多

附录 B：术语表

2026/5/24 0:01:32 阅读更多

Midjourney渐变美学的神经渲染原理（附RGB-HSV-LCH三空间渐变映射对照表·行业首曝）

2026/5/24 0:02:33 阅读更多

MPC-BE：基于DirectShow架构的专业级开源媒体播放解决方案

MPC-BE：基于DirectShow架构的专业级开源媒体播放解决方案【免费下载链接】MPC-BE MPC-BE – универсальный проигрыватель аудио и видеофайлов для операционной системы Windows. 项目地址:…

2026/5/23 15:04:07 阅读更多

如何快速计算3D模型体积和重量：STL-Volume-Model-Calculator终极指南

如何快速计算3D模型体积和重量：STL-Volume-Model-Calculator终极指南【免费下载链接】STL-Volume-Model-Calculator STL Volume Model Calculator Python 项目地址: https://gitcode.com/gh_mirrors/st/STL-Volume-Model-Calculator 你是否曾经为3D打印项目…

2026/5/23 12:38:32 阅读更多

通过Taotoken CLI工具一键配置团队开发环境与模型密钥

通过Taotoken CLI工具一键配置团队开发环境与模型密钥 1. CLI工具安装与基本使用 Taotoken提供的CLI工具可通过npm全局安装或直接使用npx运行。对于需要频繁使用CLI的团队，推荐全局安装： npm install -g taotoken/taotoken对于临时使用或项目级配置&a…

2026/5/24 9:50:45 阅读更多

相关文章

你的数字记忆正在消失？三步永久保存微信聊天记录

如何快速掌握开源笔记工具：Xournal++ 终极使用指南

CompressO：免费开源视频压缩工具，让大文件轻松变小

中小学电子课本下载工具：3步解决教师备课资源获取难题

动力系统与机器学习融合：破解Sabra壳模型自相似爆破的非唯一性

论文查重居然能白嫖？书匠策AI这个隐藏功能，99%的学生还不知道！

ZeroOmega代理管理工具：浏览器代理切换的终极解决方案

d2dx开源项目深度揭秘：如何用现代图形技术复活经典游戏的视觉体验

如何在3分钟内完成Windows与Office批量激活：开源KMS工具完整指南

施工现场安全事故预警准确率达94.6%？——解密某央企AI Agent边缘计算部署架构与3个月落地实录

附录 B：术语表

Midjourney渐变美学的神经渲染原理（附RGB-HSV-LCH三空间渐变映射对照表·行业首曝）

施工现场安全事故预警准确率达94.6%？——解密某央企AI Agent边缘计算部署架构与3个月落地实录

附录 B：术语表

Midjourney渐变美学的神经渲染原理（附RGB-HSV-LCH三空间渐变映射对照表·行业首曝）

MPC-BE：基于DirectShow架构的专业级开源媒体播放解决方案

如何快速计算3D模型体积和重量：STL-Volume-Model-Calculator终极指南

通过Taotoken CLI工具一键配置团队开发环境与模型密钥