Vision-language models (VLMs) increasingly serve as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts; the agent scaffold remains static after deployment; and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and by compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver, either as directly concatenated context or as guided evidence, producing skill-bank updates that help with future questions. Across 4 video-QA benchmarks and 2 VLMs, VisualClaw cuts per-question API cost by 98% on average relative to full-frame upload and by 25.9% relative to the offline uniform 8-frame baseline, while improving accuracy in most settings, e.g., an average +3.85% and a peak +15.80% on EgoSchema with Gemini 3 Flash. To address the third gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, while reducing cost by 9.5% relative to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications: the cascade reduces a 1-hour streaming session from ~3,600 API uploads to only 5-20 calls, and self-evolution makes the agent well suited to personalized assistance.
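To make the cascaded-gate idea concrete, the following is a minimal sketch and not the VisualClaw implementation. The abstract does not specify the gate's stages, so everything here is an assumption chosen for illustration: the two signals (a cheap mean pixel difference, then a costlier histogram distance), the thresholds `t1` and `t2`, and all function names. The point it shows is the cascade structure itself: most frames are rejected by the cheap stage, and only frames that pass both stages would be uploaded.

```python
from typing import List, Sequence

# Hypothetical frame type: a flattened grayscale frame with values in [0, 1].
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Stage-1 signal (assumed): cheap average per-pixel change."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def histogram(frame: Frame, bins: int = 4) -> List[float]:
    """Normalized intensity histogram used by the costlier stage 2."""
    counts = [0] * bins
    for v in frame:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(frame) for c in counts]


def hist_distance(a: Frame, b: Frame) -> float:
    """Stage-2 signal (assumed): total variation distance in [0, 1]."""
    ha, hb = histogram(a), histogram(b)
    return 0.5 * sum(abs(x - y) for x, y in zip(ha, hb))


def cascaded_gate(frames: Sequence[Frame],
                  t1: float = 0.05, t2: float = 0.25) -> List[int]:
    """Return indices of frames that pass both stages.

    Each frame is compared with the last kept frame. Stage 1 runs on
    every frame; stage 2 runs only on frames that survive stage 1.
    The thresholds t1 and t2 are illustrative, not values from the paper.
    """
    if not frames:
        return []
    kept = [0]                # always keep the first frame
    last_kept = frames[0]
    for i in range(1, len(frames)):
        frame = frames[i]
        if mean_abs_diff(frame, last_kept) < t1:
            continue          # rejected by the cheap stage
        if hist_distance(frame, last_kept) < t2:
            continue          # rejected by the costlier stage
        kept.append(i)
        last_kept = frame
    return kept


# Toy stream: a static scene, slight noise, then a scene change.
stream = [[0.2] * 16] * 5 + [[0.22] * 16] + [[0.9] * 16] * 4
kept = cascaded_gate(stream)
print(kept)  # [0, 6]
```

In this toy stream only 2 of 10 frames pass the gate. The same structure is what would allow a long session to collapse to a handful of uploads: the ~3,600 figure is consistent with one upload per second over an hour, and a gate of this kind forwards a frame only when the scene differs enough from the last forwarded one.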