SCOPE: Real-Time Natural Language Camera Agent at the Edge

Source: arXiv. 9 pages, 4 figures, 6 tables. Accepted at HRI '26 (21st ACM/IEEE International Conference on Human-Robot Interaction), Edinburgh, Scotland, March 16-19, 2026. Code: https://github.com/HindsboNikolaj/SCOPE

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and must be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark, run in the simulation environment with realistic PTZ control affordances, that spans QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception sides consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation. Together, these results identify a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.

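The abstract describes scoring execution traces for latency, accuracy, and error modes. As a rough illustration only, the sketch below shows one way per-task traces could be aggregated into those three metrics. The `Trace` schema, the `summarize` helper, and the error-mode labels are all hypothetical and are not taken from the SCOPE codebase. The `correct` field stands in for whatever verdict a judge model would return.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Trace:
    """One task's execution record (hypothetical schema, not SCOPE's)."""
    task_type: str                    # e.g. "counting", "ocr"
    latency_s: float                  # end-to-end wall-clock time in seconds
    correct: bool                     # stand-in for a judge verdict
    error_mode: Optional[str] = None  # assumed label; None when correct


def summarize(traces):
    """Aggregate accuracy, median latency, and error-mode counts."""
    if not traces:
        raise ValueError("no traces to summarize")
    accuracy = sum(t.correct for t in traces) / len(traces)
    # Count error labels only on tasks the judge marked incorrect.
    errors = Counter(
        t.error_mode for t in traces if not t.correct and t.error_mode
    )
    return {
        "n": len(traces),
        "accuracy": accuracy,
        "median_latency_s": median(t.latency_s for t in traces),
        "error_modes": dict(errors),
    }


# Toy data for illustration; these are not results from the paper.
traces = [
    Trace("counting", 1.0, True),
    Trace("ocr", 3.0, False, "wrong_tool"),
    Trace("qa", 2.0, True),
    Trace("spatial", 4.0, False, "hallucination"),
]
summary = summarize(traces)
print(summary)
```

Running one such summary per planner-perception pairing would give a table that can be compared across configurations. The median is used here as a simple latency statistic; a real evaluation might also report tail percentiles.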