LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control

Current computer-use benchmarks primarily focus on software operation tasks in virtualized systems, whereas scientific instrumentation scenarios require coordinated control over complex interfaces and feedback-driven parameter adjustment. However, directly evaluating agents on physical high-precision instruments is impractical due to high cost, safety risks, limited accessibility, and the difficulty of ensuring reproducible evaluation. This motivates a simulated yet realistic testbed that preserves the operational challenges of scientific instruments while enabling scalable and safe benchmarking. To this end, we introduce LabOSBench, a challenging benchmark for multimodal GUI agents built on a suite of web-based scientific-instrument simulators. Because it runs directly in a browser, LabOSBench avoids resource-heavy OS virtualization while supporting flexible task configuration and execution-based evaluation. Specifically, LabOSBench comprises 96 subtasks across eight instrument simulators, covering workflows that span sample loading, alignment, parameter tuning, data acquisition, and result inspection. We evaluate general-purpose vision-language models, specialized GUI agent models, and advanced agentic frameworks at both the subtask and end-to-end levels. Our experiments reveal that while existing agents can complete many structured GUI subtasks, they still struggle with feedback-driven operations and long-horizon workflow execution. Overall, LabOSBench provides a reproducible, low-cost testbed for advancing computer-use agents toward scientific-instrument control.
