While passive agents merely follow instructions, proactive agents align with higher-level objectives, such as assistance and safety, by continuously monitoring the environment to determine when and how to act. However, the development of proactive agents is hindered by a lack of specialized resources. To address this, we introduce ProAct-75, a benchmark designed to train and evaluate proactive agents across diverse domains, including assistance, maintenance, and safety monitoring. Spanning 75 tasks, our dataset features 91,581 step-level annotations enriched with explicit task graphs. These graphs encode step dependencies and parallel-execution possibilities, providing the structural grounding necessary for complex decision-making. Building on this benchmark, we propose ProAct-Helper, a reference baseline powered by a Multimodal Large Language Model (MLLM) that grounds decision-making in state detection and leverages task graphs for entropy-driven heuristic search over actions, allowing agents to execute parallel threads independently rather than mirroring the human's next step. Extensive experiments demonstrate that ProAct-Helper outperforms strong closed-source models, improving trigger-detection mF1 by 6.21%, saving 0.25 more steps in online one-step decisions, and increasing the rate of parallel actions by 15.58%.
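To make the abstract's core mechanism concrete, the following is a minimal sketch of entropy-driven action selection over a task graph. It assumes a simple representation: each step lists its dependencies, dependency-satisfied steps form the pool of parallel candidates, and the agent prefers the candidate whose predicted outcome distribution has the lowest entropy. All function names, the graph encoding, and the outcome-probability input are illustrative assumptions, not the paper's actual implementation.

```python
import math

def available_steps(deps, done):
    """Return steps whose dependencies are all completed.

    `deps` maps each step name to the set of steps it depends on;
    `done` is the set of completed steps. Any returned step could be
    executed in parallel with the others (illustrative encoding).
    """
    return [s for s, pre in deps.items() if s not in done and pre <= done]

def entropy(probs):
    """Shannon entropy (nats) of a discrete outcome distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_step(deps, done, outcome_probs):
    """Entropy-driven heuristic (assumed form): among the currently
    available steps, choose the one whose predicted outcome
    distribution is least uncertain."""
    candidates = available_steps(deps, done)
    return min(candidates, key=lambda s: entropy(outcome_probs[s]))

# Hypothetical task graph: "cook" depends on two parallelizable steps.
deps = {"boil_water": set(), "chop_veg": set(), "cook": {"boil_water", "chop_veg"}}
done = set()
# Hypothetical MLLM-predicted outcome distributions per candidate step.
outcome_probs = {"boil_water": [1.0], "chop_veg": [0.5, 0.5]}
print(pick_step(deps, done, outcome_probs))  # boil_water (entropy 0 < 0.693)
```

Because the available-step pool is computed from the graph rather than from the human's current step, an agent using such a scheme can take an independent parallel thread (here, starting `boil_water` while the human chops vegetables) instead of mirroring the human's next action.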