Rapid advances in multimodal large language models and reinforcement fine-tuning have driven remarkable progress in AI-driven human-GUI interaction automation, yet a fundamental challenge persists: the interaction logic of existing agents deviates significantly from natural human-GUI communication patterns. To close this gap, we propose "Blink-Think-Link" (BTL), a brain-inspired framework that mimics the cognitive process by which humans interact with graphical interfaces. The framework decomposes each interaction into three biologically plausible phases: (1) Blink: rapid detection of, and attention to, relevant screen areas, analogous to saccadic eye movements; (2) Think: higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link: generation of executable commands for precise motor control, emulating human action selection mechanisms. We further introduce two key technical innovations for the BTL framework: (1) Blink Data Generation, an automated annotation pipeline optimized for blink data, and (2) BTL Reward, the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building on this framework, we develop a GUI agent model named BTL-UI, which achieves competitive performance on both static GUI understanding and dynamic interaction tasks across comprehensive benchmarks. These results provide empirical validation of the framework's efficacy for developing advanced GUI agents.
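To make the idea of a rule-based reward driven by both process and outcome more concrete, the following is a minimal illustrative sketch and not the BTL Reward itself, whose exact rules the abstract does not specify. Everything in it is an assumption made for illustration: the `<blink>`, `<think>`, and `<link>` tag layout, the `click(x, y)` action syntax, the `Target` record, the function name `btl_style_reward`, and the weights. The sketch scores an agent output in three parts: whether it follows the three-phase structure, how well the attended region overlaps an annotated region (process), and whether the emitted action matches the annotated action (outcome).

```python
import re
from dataclasses import dataclass

# Hypothetical output layout (an assumption, not taken from the paper):
#   <blink>x1,y1,x2,y2</blink><think>free text</think><link>click(x,y)</link>
PATTERN = re.compile(
    r"^\s*<blink>(?P<blink>.*?)</blink>\s*"
    r"<think>(?P<think>.*?)</think>\s*"
    r"<link>(?P<link>.*?)</link>\s*$",
    re.DOTALL,
)
NUMBER = r"-?\d+(?:\.\d+)?"


@dataclass
class Target:
    """Hypothetical annotation: the relevant region and the expected action."""
    box: tuple   # (x1, y1, x2, y2)
    action: str  # e.g. "click"


def parse_box(text):
    """Return (x1, y1, x2, y2) if the text holds exactly four numbers."""
    nums = re.findall(NUMBER, text)
    return tuple(float(n) for n in nums) if len(nums) == 4 else None


def parse_action(text):
    """Return (name, x, y) for text shaped like 'click(30, 30)', else None."""
    m = re.fullmatch(rf"\s*(\w+)\(\s*({NUMBER})\s*,\s*({NUMBER})\s*\)\s*", text)
    return (m.group(1), float(m.group(2)), float(m.group(3))) if m else None


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def btl_style_reward(output, target, w_format=0.2, w_blink=0.3, w_link=0.5):
    """Combine format, process (Blink), and outcome (Link) terms.

    The weights are arbitrary illustrative values.
    """
    m = PATTERN.match(output)
    if m is None:
        return 0.0  # output does not follow the three-phase structure
    reward = w_format

    # Process term: overlap between the attended and the annotated region.
    box = parse_box(m.group("blink"))
    if box is not None:
        reward += w_blink * iou(box, target.box)

    # Outcome term: correct action type, with the point inside the target box.
    act = parse_action(m.group("link"))
    if act is not None:
        name, x, y = act
        x1, y1, x2, y2 = target.box
        if name == target.action and x1 <= x <= x2 and y1 <= y <= y2:
            reward += w_link
    return reward


target = Target(box=(10, 10, 50, 50), action="click")
good = "<blink>10,10,50,50</blink><think>tap the button</think><link>click(30,30)</link>"
missed = "<blink>10,10,50,50</blink><think>tap the button</think><link>click(100,100)</link>"
print(btl_style_reward(good, target), btl_style_reward(missed, target))
```

Because every term is computed by a fixed rule instead of a learned reward model, a signal of this kind is cheap to evaluate during reinforcement fine-tuning. The process term also gives partial credit when the agent attends to the right region but emits the wrong action, which an outcome-only reward would not.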