AFRAgent: An Adaptive Feature Renormalization Based High Resolution Aware GUI agent

Automation · Adaptive · High resolution · Normalization

December 11, 2025

Neeraj Anand, Rishabh Jain, Sohan Patnaik, Balaji Krishnamurthy, Mausoom Sarkar

From arXiv; accepted at the WACV 2026 conference.

Demand for mobile user interface (UI) automation is growing, driven by its broad applications across industries. With the advent of visual language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to executing tasks autonomously, streamlining automation workflows. Recent approaches use VLMs for this problem because they can 1) process on-screen content directly, 2) remain independent of device-specific APIs by using human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge to task understanding. However, these models often struggle to identify widgets accurately and to determine actions, because vision encoder features carry limited spatial information. In addition, top-performing models are often large, which requires extensive training and delays inference. In this work, we introduce AFRAgent, an InstructBLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization technique (a token-level affine transformation) that enriches low-resolution image embeddings and fuses in high-resolution details. We evaluate AFRAgent on the Meta-GUI and AITW benchmarks, establishing a new state of the art for smartphone automation.
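
The abstract characterizes adaptive feature renormalization only as a token-level affine transformation. A minimal sketch of that general idea follows, assuming each low-resolution token embedding is rescaled and shifted by vectors derived from a paired high-resolution feature. The function names, the linear projections `w_scale` and `w_shift`, and the `1 + scale` form are illustrative assumptions, not the authors' implementation.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weights: Matrix, vec: Vector) -> Vector:
    """Multiply a (d_out x d_in) matrix by a d_in-vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def affine_renormalize(
    low_res_tokens: List[Vector],
    high_res_tokens: List[Vector],
    w_scale: Matrix,
    w_shift: Matrix,
) -> List[Vector]:
    """Apply a per-token affine transform to low-resolution embeddings.

    Hypothetical formulation (an assumption, not taken from the paper):
        out_i = (1 + scale_i) * low_i + shift_i
    where scale_i and shift_i are linear projections of the
    high-resolution feature paired with token i.
    """
    if len(low_res_tokens) != len(high_res_tokens):
        raise ValueError("expected one high-res feature per low-res token")
    out = []
    for low, high in zip(low_res_tokens, high_res_tokens):
        scale = matvec(w_scale, high)  # per-dimension scale for this token
        shift = matvec(w_shift, high)  # per-dimension shift for this token
        out.append([(1.0 + s) * x + b for x, s, b in zip(low, scale, shift)])
    return out


# Toy example: one token, embedding size 2, high-res feature size 3.
low = [[1.0, 2.0]]
high = [[1.0, 0.0, 0.0]]
w_scale = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w_shift = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
print(affine_renormalize(low, high, w_scale, w_shift))  # [[2.0, 2.5]]
```

With all-zero projection weights the transform in this sketch reduces to the identity, so the low-resolution embeddings pass through unchanged; the high-resolution features only modulate them once the projections are non-zero.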

