BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning

This study explores efficient tuning methods for the screenshot captioning task. Image captioning has advanced significantly in recent years, but research on captioning mobile screens remains relatively scarce, and datasets and use cases that describe user behaviors within product screenshots are notably limited. We therefore sought to fine-tune pre-existing models for screenshot captioning. However, fine-tuning large pre-trained models is resource-intensive: because image captioning models have a vast number of parameters, it requires considerable time, computational power, and storage. To address this challenge, this study proposes a combination of adapter methods, which requires tuning only the additional modules attached to the model. These methods were originally designed for vision or language tasks, and we apply them to similar challenges in screenshot captioning. By freezing the parameters of the image captioning model and training only the weights of the adapter modules, performance comparable to fine-tuning the entire model can be achieved while significantly reducing the number of trainable parameters. This study is the first comprehensive investigation into the effectiveness of combining adapters for the screenshot captioning task. Through our experiments and analyses, we aim to provide insights into the application of adapters in vision-language models and to contribute to the development of efficient tuning techniques for screenshot captioning. Our code is available at https://github.com/RainYuGG/BLIP-Adapter
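To make the parameter saving concrete, the following is a minimal sketch of how the trainable share shrinks when a backbone is frozen and only small bottleneck adapters are trained. It assumes a generic down-projection/up-projection adapter; the `AdapterSpec` class, the `trainable_fraction` helper, and all sizes are hypothetical illustrations and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AdapterSpec:
    """Shape of one hypothetical bottleneck adapter (down-project, then up-project)."""
    hidden_dim: int      # width of the frozen layer the adapter attaches to
    bottleneck_dim: int  # reduced width inside the adapter

    def num_params(self) -> int:
        # Down-projection weights and bias, then up-projection weights and bias.
        down = self.hidden_dim * self.bottleneck_dim + self.bottleneck_dim
        up = self.bottleneck_dim * self.hidden_dim + self.hidden_dim
        return down + up


def trainable_fraction(frozen_params: int, adapters: Sequence[AdapterSpec]) -> float:
    """Share of all parameters that are updated when the backbone stays frozen."""
    trainable = sum(adapter.num_params() for adapter in adapters)
    return trainable / (frozen_params + trainable)


# Illustrative sizes only (assumed, not reported in the paper).
adapters = [AdapterSpec(hidden_dim=768, bottleneck_dim=64) for _ in range(12)]
fraction = trainable_fraction(frozen_params=200_000_000, adapters=adapters)
print(f"trainable share: {fraction:.4%}")
```

Under these assumed sizes, the adapters account for well under one percent of all parameters, which is the kind of reduction the abstract refers to; the actual figures depend on the model and adapter configuration used.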

