UI automation tests play a crucial role in ensuring the quality of mobile applications. Machine learning techniques for generating these tests are increasingly popular, yet they still face several challenges, such as mismatched UI elements. Recent advances in Large Language Models (LLMs) have addressed these issues by leveraging the models' semantic understanding capabilities. However, a significant gap remains in applying these models to industrial-level app testing, particularly in terms of cost optimization and knowledge limitations. To address this gap, we introduce CAT, which creates cost-effective UI automation tests for industry apps by combining machine learning and LLMs with best practices. Given a task description, CAT employs Retrieval Augmented Generation (RAG) to source examples of industrial app usage as the few-shot learning context, assisting LLMs in generating the specific sequence of actions. CAT then employs machine learning techniques, with LLMs serving as a complementary optimizer, to map each action to its target element on the UI screen. Our evaluations on the WeChat testing dataset demonstrate CAT's performance and cost-effectiveness: it achieves 90% UI automation at a cost of $0.34, outperforming the state of the art. We have also integrated our approach into the real-world WeChat testing platform, where it has detected 141 bugs and enhanced the developers' testing process.
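The two phases described in the abstract can be illustrated with a minimal sketch. It is not CAT's implementation: the abstract does not specify the retriever, the prompt format, or the element-matching model, so everything below is an assumption made for illustration. Token-overlap (Jaccard) similarity stands in for both the RAG retriever and the machine learning matcher, and the names `UsageExample`, `retrieve`, `build_prompt`, `map_element`, the `llm_fallback` callable, and the `threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; a stand-in for a real embedding."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; 0.0 when both strings are empty."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass(frozen=True)
class UsageExample:
    """Hypothetical record of a past app usage: a task and its action sequence."""
    task: str
    actions: tuple[str, ...]


def retrieve(task: str, corpus: Sequence[UsageExample], k: int = 2) -> list[UsageExample]:
    """Phase 1a: pick the k stored examples most similar to the new task."""
    return sorted(corpus, key=lambda ex: jaccard(task, ex.task), reverse=True)[:k]


def build_prompt(task: str, shots: Sequence[UsageExample]) -> str:
    """Phase 1b: assemble the retrieved examples into a few-shot prompt."""
    lines: list[str] = []
    for ex in shots:
        lines.append(f"Task: {ex.task}")
        lines.append("Actions: " + " -> ".join(ex.actions))
    lines.append(f"Task: {task}")
    lines.append("Actions:")
    return "\n".join(lines)


def map_element(
    target: str,
    screen_labels: Sequence[str],
    llm_fallback: Callable[[str, Sequence[str]], str],
    threshold: float = 0.5,
) -> tuple[str, str]:
    """Phase 2: try the cheap matcher first; call the LLM only when it is unsure.

    Assumes screen_labels is non-empty. Returns (chosen label, who chose it).
    """
    best = max(screen_labels, key=lambda label: jaccard(target, label))
    if jaccard(target, best) >= threshold:
        return best, "matcher"
    return llm_fallback(target, screen_labels), "llm"
```

Under these assumptions, the cost saving comes from the ordering in `map_element`: the inexpensive matcher handles confident cases, and the LLM is invoked only for the remainder. In practice `llm_fallback` would wrap a model call; here it can be any function with the same signature.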