Android has become the predominant smartphone operating system, with a rapidly evolving ecosystem that requires app developers to frequently update their apps to maintain quality, security, and compatibility. While deep learning has made significant strides in various software engineering tasks, including automated code updates, existing methods are not specifically tailored for Android apps, and the potential of pre-trained Language Models of Code (CodeLMs) for updating Android app code remains unexplored. In this paper, we present the first comprehensive evaluation of state-of-the-art CodeLMs, including CodeT5, CodeBERT, CodeGPT, and UniXcoder, for recommending code updates in Android applications. To facilitate this evaluation, we curate a unique dataset of paired updated methods from 3,195 Android apps published on Google Play and hosted on GitHub between 2008 and 2022. Our findings demonstrate that pre-trained CodeLMs outperform traditional approaches, achieving 190% to 385% higher accuracy under a realistic time-wise evaluation scenario. Among the CodeLMs, CodeT5 consistently exhibits superior performance across most code update types. Furthermore, we examine the impact of update types, evaluation scenarios, method size, and update size on the performance of CodeLMs, revealing areas for future research to improve temporal adaptability and generalization capabilities.
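To make the time-wise evaluation scenario concrete, the following is a minimal sketch of how such a split could be built. It assumes each paired method update carries a commit date and that the split is a simple chronological cut; the `MethodUpdate` record, the `time_wise_split` function, and the 80/20 ratio are hypothetical names and values chosen for illustration, not the configuration used in this paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class MethodUpdate:
    """Hypothetical record for one paired method update."""
    commit_date: date   # assumed: date of the commit that made the update
    before: str         # method source before the update
    after: str          # method source after the update


def time_wise_split(
    pairs: List[MethodUpdate], train_frac: float = 0.8
) -> Tuple[List[MethodUpdate], List[MethodUpdate]]:
    """Train on the oldest updates and test on the newest ones.

    Unlike a random split, no update in the training set is newer than
    any update in the test set, so the model never sees "future" code.
    """
    ordered = sorted(pairs, key=lambda p: p.commit_date)
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


# Toy data for illustration only; these are not samples from the dataset.
sample = [
    MethodUpdate(date(2021, 5, 1), "void a() {}", "void a() { log(); }"),
    MethodUpdate(date(2009, 3, 2), "int b() { return 0; }", "int b() { return 1; }"),
    MethodUpdate(date(2015, 7, 9), "void c() {}", "void c() { init(); }"),
    MethodUpdate(date(2018, 1, 20), "void d() {}", "void d() { check(); }"),
    MethodUpdate(date(2012, 11, 30), "void e() {}", "void e() { sync(); }"),
]
train, test = time_wise_split(sample)
```

A random split would instead shuffle `pairs` before cutting, which lets updates written after a test sample leak into training. The chronological cut removes that leakage, which is one reason a time-wise scenario is considered more realistic.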