Recent advances in language models (LMs) have driven significant progress across a wide range of software engineering tasks. However, existing LMs still struggle in complex programming scenarios due to limitations in data quality, model architecture, and reasoning capability. This research systematically addresses these challenges along three complementary directions: (1) improving code data quality with a code-difference-guided adversarial augmentation technique (CODA) and a code denoising technique (CodeDenoise); (2) enhancing model architecture via syntax-guided code LMs (LEAM and LEAM++); and (3) advancing model reasoning with a prompting technique (muFiX) and an agent-based technique (Specine). Together, these techniques aim to promote the practical adoption of LMs in software development and further advance intelligent software engineering.