Code summarization has emerged as a fundamental technique in the field of program comprehension. While code language models have shown significant advancements, existing models and benchmarks are confined to high-readability code, which contains sufficient semantic cues such as function and variable names. In the real world, however, code is often poorly structured or obfuscated, which significantly degrades model performance. In this paper, we first empirically evaluate the robustness of state-of-the-art language models on poorly readable code for the task of code summarization, focusing on (1) their effectiveness, (2) the impact of prompt engineering, and (3) the robustness of different model variants. Experimental results reveal that state-of-the-art models, including GPT-4o and DeepSeek-V3, experience a substantial performance drop when faced with poorly readable code, and that prompt engineering and reasoning-enhanced models offer only limited improvements. Motivated by these findings, we propose RoFTCodeSum, a novel fine-tuning method that enhances the robustness of code summarization against poorly readable code. RoFTCodeSum marries the concepts of curriculum learning and meta-learning: starting from the original fine-tuning dataset, it creates curricular training sets, e.g., by obfuscating function names and variable identifiers respectively, that pose progressively greater difficulty for code comprehension. In each training step, the approach meta-updates the gradients using these progressively challenging datasets, thereby optimizing accuracy and readability robustness simultaneously. Experimental results demonstrate that RoFTCodeSum exhibits increased robustness against semantic perturbation while also improving performance on the original code.
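To make the curricular construction concrete, below is a minimal sketch of how progressively obfuscated variants of a training sample might be generated. This is an illustration, not the paper's actual pipeline: the three difficulty levels (original, function names obfuscated, all identifiers obfuscated), the AST-based renaming, and the helper names `_Renamer` and `make_curriculum` are all assumptions for the example.

```python
import ast
import builtins

# Names of Python builtins are left intact so the obfuscated code still parses
# with recognizable calls; a real pipeline might obfuscate these too.
_BUILTINS = set(dir(builtins))


class _Renamer(ast.NodeTransformer):
    """Replace function names and/or variable identifiers with opaque tokens."""

    def __init__(self, rename_funcs: bool, rename_vars: bool):
        self.rename_funcs = rename_funcs
        self.rename_vars = rename_vars
        self.mapping = {}  # original name -> obfuscated token

    def _map(self, name: str, prefix: str) -> str:
        if name not in self.mapping:
            self.mapping[name] = f"{prefix}{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # rename parameters/body first
        if self.rename_funcs:
            node.name = self._map(node.name, "f")
        return node

    def visit_arg(self, node):  # function parameters
        if self.rename_vars:
            node.arg = self._map(node.arg, "v")
        return node

    def visit_Name(self, node):  # variable uses
        if self.rename_vars and node.id not in _BUILTINS:
            node.id = self._map(node.id, "v")
        return node


def make_curriculum(source: str) -> list[str]:
    """Return variants of `source` with progressively fewer semantic cues.

    Level 0: original code; level 1: function names obfuscated;
    level 2: function names and variable identifiers obfuscated.
    Requires Python 3.9+ for ast.unparse.
    """
    variants = []
    for funcs, variables in [(False, False), (True, False), (True, True)]:
        tree = ast.parse(source)
        tree = _Renamer(funcs, variables).visit(tree)
        variants.append(ast.unparse(tree))
    return variants
```

During fine-tuning, each curricular set would then contribute a gradient at every training step, with the meta-update combining them so that the model is optimized on easy and hard variants jointly rather than sequentially.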