Recent work applies Large Language Models (LLMs) to source-code vulnerability detection, but most evaluations still rely on random train-test splits that ignore time and therefore overestimate real-world performance. In practice, detectors are deployed on evolving code bases and must recognise future vulnerabilities under temporal distribution shift. This paper investigates continual fine-tuning of a decoder-style language model (microsoft/phi-2 with LoRA) on a CVE-linked dataset spanning 2018-2024, organised into bi-monthly windows. We evaluate eight continual learning strategies, including window-only and cumulative training, replay-based baselines and regularisation-based variants. We propose Hybrid Class-Aware Selective Replay (Hybrid-CASR), a confidence-aware replay method for binary vulnerability classification that prioritises uncertain samples while maintaining a balanced ratio of VULNERABLE and FIXED functions in the replay buffer. On bi-monthly forward evaluation, Hybrid-CASR achieves a Macro-F1 of 0.667, a statistically significant improvement of 0.016 over the window-only baseline (0.651; $p = 0.026$), together with stronger backward retention (IBR@1 of 0.741). Hybrid-CASR also reduces training time per window by about 17% compared to the baseline, whereas cumulative training reaches a Macro-F1 of only 0.661, a minor increase over the baseline, at a 15.9-fold computational cost. Overall, the results show that selective replay with class balancing offers a practical accuracy-efficiency trade-off for LLM-based vulnerability detection under continuous temporal drift.
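The replay selection idea described above (prioritise uncertain samples, keep VULNERABLE and FIXED functions balanced) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the scoring rule or the class ratio, so the uncertainty measure (distance of the predicted probability from 0.5), the 50/50 split, and the function name `select_replay` are all assumptions made for illustration.

```python
from typing import List, Sequence


def select_replay(probs: Sequence[float], labels: Sequence[int], budget: int) -> List[int]:
    """Return indices of samples to keep in a replay buffer (hypothetical helper).

    probs[i]  : predicted probability that sample i is VULNERABLE
    labels[i] : 1 for VULNERABLE, 0 for FIXED
    budget    : maximum buffer size; assumed to be split 50/50 between classes
    """
    per_class = budget // 2
    chosen: List[int] = []
    for cls in (1, 0):
        # Candidate indices belonging to the current class.
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Assumed uncertainty score: the closer to 0.5, the less confident the model.
        idx.sort(key=lambda i: abs(probs[i] - 0.5))
        # Keep the most uncertain samples of this class, up to its share of the budget.
        chosen.extend(idx[:per_class])
    return sorted(chosen)


probs = [0.95, 0.52, 0.10, 0.45, 0.70, 0.30]
labels = [1, 1, 0, 0, 1, 0]
print(select_replay(probs, labels, budget=4))  # → [1, 3, 4, 5]
```

In this toy input the confidently classified samples (indices 0 and 2) are dropped, and the buffer ends up with two samples per class. A real buffer policy would also need to decide how to combine samples from earlier windows and what to do when one class has fewer candidates than its share, neither of which the abstract specifies.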