How Code Representation Shapes False-Positive Dynamics in Cross-Language LLM Vulnerability Detection

How code representation format shapes false-positive behaviour in cross-language LLM vulnerability detection remains poorly understood. We systematically vary training intensity and code representation format, comparing raw source text with pruned Abstract Syntax Trees (ASTs) at both training time and inference time, across two 8B-parameter LLMs (Qwen3-8B and Llama 3.1-8B-Instruct) fine-tuned on C/C++ data from the NIST Juliet Test Suite (v1.3) and evaluated on Java (OWASP Benchmark v1.2) and Python (BenchmarkPython v0.1). The cross-language false-positive rate (FPR) reflects the joint effect of training-time and inference-time representation, not either alone. Text fine-tuning drives FPR upward monotonically (Qwen3-8B: 0.763 zero-shot, 0.866 pilot, 1.000 full-scale) while F1 remains stable (0.637 to 0.688), so F1 alone masks the collapse. We argue that surface-cue memorisation is the primary mechanism: text fine-tuning encodes C/C++-specific API names and syntactic idioms as vulnerability triggers that fire indiscriminately on target-language code. A cross-representation probe, which applies text-trained weights to AST-encoded input without retraining, isolates this effect: Qwen3-8B FPR drops from 0.866 to 0.583, and 37.2% of false positives revert to true negatives under AST input alone. Direct AST fine-tuning does not preserve the benefit (FPR of at least 0.970), as flat linearisation introduces structural surface cues of its own. The pattern replicates across both model families. On BenchmarkPython the AST probe yields an FPR of 0.554, within 2.9 percentage points of the Java result despite maximal surface-syntax differences, which substantially weakens a domain-shift explanation. These findings motivate a pre-deployment consistency gate, which runs each alert through both the text and AST paths, as a retraining-free filter for false-positive-sensitive settings, at the cost of reduced recall.
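The consistency gate and the way F1 can hide an FPR collapse can both be illustrated with a minimal Python sketch. It assumes the gate keeps an alert only when the text path and the AST path agree; the abstract does not specify the exact rule, so this AND-combination is an assumption. The names `score` and `consistency_gate` are hypothetical, and the labels and predictions are toy values, not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Metrics:
    fpr: float
    recall: float
    f1: float


def score(labels: Sequence[bool], preds: Sequence[bool]) -> Metrics:
    """Compute FPR, recall and F1 from binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    tn = sum(1 for y, p in zip(labels, preds) if not y and not p)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return Metrics(fpr=fpr, recall=recall, f1=f1)


def consistency_gate(text_preds: Sequence[bool],
                     ast_preds: Sequence[bool]) -> list[bool]:
    """Assumed gate rule: keep an alert only if both paths flag the sample."""
    return [bool(t and a) for t, a in zip(text_preds, ast_preds)]


if __name__ == "__main__":
    # Toy data (hypothetical): 5 vulnerable samples followed by 5 safe ones.
    labels = [True] * 5 + [False] * 5

    # A detector that flags everything has FPR 1.0, yet on a balanced
    # set its F1 is still 2/3, which is how F1 can mask the collapse.
    print(score(labels, [True] * 10))

    # Hypothetical per-path verdicts from the same text-trained weights.
    text_preds = [True] * 9 + [False]
    ast_preds = [True, True, True, False, True,
                 True, False, True, False, True]

    gated = consistency_gate(text_preds, ast_preds)
    print("text only:", score(labels, text_preds))
    print("gated:    ", score(labels, gated))
```

In this toy case the gate lowers FPR by discarding alerts the AST path does not confirm, and it also drops one true positive, which mirrors the recall cost noted above.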
