Semantic Consensus Decoding: Backdoor Defense for Verilog Code Generation

Large language models (LLMs) for Verilog code generation are increasingly adopted in hardware design, yet they remain vulnerable to backdoor attacks, in which adversaries inject malicious triggers during training so that the model later emits vulnerable hardware designs. Unlike software vulnerabilities, which can be patched, hardware trojans become irreversible once fabricated, making remediation extremely costly or impossible. Existing active defenses require access to training data, which is impractical for third-party LLM users, while passive defenses struggle against semantically stealthy triggers that blend naturally into design specifications. In this paper, we hypothesize that, because an attack must be both effective and stealthy, attackers are strongly biased toward embedding triggers in non-functional requirements (e.g., style modifiers, quality descriptors) rather than in the functional specifications that determine hardware behavior. Exploiting this insight, we propose Semantic Consensus Decoding (SCD), an inference-time passive defense with two key components: (1) functional requirement extraction, which identifies the essential requirements in a user specification, and (2) consensus decoding, which adaptively fuses the output distribution conditioned on the full user specification with the one conditioned on the extracted functional requirements. When these two distributions diverge significantly, SCD automatically suppresses the suspicious components. Extensive experiments with three representative backdoor attacks demonstrate that SCD reduces the average attack success rate from 89% to under 3% with negligible impact on generation quality.
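
The consensus decoding step can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so everything concrete here is an assumption made for illustration: the use of Jensen-Shannon divergence as the divergence measure, the exponential weighting, and the names `js_divergence`, `consensus_fuse`, and `sharpness` are all hypothetical and are not taken from the paper. The sketch only shows the general idea that the distribution conditioned on the full specification is trusted less as it departs from the distribution conditioned on the functional requirements alone.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def consensus_fuse(p_full, p_func, sharpness=10.0):
    """Fuse two next-token distributions (illustrative rule, not the paper's).

    p_full: distribution conditioned on the full user specification.
    p_func: distribution conditioned on the extracted functional requirements.
    sharpness: hypothetical knob for how fast trust in p_full decays.
    """
    d = js_divergence(p_full, p_func)
    # Weight on p_full is 1 when the distributions agree and decays
    # toward 0 as they diverge.
    w_full = math.exp(-sharpness * d)
    fused = [w_full * a + (1.0 - w_full) * b for a, b in zip(p_full, p_func)]
    total = sum(fused)
    return [x / total for x in fused], d


if __name__ == "__main__":
    # Toy 3-token vocabulary: the full specification strongly prefers
    # token 1, while the functional-only view prefers token 0.
    fused, d = consensus_fuse([0.05, 0.90, 0.05], [0.90, 0.05, 0.05])
    print(fused, d)
```

In this toy case the two views disagree sharply, so the fused distribution follows the functional-only view; when they agree, the fused distribution is the full-specification distribution unchanged.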
