TrojanWhisper: Evaluating Pre-trained LLMs to Detect and Localize Hardware Trojans

Existing Hardware Trojan (HT) detection methods face several critical limitations: logic testing struggles with scalability and coverage for large designs, side-channel analysis requires golden reference chips, and formal verification suffers from state-space explosion. The emergence of Large Language Models (LLMs) offers a promising new direction for HT detection by leveraging their natural language understanding and reasoning capabilities. This paper is the first to explore the potential of general-purpose LLMs for detecting various HTs inserted in Register Transfer Level (RTL) designs, including SRAM, AES, and UART modules. We propose a novel tool that systematically assesses state-of-the-art LLMs (GPT-4o, Gemini 1.5 Pro, and Llama 3.1) on HT detection without prior fine-tuning. To address potential training data bias, the tool applies perturbation techniques, namely variable name obfuscation and design restructuring, that make the test cases more challenging for the evaluated LLMs. In baseline scenarios, GPT-4o and Gemini 1.5 Pro achieve perfect detection (100%/100% precision/recall), and both models achieve better trigger line coverage (TLC: 0.82-0.98) than payload line coverage (PLC: 0.32-0.46). Under code perturbation, Gemini 1.5 Pro maintains perfect detection (100%/100%), whereas GPT-4o (100%/85.7%) and Llama 3.1 (66.7%/85.7%) show some degradation in detection rates, and all models localize both triggers and payloads less accurately. These results support the potential of LLM-based approaches for hardware security applications and highlight areas for future improvement.
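The reported metrics can be illustrated with a minimal sketch. The abstract does not define TLC and PLC precisely, so the code below assumes a line coverage score is the fraction of ground-truth Trojan lines (trigger or payload) that the model flagged. The function names and all example inputs are hypothetical and are not taken from the paper's tool or dataset.

```python
def precision_recall(predicted, actual):
    """Design-level detection metrics.

    predicted, actual: equal-length lists of booleans, one per RTL design
    (True means "contains a Trojan").
    """
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)        # Trojan present and reported
    fp = sum(p and not a for p, a in pairs)    # reported but design is clean
    fn = sum(a and not p for p, a in pairs)    # Trojan present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def line_coverage(flagged_lines, truth_lines):
    """Assumed definition of TLC/PLC: share of ground-truth lines flagged."""
    truth = set(truth_lines)
    if not truth:
        return 0.0
    return len(truth & set(flagged_lines)) / len(truth)


# Hypothetical run: 7 Trojan-infected designs, 6 detected, no false alarms.
precision, recall = precision_recall([True] * 6 + [False], [True] * 7)
print(f"precision={precision:.3f} recall={recall:.3f}")

# Hypothetical localization: the model flags lines 10-12, while the
# ground-truth payload occupies lines 10, 11, 40, and 41.
print(f"coverage={line_coverage([10, 11, 12], [10, 11, 40, 41]):.2f}")
```

Under this assumed definition, a model can score perfectly on detection while still scoring low on payload coverage, because detection only asks whether a design is infected, whereas coverage asks which lines implement the Trojan.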
