Quantum software bugs often yield silent, incorrect outputs rather than explicit errors, making them particularly difficult to detect and repair with conventional techniques. Although large language models (LLMs) have shown strong performance on classical software engineering tasks, their ability to debug quantum code remains largely unexplored. To bridge this gap, we propose QBugLM, a multi-agent framework that automates the quantum software debugging pipeline, from taxonomy-driven bug injection to LLM-based detection and repair, and finally to simulation-based validation, for framework-agnostic OpenQASM 3.0 programs. We further conduct a comprehensive case study using QBugLM to benchmark two LLMs, Claude 4.6 Sonnet and Qwen3 Coder Next, across different prompting strategies, bug categories, and quantum programs. Our results show that iterative feedback is critical, as a single retry raises Pass@1 from below 25% to above 80%. Moreover, simpler structured prompting can even outperform Chain-of-Thought and ReAct for reasoning-capable models under fixed-resource constraints. Our work takes initial steps toward benchmarking LLM capabilities for debugging quantum programs and offers practical insights to support future efforts in automated quantum software repair.
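The iterative-feedback loop described above (propose a repair, validate it by simulation, retry with feedback on failure) can be illustrated with a minimal sketch. This is not the QBugLM implementation: the names `repair_with_feedback`, `propose_fix`, `simulate`, the total-variation acceptance check, and the 0.05 tolerance are all assumptions made for illustration. The "model" and "simulator" are toy stand-ins, and the program labels are opaque strings rather than real OpenQASM 3.0 source.

```python
from typing import Callable, Dict, Optional, Tuple

Distribution = Dict[str, float]  # bitstring -> probability


def total_variation(p: Distribution, q: Distribution) -> float:
    """Total variation distance between two measurement distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def repair_with_feedback(
    buggy_program: str,
    reference: Distribution,
    propose_fix: Callable[[str, Optional[str]], str],  # hypothetical LLM call
    simulate: Callable[[str], Distribution],           # hypothetical simulator
    max_attempts: int = 2,
    tolerance: float = 0.05,                            # assumed threshold
) -> Tuple[Optional[str], int]:
    """Return (accepted program or None, number of attempts used)."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = propose_fix(buggy_program, feedback)
        distance = total_variation(simulate(candidate), reference)
        if distance <= tolerance:
            return candidate, attempt
        # A silent bug raises no error, so the only signal to pass back
        # is the mismatch between simulated and expected output.
        feedback = (
            f"attempt {attempt} failed: output distribution differs "
            f"from reference (TV distance {distance:.3f})"
        )
    return None, max_attempts


# Toy stand-ins: a lookup table plays the simulator, and the fake model
# only produces the correct repair once it has received feedback.
TOY_OUTPUTS: Dict[str, Distribution] = {
    "bell_buggy": {"00": 0.5, "10": 0.5},
    "bell_wrong_fix": {"00": 1.0},
    "bell_fixed": {"00": 0.5, "11": 0.5},
}
REFERENCE: Distribution = {"00": 0.5, "11": 0.5}


def toy_simulate(program: str) -> Distribution:
    return TOY_OUTPUTS[program]


def toy_propose_fix(program: str, feedback: Optional[str]) -> str:
    return "bell_wrong_fix" if feedback is None else "bell_fixed"


if __name__ == "__main__":
    print(repair_with_feedback("bell_buggy", REFERENCE, toy_propose_fix, toy_simulate))
```

With `max_attempts=1` the toy model's first proposal is rejected and the function returns `None`; allowing one retry lets the feedback message reach the model and the second proposal is accepted. The sketch shows only the mechanism and says nothing about the reported Pass@1 figures.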