Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise code after a runtime error or wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.
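To make the evaluation protocol concrete, the following is a minimal sketch of how Pass@k and a KL-divergence-based acceptance check could be computed. It rests on several assumptions not stated in the abstract: Pass@k uses the standard unbiased estimator from `n` samples with `c` passing, the divergence is taken as KL(expected || observed) over measurement outcomes, and the acceptance `threshold` of 0.05 is a placeholder, not the benchmark's actual value. The function names are hypothetical.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n generated samples, c of which passed.

    Assumption: the standard estimator 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def kl_divergence(expected: dict, observed_counts: dict, eps: float = 1e-9) -> float:
    """KL(expected || observed) in nats over measurement outcomes.

    `expected` maps bitstrings to reference probabilities;
    `observed_counts` maps bitstrings to shot counts from the generated circuit.
    Observed probabilities are floored at `eps` to keep the logarithm finite.
    """
    total = sum(observed_counts.values())
    if total <= 0:
        raise ValueError("observed_counts must contain at least one shot")
    kl = 0.0
    for outcome, p in expected.items():
        if p <= 0.0:
            continue  # terms with zero reference probability contribute nothing
        q = observed_counts.get(outcome, 0) / total
        kl += p * math.log(p / max(q, eps))
    return kl


def accept(expected: dict, observed_counts: dict, threshold: float = 0.05) -> bool:
    """Accept a probabilistic output if its divergence is within the threshold.

    The default threshold is an illustrative placeholder.
    """
    return kl_divergence(expected, observed_counts) <= threshold


if __name__ == "__main__":
    bell = {"00": 0.5, "11": 0.5}  # ideal Bell-state measurement distribution
    print(accept(bell, {"00": 510, "11": 490}))  # sampling noise only
    print(accept(bell, {"00": 1000}))            # missing the "11" outcome
    print(pass_at_k(n=10, c=3, k=1))
```

A divergence-based check of this kind tolerates shot noise in sampled outcomes while still rejecting circuits that produce a structurally wrong distribution, which exact-match comparison cannot do for probabilistic outputs.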