Claim against Measurement: Statistical Artefacts in Quantum Error Mitigation Benchmarks

Quantum error mitigation (QEM) is widely regarded as a plausible bridge from noisy intermediate-scale quantum (NISQ) devices to fault-tolerant quantum computing (FTQC). Yet the empirical studies used to assess the effectiveness of QEM techniques on concrete problems have received comparatively little scrutiny with respect to the validity of their conclusions. We systematically review 81 recent QEM papers using an eight-criterion framework covering statistical rigour, reproducibility, and reporting quality. Among the applicable papers, only 15 (25%) use inferential methods, while 25 (42%) report uncertainty only descriptively, without testing whether the claimed effects are statistically supported. To demonstrate the consequences of these omissions, we take zero-noise extrapolation (ZNE) as a representative and widely used case study and identify two compounding sources of artefacts in current QEM benchmarks. First, we observe parameter sensitivity: in a 132-configuration sweep, choices that are usually left implicit, such as scale factors, extrapolation method, and hardware calibration, prove decisive rather than incidental, and varying them can change the conclusion from statistically significant improvement to statistically significant degradation. Second, we identify a drift-induced effectiveness illusion: in a 72-hour longitudinal study on real hardware, temporal drift alone can make the same ZNE configuration exhibit an effect size more than three times as large, depending solely on when it is executed, and can also drastically reduce the effective number of independent observations. These findings do not imply that QEM methods are intrinsically unsound; rather, they show that current evaluation practice can make mitigation performance appear more robust than the evidence warrants. We therefore propose minimum reporting standards for QEM evaluations, including explicit parameter documentation, robustness checks, longitudinal drift assessment, and inferential statistical testing with effect-size reporting.
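The parameter sensitivity described above can be illustrated with a small noiseless toy model. The sketch below is not the sweep reported in this paper: the exponential decay model, its decay rate, and the function names `linear_zne`, `richardson_zne`, and `noisy_expectation` are assumptions made purely for illustration. It shows that the zero-noise estimate already depends on the scale factors and the extrapolation method before any shot noise or drift is considered.

```python
import math

def linear_zne(scales, values):
    """Least-squares line through (scale, value); returns the intercept at scale 0."""
    n = len(scales)
    mean_x = sum(scales) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in scales)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scales, values))
    slope = sxy / sxx
    return mean_y - slope * mean_x

def richardson_zne(scales, values):
    """Interpolating polynomial through all points (Lagrange form), evaluated at scale 0."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(scales, values)):
        weight = 1.0
        for j, xj in enumerate(scales):
            if j != i:
                weight *= (0.0 - xj) / (xi - xj)
        total += yi * weight
    return total

def noisy_expectation(scale, ideal=1.0, decay=0.3):
    """Hypothetical noise model (assumption): exponential decay of the ideal value."""
    return ideal * math.exp(-decay * scale)

# Two plausible scale-factor choices, two extrapolation methods.
for scales in ([1.0, 2.0, 3.0], [1.0, 3.0, 5.0]):
    values = [noisy_expectation(s) for s in scales]
    print(scales,
          "linear:", round(linear_zne(scales, values), 4),
          "richardson:", round(richardson_zne(scales, values), 4))
```

Under this assumed model the ideal value is 1.0, so the spread between the four printed estimates is entirely an artefact of the configuration choices, which is why such choices need to be documented explicitly.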
