Reliable large-scale quantum computation relies on fault-tolerant architectures, in which quantum error correction (QEC) continuously extracts and decodes error syndromes in real time. A critical component of QEC is the decoder, a classical subsystem that must deliver both high logical accuracy and ultra-low latency. This paper presents a novel algorithm-hardware co-design that improves the accuracy-latency trade-off over existing approaches such as vanilla Minimum-Weight Perfect Matching (MWPM) and Union-Find (UF) decoders. At the algorithmic level, we introduce coset ensemble decoding, which improves UF decoding by explicitly exploiting logically equivalent cosets. Our method performs ensemble forest exploration to generate multiple coset-consistent candidates and aggregates them to approximate coset-level maximum-likelihood decoding. We further reduce computational and memory complexity through reverse-order elimination and lossless graph compression, without sacrificing accuracy. At the hardware level, we design a domain-specific architecture that reuses resources temporally, avoiding the resource growth proportional to code distance seen in prior spatial architectures. We propose several optimizations, such as multi-bank memory hashing and hierarchical ID mapping, to mitigate pipeline stalls and memory conflicts under highly concurrent access patterns. Under a circuit-level depolarizing noise model, our co-design achieves a better accuracy-latency trade-off than prior MWPM- and UF-based decoders, while reducing FPGA LUT consumption by up to 8.2× relative to the resources reported for UF-based decoders. The tunable number of candidates further serves as a flexible design knob, letting users tailor decoding performance to the requirements of different fault-tolerant workloads. Our implementation is publicly available at https://github.com/IMSeonL/coset-ensemble-decoder.
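The abstract does not spell out the aggregation rule, so the following is only a minimal sketch of what coset-level aggregation over an ensemble of candidates can look like. It assumes that each candidate is a set of flipped qubit indices, that qubits fail independently with known probabilities, that all candidates are already consistent with the same syndrome, and that there is a single logical qubit whose coset is read off from the parity of overlap with one logical operator's support. The names `coset_ensemble_vote`, `logical_class`, `log_weight`, and `log_sum_exp` are hypothetical and are not taken from the linked repository. Rather than keeping the single most likely candidate, the sketch sums the likelihoods of the distinct candidates in each coset and selects the coset with the larger total.

```python
import math
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Sequence, Tuple

# Hypothetical representation: a correction is the set of qubit indices it flips.
Correction = FrozenSet[int]


def log_weight(correction: Correction, probs: Sequence[float]) -> float:
    """Log-probability of an error pattern, up to a pattern-independent constant,
    assuming independent flips with per-qubit probability probs[q]."""
    return sum(math.log(probs[q] / (1.0 - probs[q])) for q in correction)


def logical_class(correction: Correction, logical_support: FrozenSet[int]) -> int:
    """Coset label for one logical qubit: parity of overlap with a logical operator."""
    return len(correction & logical_support) % 2


def log_sum_exp(xs: List[float]) -> float:
    """Numerically stable log(sum(exp(x))) so tiny likelihoods do not underflow."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def coset_ensemble_vote(
    candidates: Iterable[Correction],
    probs: Sequence[float],
    logical_support: FrozenSet[int],
) -> Tuple[int, Correction]:
    """Return the coset with the largest summed likelihood over distinct candidates,
    together with the most likely candidate inside that coset."""
    per_coset: Dict[int, List[float]] = defaultdict(list)
    best: Dict[int, Tuple[float, Correction]] = {}
    # Deduplicate (ensemble members may find the same correction), then visit in a
    # fixed order so that ties inside a coset are broken deterministically.
    for c in sorted(set(candidates), key=sorted):
        lw = log_weight(c, probs)
        k = logical_class(c, logical_support)
        per_coset[k].append(lw)
        if k not in best or lw > best[k][0]:
            best[k] = (lw, c)
    scores = {k: log_sum_exp(v) for k, v in per_coset.items()}
    winner = max(scores, key=scores.get)
    return winner, best[winner][1]


# Toy input: hypothetical candidates assumed to share one syndrome; {1, 2} appears twice.
probs = [0.3] * 7
cands = [frozenset({0}), frozenset({1, 2}), frozenset({3, 4}),
         frozenset({5, 6}), frozenset({1, 2})]
print(coset_ensemble_vote(cands, probs, frozenset({0})))  # (0, frozenset({1, 2}))
```

In the toy input, the single most likely candidate `{0}` lies in coset 1, yet the three distinct weight-2 candidates together carry more probability mass (3 × (3/7)² ≈ 0.55 versus 3/7 ≈ 0.43), so the vote selects coset 0, whereas a minimum-weight rule would pick `{0}`. Because only the sampled candidates are summed, the per-coset totals approximate, rather than equal, the true coset probabilities. This is why a larger candidate count can trade latency for accuracy.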