Efficient classical simulation of quantum Hamiltonian dynamics is often bottlenecked by the exponential growth of the state space and by the overhead of generic sparse linear algebra. We introduce diagonal-budgeted Trotterization, a structure-aware strategy that decomposes a Hamiltonian into factors that preserve diagonal sparsity while tightly controlling fidelity loss. Our implementation, HamSim, uses a compact diagonal-sparse data layout and specialized C++/CUDA kernels to avoid the overheads of generic formats such as CSR. By combining SIMD vectorization, multithreading, and GPU acceleration, HamSim achieves high performance across heterogeneous architectures. Benchmarks on the HamLib suite show that HamSim significantly outperforms Qiskit-Aer. On CPUs, HamSim attains speedups of $182$--$1{,}269\times$ on optimization instances (TSP, MaxCut) and $4.8$--$841\times$ on physical models (TFIM, Heisenberg). On GPUs, it achieves speedups of up to $178\times$ on $12$--$16$ qubit problems. Unlike traditional Trotterization, HamSim maintains near-perfect fidelity without requiring exponentially many Trotter steps. These results indicate that diagonal-aware numerical kernels provide a scalable foundation for high-fidelity classical Hamiltonian simulation.
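To illustrate why a diagonal-oriented layout can avoid the index indirection of CSR, the following is a minimal sketch of a matrix-vector product over a matrix stored by diagonals. It is not HamSim's actual data layout or kernel, which the abstract does not specify; the function name `dia_matvec`, the offset-to-values dictionary, and the two-qubit example Hamiltonian $H = Z \otimes Z + I \otimes X$ are assumptions chosen purely for illustration.

```python
from typing import Dict, List


def dia_matvec(diags: Dict[int, List[complex]], x: List[complex]) -> List[complex]:
    """Compute y = A @ x for a square matrix A stored by diagonals.

    `diags` maps an offset k to a length-n list d with d[i] = A[i][i + k].
    Slots whose column i + k lies outside the matrix are padding and are skipped.
    (Hypothetical layout for illustration; not taken from HamSim.)
    """
    n = len(x)
    y = [0j] * n
    for k, d in diags.items():
        lo = max(0, -k)      # first row where column i + k is in range
        hi = min(n, n - k)   # one past the last such row
        # Each diagonal is a contiguous, stride-1 sweep with no column-index lookups.
        for i in range(lo, hi):
            y[i] += d[i] * x[i + k]
    return y


# Illustrative 2-qubit Hamiltonian H = Z(x)Z + I(x)X, which has three nonzero diagonals.
H_DIAGS = {
    0: [1, -1, -1, 1],   # Z(x)Z on the main diagonal
    1: [1, 0, 1, 0],     # upper part of I(x)X (last slot is padding)
    -1: [0, 1, 0, 1],    # lower part of I(x)X (first slot is padding)
}

if __name__ == "__main__":
    print(dia_matvec(H_DIAGS, [1, 2, 3, 4]))
```

Because every diagonal is traversed with unit stride over both the stored values and the input vector, the inner loop is the kind of access pattern that lends itself to SIMD vectorization, whereas CSR requires a gather through per-entry column indices.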