ARM SVE and RISC-V RVV are emerging vector architectures in high-end processors that support vectorization with flexible vector lengths. In this work, we use quantum state-vector simulation, an important workload for quantum computing, to examine whether high-performance portability can be achieved in a vector-length agnostic (VLA) design. We propose a VLA design and optimization techniques critical for achieving high performance, including VLEN-adaptive memory layout adjustment, load buffering, fine-grained loop control, and gate fusion-based arithmetic intensity adaptation. We provide an implementation in Google's Qsim and evaluate five quantum circuits of up to 36 qubits on three ARM processors: NVIDIA Grace, AWS Graviton3, and Fujitsu A64FX. By defining new metrics and PMU events to quantify vectorization activities, we draw general insights for future VLA designs. Our single-source implementation of VLA quantum simulation achieves speedups of up to 4.5x on A64FX, 2.5x on Grace, and 1.5x on Graviton3.
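To make the VLA idea concrete, the following is a minimal sketch of a vector-length agnostic loop that applies a single-qubit gate to a state vector. It is not the Qsim implementation and uses no SVE or RVV intrinsics; it is a scalar Python emulation in which the hypothetical parameter `vlen` stands in for the hardware vector length and `active` plays the role of a tail predicate. The function name `apply_gate_vla` and all variable names are assumptions made for illustration. The point it shows is that the same source yields the same result for any `vlen`.

```python
import math


def apply_gate_vla(state, gate, target, vlen):
    """Apply a 2x2 gate to qubit `target` of `state`.

    Amplitude pairs are processed in chunks of `vlen` lanes, emulating a
    vector-length agnostic loop: the chunk width is a runtime value and the
    last chunk is shortened instead of requiring a separate scalar tail.
    """
    stride = 1 << target          # distance between the two amplitudes of a pair
    num_pairs = len(state) // 2   # each pair differs only in the target bit
    out = list(state)
    p = 0
    while p < num_pairs:
        # Number of active lanes in this chunk (emulates a tail predicate).
        active = min(vlen, num_pairs - p)
        for lane in range(active):
            k = p + lane
            # Insert a 0 bit at position `target` to get the index whose
            # target bit is clear; its partner has the target bit set.
            i0 = ((k >> target) << (target + 1)) | (k & (stride - 1))
            i1 = i0 | stride
            a0, a1 = state[i0], state[i1]
            out[i0] = gate[0][0] * a0 + gate[0][1] * a1
            out[i1] = gate[1][0] * a0 + gate[1][1] * a1
        p += vlen
    return out


# Hadamard gate, used here only as an example input.
h = 1 / math.sqrt(2)
H = [[h, h], [h, -h]]
```

Calling `apply_gate_vla(state, H, target, vlen)` with different `vlen` values returns identical amplitudes, which is the portability property a VLA design relies on. The performance techniques named in the abstract (memory layout adjustment, load buffering, gate fusion) are not modeled in this sketch.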