We empirically characterise the cost-efficiency deficit of cloud Tensor Processing Units (TPUs) relative to GPUs for finite-field cryptography. Against A100 GPU baselines (cuZK), we measure a deficit in the range $[5{,}558\times, 6{,}908\times]$ across the v5p and v4 architectures under an FP32-mantissa staging discipline, and a deficit of $\sim$$4{,}693\times$ using v5p's native \texttt{int32} accumulator. We analytically decompose this deficit into a fundamental arithmetic penalty (the absence of wide-integer ALUs) and a spatial penalty. For concurrent multi-tenant deployments, where strict separation forces eager Montgomery reduction, we project a $5.19\times$ spatial collapse; relaxing this constraint recovers the spatial cycles in theory, but the underlying arithmetic penalty remains. To support this characterisation, we deploy \codename as a measurement vehicle. By mapping low-degree polynomials onto matrix-form Number Theoretic Transforms (NTTs), the scheduler stacks heterogeneous polynomials into dense 2D matrices, achieving $\sim$$100\%$ K-dimension column occupancy on uniform workloads ($>$$92\%$ on mixed-degree traces). Despite this near-optimal K-dimension packing, severe M-dimension under-utilisation (e.g., $6.25\%$ on v4), combined with dominant VPU-bound Montgomery reduction stalls, starves the systolic arrays. A post-hoc HLO validator ensures that these measurements remain structurally isolated from the XLA fusion engine. Our findings demonstrate empirically that AI-optimised systolic arrays are structurally ill-suited to exact, high-throughput field arithmetic.