Architectural Limits of Cloud TPUs in Finite-Field Cryptography

We empirically characterise the cost-efficiency deficit of cloud Tensor Processing Units (TPUs) relative to GPUs for finite-field cryptography. Against A100 GPU baselines (cuZK), we measure a deficit in the range $[5{,}558\times, 6{,}908\times]$ across the v5p and v4 architectures under an FP32-mantissa staging discipline, and a deficit of $\sim 4{,}693\times$ using v5p's native \texttt{int32} accumulator. We analytically decompose this deficit into a fundamental arithmetic penalty, caused by the lack of wide-integer ALUs, and a spatial penalty. For concurrent multi-tenant deployments, where strict separation forces eager Montgomery reduction, we project a $5.19\times$ spatial collapse; relaxing this constraint would in theory recover the spatial cycles, but the underlying arithmetic penalty remains. To support this characterisation, we deploy \codename{} as a measurement vehicle. By mapping low-degree polynomials onto matrix-form Number Theoretic Transforms (NTTs), the scheduler stacks heterogeneous polynomials into dense 2D matrices, achieving $\sim 100\%$ K-dimension column occupancy on uniform workloads ($> 92\%$ on mixed-degree traces). Despite this optimal K-dimension packing, severe M-dimension under-utilisation (e.g., $6.25\%$ on v4), combined with dominant VPU-bound Montgomery reduction stalls, starves the systolic arrays. A post-hoc HLO validator ensures that these measurements remain structurally isolated from the XLA fusion engine. Our findings demonstrate the structural inadequacy of AI-optimised systolic arrays for exact, high-throughput field arithmetic.
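
For orientation, the two primitives named above can be written in their textbook form. This is a sketch in notation introduced here ($n$, $p$, $\omega$, $B$, $R$, $p'$ are illustrative symbols, not quantities taken from the measurements), and it does not specify how the batch and contraction axes are mapped onto the array's M and K dimensions. Let $p$ be a prime with $n \mid p - 1$ and let $\omega$ be a primitive $n$-th root of unity modulo $p$. The NTT of $a \in \mathbb{F}_p^{\,n}$ is
\[
  \hat{a}_j \;=\; \sum_{k=0}^{n-1} \omega^{jk}\, a_k \bmod p, \qquad 0 \le j < n,
\]
so that, with $W_{jk} = \omega^{jk}$ and $B$ input polynomials stacked as the columns of $A \in \mathbb{F}_p^{\,n \times B}$, the whole batch is a single matrix product
\[
  \hat{A} \;=\; W A \bmod p .
\]
Each entry of $W A$ must still be reduced modulo $p$. Assuming the standard Montgomery formulation with $R = 2^{r} > p$ and $p' = -p^{-1} \bmod R$, the reduction of an intermediate value $0 \le T < pR$ is
\[
  m = (T \bmod R)\, p' \bmod R, \qquad
  t = \frac{T + m p}{R}, \qquad
  \mathrm{REDC}(T) =
  \begin{cases}
    t - p, & t \ge p,\\
    t,     & \text{otherwise,}
  \end{cases}
\]
which satisfies $\mathrm{REDC}(T) \equiv T R^{-1} \pmod{p}$. In this notation, an eager schedule applies $\mathrm{REDC}$ after every product, whereas a relaxed schedule accumulates several products before reducing.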
