The rapid expansion of GPU-accelerated computing has enabled major advances in large-scale artificial intelligence (AI), while heightening concerns about how accelerators are observed or governed once deployed. Governance is essential to ensure that large-scale compute infrastructure is not silently repurposed to train models, circumvent usage policies, or operate outside legal oversight. Because current GPUs expose limited trusted telemetry and can be modified or virtualized by adversaries, we explore whether compute-based measurements can provide actionable signals of utilization when host and device are untrusted. We introduce a measurement framework that leverages architectural characteristics of modern GPUs to generate timing- and memory-based observables that correlate with compute activity. Our design draws on four complementary primitives: (1) a probabilistic, workload-driven mechanism inspired by Proof-of-Work (PoW) to expose parallel effort, (2) sequential, latency-sensitive workloads derived via Verifiable Delay Functions (VDFs) to characterize scalar execution pressure, (3) General Matrix Multiplication (GEMM)-based tensor-core measurements that reflect dense linear-algebra throughput, and (4) a VRAM-residency test that distinguishes on-device memory locality from off-chip access through bandwidth-dependent hashing. These primitives provide statistical and behavioral indicators of GPU engagement that remain observable even without trusted firmware, enclaves, or vendor-controlled counters. We evaluate their responses to contention, architectural alignment, memory pressure, and power overhead, showing that timing shifts and residency latencies reveal meaningful utilization patterns. Our results illustrate how compute-based telemetry can complement future accountability mechanisms by exposing architectural signals relevant to post-deployment GPU governance.
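The idea behind the PoW-inspired primitive (1) can be illustrated with a minimal CPU-side sketch, not the paper's GPU implementation: search for a nonce whose hash falls below a difficulty target, so that the expected number of trials (and hence wall-clock time) scales predictably with the difficulty parameter and becomes a statistical proxy for compute effort. The function name, seed, and difficulty value below are illustrative assumptions.

```python
import hashlib
import time


def pow_probe(difficulty_bits: int, seed: bytes = b"probe") -> tuple[int, float]:
    """Find a nonce whose SHA-256 digest has `difficulty_bits` leading zero
    bits. Returns (number of trials, elapsed wall-clock seconds).

    Expected trials grow as 2**difficulty_bits, so elapsed time is a
    probabilistic measure of hashing throughput on this device.
    """
    target = 1 << (256 - difficulty_bits)  # digest must fall below this value
    nonce = 0
    start = time.perf_counter()
    while True:
        digest = hashlib.sha256(seed + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce + 1, time.perf_counter() - start
        nonce += 1


# Repeated probes at a fixed difficulty yield a throughput distribution;
# shifts in that distribution under contention are the utilization signal.
trials, elapsed = pow_probe(difficulty_bits=12)
```

On a GPU, many such searches would run in parallel across threads, and the aggregate solution rate, rather than a single search time, would be the observable.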