Dynamic quantization has emerged as a practical way to improve the utilization and efficiency of machine learning model serving. Unlike static quantization, which fixes its quantization parameters offline, dynamic quantization operates on tensors at run time and adapts its parameters to the actual input data. Today's mainstream machine learning frameworks, including ML compilers and inference engines, frequently recommend dynamic quantization as a first step in optimizing model serving, because it can significantly reduce memory usage and computational load, yielding faster token generation and more efficient serving without substantial loss in model accuracy. In this paper, we reveal a critical vulnerability in dynamic quantization: an adversary can exploit this quantization strategy to steal sensitive user data placed in the same batch as the adversary's input. Our analysis demonstrates that dynamic quantization, when improperly implemented or configured, can create side channels that expose information about other inputs within the same batch. We call this phenomenon Quantamination, that is, contamination arising from quantization. Specifically, we show that at least four of the most popular ML frameworks in use today either default to, or can be configured to use, settings that leak data across the batch boundary. In theory, this leakage allows attackers to partially or even fully recover other users' batched input data, which represents a serious privacy risk for existing ML serving frameworks.
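To make the side channel concrete, the following is a minimal pure-Python sketch, not the attack evaluated in this paper and not the implementation of any particular framework. It assumes symmetric 8-bit absmax quantization, and the function names `quantize_per_tensor` and `quantize_per_row` as well as the example values are hypothetical. When a single scale is computed over the whole batch, the attacker's own quantized values change depending on the magnitude of the co-batched input; when each row gets its own scale, they do not.

```python
def quantize_per_tensor(batch, bits=8):
    """Quantize-dequantize with ONE dynamic scale shared by the whole batch."""
    qmax = 2 ** (bits - 1) - 1
    # The scale depends on every row in the batch, including other users' rows.
    absmax = max(abs(x) for row in batch for x in row)
    scale = absmax / qmax if absmax > 0 else 1.0
    return [[round(x / scale) * scale for x in row] for row in batch]


def quantize_per_row(batch, bits=8):
    """Quantize-dequantize with a separate dynamic scale for each row."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in batch:
        # The scale depends only on this row, so rows cannot influence each other.
        absmax = max(abs(x) for x in row)
        scale = absmax / qmax if absmax > 0 else 1.0
        out.append([round(x / scale) * scale for x in row])
    return out


# Hypothetical inputs: the attacker's row is fixed; the victim's row varies.
attacker = [0.30, -0.70, 0.55]
victim_small = [0.10, 0.20, -0.40]
victim_large = [9.00, -3.00, 6.00]

# Shared scale: the attacker's quantized row changes with the victim's data.
shared_small = quantize_per_tensor([attacker, victim_small])[0]
shared_large = quantize_per_tensor([attacker, victim_large])[0]

# Per-row scale: the attacker's quantized row is independent of the victim.
row_small = quantize_per_row([attacker, victim_small])[0]
row_large = quantize_per_row([attacker, victim_large])[0]

print("shared scale leaks:", shared_small != shared_large)
print("per-row scale leaks:", row_small != row_large)
```

In this toy setting the attacker learns only a coarse signal, namely whether the co-batched input raised the shared scale. Turning that signal into recovery of the other input requires further steps that this sketch does not attempt.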