Quality Is Not a Safety Proxy Under Quantization

Quantized checkpoints are often screened first with quality metrics and only later, if at all, with direct safety tests. This paper audits that shortcut on a matched 51-row matrix spanning 6 models, 4 families, a 7-level GGUF ladder, and AWQ/GPTQ INT4 checkpoints. In this matrix the shortcut fails: all 36 quality-safety pairings split direction across models, and 9 hidden-danger rows plus 1 near-hidden-danger row show quality stable or improved while refusal falls by 12-68 percentage points. Seven of the 11 AWQ/GPTQ rows are hidden-danger. A four-probe mechanistic follow-up over the 17 Hugging Face-backed FP16/AWQ/GPTQ cells does not rescue the shortcut: entropy, refusal-direction, and calibration probes are weak or null separators of dangerous rows, and although probe-identified safety-associated neurons absorb 1.39$\times$ more quantization error overall ($p < 5 \times 10^{-7}$), the effect is not regime-specific. Claude Sonnet 4 relabels 11,470 items in a predefined stratified set, agrees with the primary gemma3:12b judge on 89.9\% of rows ($\kappa = 0.873$, 95\% CI [0.866, 0.881]), and changes 0/10 hidden-danger cells. A calibrated study-internal behavioral screen -- the Refusal Template Stability Index (RTSI), built from four refusal-template drift features and calibrated on this matrix -- routes 10/10 hidden- or near-hidden-danger rows to direct safety testing (Wilson 95\% CI lower bound 0.72) while leaving 23 of 45 non-baseline rows in a low-risk bucket under both in-sample scoring and row-level leave-one-out validation; on the same matrix, the best single-feature baselines (unique-prefix-rate-delta, raw refusal-rate delta) recover 9/10 and 8/10 respectively at matched bucket size, and cross-stack transfer requires recalibration. For the quantized checkpoints, model families, and safety outcomes studied here, retained quality cannot waive direct safety evaluation.
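
The Wilson lower bound quoted for the 10/10 routing result can be checked by hand. The sketch below is a minimal illustration, assuming the standard two-sided 95\% Wilson score interval without continuity correction; the abstract does not state which variant was used, and the function name `wilson_lower_bound` is hypothetical rather than taken from the study's code.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.959964) -> float:
    """Lower end of the Wilson score interval for a binomial proportion.

    Assumption: no continuity correction; z defaults to the two-sided
    95% normal quantile.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")

    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = p_hat + z2 / (2.0 * trials)
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials * trials))
    return (center - margin) / denom


# 10 of 10 hidden- or near-hidden-danger rows routed to direct safety testing.
print(round(wilson_lower_bound(10, 10), 2))
```

With all 10 rows routed, the bound reduces to $n / (n + z^2) = 10 / 13.84 \approx 0.72$, so a perfect in-sample recall on 10 rows is still compatible with a true routing rate well below 1.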
