Low-bit post-training quantization (PTQ) is a practical route to deploying reasoning-capable LLMs under tight memory and latency budgets, yet it can markedly impair mathematical reasoning (accuracy drops of up to 69.81% in our harder settings). We address two deployment-critical questions with process-level precision: Where along a step-structured solution does degradation first arise? How can it be mitigated while staying in the low-bit regime? Across widely used PTQ methods (AWQ, GPTQ, SmoothQuant), open-source model families (Qwen, LLaMA; 0.5--7B), and math reasoning benchmarks (GSM8K, MATH, AIME), we perform format-aligned chain-of-thought evaluation with step-aligned attribution and uncover two robust regularities: (i) PTQ disproportionately elevates method and execution errors relative to high-level conceptual mistakes; and (ii) failures emerge early, with the first vulnerable step flipping and cascading to the final answer. These regularities suggest a general intervention principle: restore local token-level margins exactly at the earliest failure frontier. We instantiate this principle as a lightweight measure$\rightarrow$locate$\rightarrow$restore loop that operates directly on the quantized model: detect the first faulty step, construct our "Silver Bullet" datasets, and apply small-scale supervised/preference tuning. In our settings, as few as 332 curated examples and 3--5 minutes of compute on a single GPU recover 4-bit weight-quantized math reasoning toward the full-precision baseline while preserving PTQ efficiency. Our framework is quantizer- and architecture-agnostic within the evaluated regimes, and turns low-bit degradation from a global accuracy problem into a local, reproducible process intervention.
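The measure$\rightarrow$locate$\rightarrow$restore loop can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the margin heuristic, function names, and data layout are all hypothetical stand-ins for the step-aligned attribution and "Silver Bullet" dataset construction described above.

```python
# Hypothetical sketch of the measure -> locate -> restore loop.
# A per-step "margin" score (here a plain float) stands in for the
# token-level margins computed on the quantized model.

def first_faulty_step(step_margins, threshold=0.0):
    """Locate the earliest step whose margin falls below the
    threshold (the 'failure frontier'); return -1 if none do."""
    for i, margin in enumerate(step_margins):
        if margin < threshold:
            return i
    return -1

def build_silver_bullet_dataset(solutions):
    """Collect (prompt, faulty-step index, reference step) records
    for every solution that flips at some step; these records would
    then feed small-scale supervised/preference tuning."""
    dataset = []
    for prompt, steps, margins in solutions:
        idx = first_faulty_step(margins)
        if idx >= 0:
            dataset.append({"prompt": prompt,
                            "faulty_step": idx,
                            "reference": steps[idx]})
    return dataset

# Toy example: the first solution degrades at step 1; the second
# never falls below the threshold and is skipped.
solutions = [
    ("Q1", ["s0", "s1", "s2"], [0.8, -0.3, 0.5]),
    ("Q2", ["s0", "s1"], [0.9, 0.7]),
]
data = build_silver_bullet_dataset(solutions)
# data -> [{"prompt": "Q1", "faulty_step": 1, "reference": "s1"}]
```

The key design point mirrored here is locality: only the earliest below-margin step is harvested, so the resulting dataset targets the failure frontier rather than whole solutions.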