Many applications of large language models (LLMs) require only a narrow capability, yet common post-training quantization (PTQ) pipelines assign precision largely without regard to the target task. As a result, they may spend bits on layers that are less relevant to that task. We propose per-task mixed-precision PTQ guided by hidden representations. Given a small set of unlabeled calibration prompts from the target task, we estimate layer importance and, under a bit-allocation budget, assign higher precision to task-relevant layers and lower precision to the rest. We introduce three task-aware allocation signals: \textbf{TAQ}, which scores layers using an information-stability criterion derived from activation geometry; \textbf{TAQO}, which ranks layers by their direct sensitivity to single-layer quantization; and \textbf{TAQ-KL}, which measures output sensitivity via the KL divergence induced by a noise proxy for quantization error. Together, these methods provide a simple post-training framework that connects mechanistic signals to quantization decisions, enabling task-aligned compression without additional training.
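The TAQ-KL signal and the budgeted allocation can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: a small numpy network stands in for an LLM's layers, Gaussian noise injected at one layer's output serves as the proxy for quantization error, calibration inputs are random vectors, and the 4-bit/8-bit greedy rule under a hypothetical total-bit budget is one simple way to turn sensitivity ranks into precision assignments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-layer network: a hypothetical stand-in for an LLM's layers.
dims = [8, 16, 16, 4]
weights = [rng.normal(scale=0.5, size=(dims[i], dims[i + 1])) for i in range(3)]

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def forward(x, noise_layer=None, sigma=0.1):
    """Forward pass; optionally perturb one layer's activations
    with Gaussian noise as a proxy for quantization error."""
    h = x
    for i, w in enumerate(weights):
        h = np.tanh(h @ w)
        if i == noise_layer:
            h = h + rng.normal(scale=sigma, size=h.shape)
    return softmax(h)

def kl(p, q, eps=1e-9):
    # Mean KL(p || q) over the calibration batch.
    return np.sum(p * np.log((p + eps) / (q + eps)), axis=-1).mean()

# "Calibration prompts" -> random inputs here (illustrative only).
x = rng.normal(size=(32, dims[0]))
clean = forward(x)
scores = [kl(clean, forward(x, noise_layer=i)) for i in range(3)]

# Greedy allocation under a total-bit budget: every layer starts at
# 4 bits; the most output-sensitive layers are upgraded to 8 bits
# while the budget allows.
budget_bits = 18                      # hypothetical budget for 3 layers
order = np.argsort(scores)[::-1]      # most sensitive first
alloc = {int(i): 4 for i in range(3)}
spare = budget_bits - 4 * len(alloc)
for i in order:
    if spare >= 4:
        alloc[int(i)] = 8
        spare -= 4
print(alloc)
```

With a budget of 18 bits over three layers, only the single most sensitive layer is upgraded to 8 bits; the ranking itself, not the specific 4/8 scheme, is the task-aware ingredient.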