Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is expensive. Inference-time steering offers a lightweight alternative, yet most existing methods optimize proxies for correctness rather than correctness itself. We introduce CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that captures distributed correctness signals from a model's internal activations using weight-decay MLP probes. We evaluate CORAL across three 7B-parameter models and find that, on average, it improves accuracy by 10\% and reduces expected calibration error (ECE) by 50\%. We additionally demonstrate that these gains transfer without retraining to the complete published test sets of four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA), with average improvements of 14\% in accuracy and 49\% in ECE. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL thus provides a compute-efficient, transferable, and calibration-aware approach to improving multiple-choice question answering (MCQA) performance at inference time.
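To make the probing idea concrete, the following is a minimal sketch of a weight-decay-regularized MLP probe trained to predict answer correctness from residual-stream activations. All names, dimensions, and hyperparameters here are illustrative assumptions, not the paper's implementation; synthetic data stands in for real model activations, with a correctness signal deliberately distributed across many dimensions rather than concentrated in a single neuron.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MLPProbe:
    """Two-layer MLP probe: activations -> P(answer is correct).

    Weight decay (L2 penalty, strength `wd`) regularizes both layers,
    discouraging the probe from overfitting to spurious activation noise.
    All hyperparameters below are illustrative, not from the paper.
    """

    def __init__(self, d_model, d_hidden=32, weight_decay=1e-2, lr=0.1):
        self.W1 = rng.normal(0.0, 0.1, (d_model, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(0.0, 0.1, d_hidden)
        self.b2 = 0.0
        self.wd = weight_decay
        self.lr = lr

    def forward(self, X):
        self.h = np.tanh(X @ self.W1 + self.b1)   # hidden features, (n, d_hidden)
        return sigmoid(self.h @ self.W2 + self.b2)  # correctness probability, (n,)

    def step(self, X, y):
        """One gradient step on binary cross-entropy + L2 weight decay."""
        p = self.forward(X)
        n = len(y)
        dz2 = (p - y) / n                           # dBCE/dlogit
        gW2 = self.h.T @ dz2 + self.wd * self.W2
        gb2 = dz2.sum()
        dh = np.outer(dz2, self.W2) * (1.0 - self.h ** 2)  # tanh backprop
        gW1 = X.T @ dh + self.wd * self.W1
        gb1 = dh.sum(axis=0)
        self.W1 -= self.lr * gW1; self.b1 -= self.lr * gb1
        self.W2 -= self.lr * gW2; self.b2 -= self.lr * gb2

# Synthetic "activations": the correctness signal is spread across all
# dimensions via a random direction, mimicking a distributed representation.
d_model = 64
direction = rng.normal(size=d_model)
X = rng.normal(size=(512, d_model))
y = (X @ direction + 0.5 * rng.normal(size=512) > 0).astype(float)

probe = MLPProbe(d_model)
for _ in range(300):
    probe.step(X, y)

acc = float(((probe.forward(X) > 0.5) == y).mean())
print(f"probe train accuracy: {acc:.2f}")
```

In a steering setting, such a probe's output (or its gradient with respect to the activations) could supply a correctness-aligned direction to add to the residual stream at inference time; the sketch above covers only the probe-training step under these assumptions.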