Background. Large language models are increasingly used in settings where confident but incorrect answers can mislead users. Communicating uncertainty reliably requires a form of metacognition: monitoring when one's own answers are likely to be correct. Yet models' stated confidence is often poorly aligned with answer correctness. We test whether supervised fine-tuning improves uncertainty communication and whether gains transfer across domains and task formats.

Methods. We fine-tuned two models on general knowledge, mathematics, and open-ended trivia questions. We evaluated two tasks: single-question confidence estimation, in which the model reports a numeric confidence for one answer, and pairwise confidence comparison, in which it chooses which of two questions it is more likely to answer correctly. We tested on held-out questions from the training domains and on new medical, legal, and truthfulness benchmarks. We assessed calibration, discrimination, and answer accuracy before and after fine-tuning.

Results. Fine-tuning improves the alignment between stated confidence and observed accuracy and increases the models' ability to assign higher confidence to correct answers than to incorrect ones. Gains occur within the training domains and, to a lesser extent, in new domains. However, training on a single task does not reliably transfer between single-question confidence estimation and pairwise confidence comparison. Multitask fine-tuning produces broader gains in the models and tasks studied here.

Conclusions. Uncertainty communication in large language models is trainable, but transfer across metacognitive tasks is limited. Joint training on multiple confidence tasks may support broader generalization, although further tests across model families and metacognitive tasks are needed.
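
To make the two evaluation notions in the Methods concrete, the following is a minimal sketch of one common way to measure each. The abstract does not name the specific metrics used, so the choices here are assumptions: expected calibration error (ECE) stands in for calibration, and the area under the ROC curve (AUROC) stands in for discrimination. The function names and the 10-bin default are likewise illustrative and not taken from the study.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Assumed calibration metric: weighted mean |avg confidence - accuracy| per bin.

    conf holds stated confidences in [0, 1]; correct holds 1 (right) or 0 (wrong).
    """
    n = len(conf)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Assumed discrimination metric: probability that a correct answer
    receives higher confidence than an incorrect one (ties count as 0.5)."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Under these assumed definitions, a lower ECE means stated confidence tracks observed accuracy more closely, while an AUROC near 1 means correct answers are consistently given higher confidence than incorrect ones. The two can diverge: a model may rank its answers well while still being systematically over- or under-confident.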