Calibration, which establishes the correlation between accuracy and model confidence, is important for LLM development. We design three off-the-shelf calibration methods based on self-consistency (Wang et al., 2022) for math reasoning tasks. Evaluated on two popular benchmarks (GSM8K and MathQA) with strong open-source LLMs (Mistral and LLaMA2), our methods bridge model confidence and accuracy better than existing methods based on p(True) or logits (Kadavath et al., 2022).
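As a rough illustration of the general idea, the sketch below derives a confidence score from the agreement among several sampled answers and then measures how well such scores track accuracy. It is not a reproduction of the three methods proposed here, which the abstract does not specify. The function names, the agreement-ratio confidence, and the use of expected calibration error (ECE) with equal-width bins are all assumptions made for this example.

```python
from collections import Counter


def agreement_confidence(answers):
    """Return (majority_answer, confidence) from sampled final answers.

    Assumption: confidence is the fraction of samples that agree with the
    majority answer. This is one simple self-consistency signal, not
    necessarily one of the paper's three methods.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        acc = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(acc - avg_conf)
    return ece


# Hypothetical sampled answers for one math problem.
answer, confidence = agreement_confidence(["18", "18", "20", "18"])
```

A lower ECE means confidence and accuracy are more closely aligned; in practice the scores would be computed over a full benchmark rather than a handful of problems.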