A standard technique for scaling inference-time reasoning is Self-Consistency, in which multiple candidate answers are sampled from an LLM and the most common answer is selected. More recently, it has been shown that weighted majority voting (e.g., Confidence-Informed Self-Consistency (CISC)), which assigns a confidence value to each candidate answer and chooses the answer with the largest accumulated score, tends to be more accurate than Self-Consistency on a wide range of popular benchmarks. In practice, weighted majority voting requires calling a critic LLM on each candidate's reasoning trace to produce that answer's confidence score. This second series of LLM calls greatly increases the overhead and cost of weighted majority voting, despite its potential performance benefits. To reduce this expense, we propose VecCISC, a lightweight, adaptive framework that uses a measure of semantic similarity to filter out reasoning traces that are semantically equivalent to others, degenerate, or hallucinated, thereby decreasing the number of candidate answers that the critic must evaluate. We evaluate VecCISC on five challenging, widely adopted datasets spanning mathematics, chemistry, biology, commonsense reasoning, and the humanities. Our results demonstrate that VecCISC reduces total token usage by 47% while maintaining or exceeding the accuracy of CISC.
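The difference between the two voting schemes, and the idea of filtering by similarity before calling the critic, can be illustrated with a minimal Python sketch. It is not the VecCISC implementation: the function names, the greedy filtering rule, the use of cosine similarity, and the 0.95 threshold are all assumptions made for illustration, and the confidences and embeddings are toy values standing in for critic scores and trace embeddings.

```python
import math
from collections import defaultdict


def majority_vote(answers):
    """Self-Consistency: return the most frequent answer."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    return max(counts, key=counts.get)


def weighted_majority_vote(answers, confidences):
    """Weighted voting: return the answer with the largest summed confidence."""
    scores = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        scores[answer] += confidence
    return max(scores, key=scores.get)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_near_duplicates(embeddings, threshold=0.95):
    """Hypothetical greedy filter: keep a trace only if its embedding is
    below `threshold` similarity to every trace already kept.
    Returns the indices of the kept traces."""
    kept = []
    for i, embedding in enumerate(embeddings):
        if all(cosine(embedding, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy data: three sampled answers with assumed critic confidences.
answers = ["A", "A", "B"]
confidences = [0.2, 0.3, 0.9]
plain_choice = majority_vote(answers)
weighted_choice = weighted_majority_vote(answers, confidences)

# Toy trace embeddings: the first two are nearly identical.
embeddings = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
kept_indices = filter_near_duplicates(embeddings)
```

In this toy example, plain voting picks "A" (two votes against one), whereas weighted voting picks "B" (accumulated score 0.9 against 0.5). The filter keeps traces 0 and 2, so only two of the three traces would need a critic call. How confidence is assigned to filtered-out traces, and how degenerate or hallucinated traces are detected, is left open here.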