Generating diverse responses from large language models (LLMs) is crucial for applications such as planning/search and synthetic data generation, where diversity requires distinct answers across generations. Prior approaches rely on increasing the sampling temperature to increase diversity. However, contrary to popular belief, we show that this approach not only produces lower-quality individual generations as temperature increases, but also depends on the model's next-token probabilities being similar to the true distribution of answers. We propose \method{}, an alternative approach that uses the language model itself to partition the solution space into strata. At inference time, a stratum is selected at random and a sample is drawn from within it. To measure diversity, we introduce CoverageQA, a dataset of underspecified questions with multiple equally plausible answers, and assess diversity by measuring the KL divergence between the model's output distribution and the uniform distribution over valid ground-truth answers. As computing the probability of each response/solution is infeasible for proprietary models, we instead measure recall on ground-truth solutions. Our evaluation shows that \method{} achieves a 0.05 higher recall than GPT-4o and an average 0.36 reduction in KL divergence compared to Llama 3.
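The stratified sampling idea can be sketched as follows. This is a minimal illustration only: the strata here are hardcoded stand-ins for partitions that \method{} derives from the language model itself, and the example question, stratum names, and answers are all hypothetical.

```python
import random

# Illustrative strata for an underspecified question such as
# "Name a US state". In \method{}, the LLM proposes the partition
# (e.g., by region); here it is hardcoded for demonstration.
strata = {
    "Northeast": ["New York", "Vermont", "Maine"],
    "South": ["Texas", "Georgia", "Florida"],
    "West": ["California", "Oregon", "Nevada"],
}

def stratified_sample(strata, rng=random):
    """Select a stratum uniformly at random, then sample within it.

    Sampling within a stratum stands in for conditioning the LLM's
    generation on the chosen stratum at inference time.
    """
    stratum = rng.choice(list(strata))
    return rng.choice(strata[stratum])

samples = [stratified_sample(strata) for _ in range(5)]
print(samples)
```

Because the stratum is drawn uniformly before sampling, repeated generations spread across strata instead of concentrating on the model's highest-probability answers.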
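The diversity metric can be sketched as a KL divergence from the empirical output distribution to the uniform distribution over valid answers. This is a minimal computation under assumptions not stated in the abstract: responses outside the valid-answer set are simply ignored, and answers are compared by exact string match.

```python
import math
from collections import Counter

def kl_to_uniform(responses, valid_answers):
    """KL(empirical output distribution || uniform over valid answers).

    Assumption: responses not in valid_answers are dropped; how
    invalid outputs are handled in practice is not specified here.
    """
    counts = Counter(r for r in responses if r in valid_answers)
    total = sum(counts.values())
    uniform_p = 1.0 / len(valid_answers)
    kl = 0.0
    for answer, count in counts.items():
        p = count / total
        kl += p * math.log(p / uniform_p)
    return kl

# A perfectly uniform sample over all valid answers gives KL = 0;
# collapsing onto a few answers drives the KL up.
print(kl_to_uniform(["a", "b", "c", "d"], ["a", "b", "c", "d"]))  # → 0.0
```

Lower values indicate that generations cover the valid answers more evenly, which is the sense of "diversity" measured on CoverageQA.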