Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations

Different open-ended generation tasks require different degrees of output diversity. However, current LLMs are often miscalibrated. They collapse to overly homogeneous outputs for creative tasks and hallucinate diverse but incorrect responses for factual tasks. We argue that these two failure modes are unified by, and can both be addressed by, the notion of effective generation space size (GSS) -- the set of semantically distinct outputs a model considers for a prompt. We present GSSBench, a task suite of prompt pairs with ground-truth GSS relationships to assess different metrics and understand where models diverge from desired behavior. We find that hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty quantification metrics, while using only model internals, providing interpretable insights into a model's internal task representations. We demonstrate three applications of GSS: (1) detecting prompt ambiguity and predicting clarification questions for better grounding, (2) interpreting overthinking and underthinking in reasoning models, and (3) steering models to expand their generation space to yield high-quality and diverse outputs.
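To make the idea of measuring generation space from model internals concrete, here is a minimal sketch of an EigenScore-style computation: sample several responses to one prompt, take one embedding per response, and use the regularized log-determinant of their Gram matrix as a rough proxy for how spread out the responses are. This is an illustration under stated assumptions, not the authors' implementation. The function name `eigenscore`, the regularizer `alpha`, the per-embedding centering, and the synthetic embeddings are all hypothetical choices; in practice the embeddings would come from the model's hidden states.

```python
import math
import random


def eigenscore(embeddings, alpha=1e-3):
    """EigenScore-style spread of K response embeddings (each a list of d floats).

    Assumption: higher values mean the sampled responses occupy a larger
    region of embedding space, i.e. a larger effective generation space.
    """
    k = len(embeddings)
    # Center each embedding across its own feature dimensions.
    centered = []
    for z in embeddings:
        mean = sum(z) / len(z)
        centered.append([v - mean for v in z])
    # K x K Gram matrix between responses, regularized to be positive definite.
    gram = [[sum(a * b for a, b in zip(zi, zj)) for zj in centered]
            for zi in centered]
    for i in range(k):
        gram[i][i] += alpha
    # Log-determinant via Cholesky: logdet(A) = 2 * sum(log(L_ii)).
    chol = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = gram[i][j] - sum(chol[i][m] * chol[j][m] for m in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    logdet = 2.0 * sum(math.log(chol[i][i]) for i in range(k))
    return logdet / k


# Synthetic stand-ins for response embeddings (hypothetical data).
rng = random.Random(0)
base = [rng.gauss(0.0, 1.0) for _ in range(8)]
# "Narrow" prompt: four near-identical responses.
narrow = [[b + 1e-3 * rng.gauss(0.0, 1.0) for b in base] for _ in range(4)]
# "Wide" prompt: four unrelated responses.
wide = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]

print("narrow:", eigenscore(narrow))
print("wide:  ", eigenscore(wide))
```

Under these assumptions, the near-duplicate set scores lower than the unrelated set, which is the kind of ordering that a prompt pair with a known GSS relationship would be used to check.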
