To be deployed in chatbot systems, large language models (LLMs) must be aligned with human conversational conventions. However, because they are trained mainly on web-scraped data, existing LLMs have a voice closer to informational text than to actual human speech. In this paper, we examine the effect of decoding methods, including Beam Search, Top-K Sampling, and Nucleus Sampling, on the alignment between LLM-generated and human conversations. We present new measures of alignment in substance, style, and psychometric orientation, and experiment with two conversation datasets. Our results yield nuanced insights: fewer beams in Beam Search and lower values of P in Nucleus Sampling produce better alignment. We also find that task-oriented and open-ended datasets behave differently in terms of alignment, underscoring the importance of taking the context of the interaction into account.
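For readers unfamiliar with the decoding methods named above, the following is a minimal sketch of Nucleus (Top-P) Sampling in NumPy. The function name and implementation are illustrative only, not the paper's code: the token with the highest probability mass is kept along with just enough further tokens to exceed a cumulative probability P, and the next token is sampled from that renormalized "nucleus". Lower P shrinks the nucleus, making output more conservative.

```python
import numpy as np

def nucleus_sample(logits, p=0.9, rng=None):
    """Sample a token id from the smallest set of tokens whose
    cumulative probability exceeds p (nucleus / top-p sampling)."""
    rng = rng or np.random.default_rng()
    # softmax over the vocabulary (shifted for numerical stability)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # most probable first
    cum = np.cumsum(probs[order])
    # keep tokens up to and including the one whose cumulative mass crosses p
    cutoff = int(np.searchsorted(cum, p)) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()    # renormalize inside the nucleus
    return int(rng.choice(keep, p=kept))
```

With a sharply peaked distribution and a small P, the nucleus collapses to the single most likely token, so sampling becomes effectively greedy; with P close to 1, nearly the full vocabulary remains eligible.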