Large Language Models (LLMs) exhibit remarkable fluency and competence across a variety of natural language tasks. However, recent research has highlighted their sensitivity to variations in input prompts. To deploy LLMs safely and reliably, their outputs must be consistent when they are prompted with expressions that carry the same meaning or intent. While some existing work has explored how state-of-the-art LLMs address this issue, those evaluations have been confined to assessing lexical equality of single- or multi-word answers, overlooking the consistency of generated text sequences. For a more comprehensive understanding of the consistency of LLMs in open-ended text generation scenarios, we introduce a general measure of semantic consistency and formulate multiple versions of this metric to evaluate the performance of various LLMs. Our proposed measure demonstrates significantly higher consistency and a stronger correlation with human evaluations of output consistency than traditional metrics based on lexical consistency. Finally, we propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency. When evaluated on closed-book question answering using answer variations from the TruthfulQA benchmark, A2C increases accuracy metrics for pretrained and fine-tuned LLMs by up to 47%, and semantic consistency metrics for instruction-tuned models by up to 7-fold.