Large language models (LLMs) have been shown to possess impressive capabilities, while also raising crucial concerns about the faithfulness of their responses. A primary issue in this context is how LLMs handle unanswerable queries, which often leads to hallucinatory behavior due to overconfidence. In this paper, we explore the behavior of LLMs when presented with unanswerable queries. We ask: do models \textbf{represent} the fact that the question is unanswerable when generating a hallucinatory answer? Our results show strong indications that such models encode the answerability of an input query, with the representation of the first decoded token often being a strong indicator. These findings shed new light on the spatial organization within the latent representations of LLMs, unveiling previously unexplored facets of these models. Moreover, they pave the way for the development of improved decoding techniques with better adherence to factual generation, particularly in scenarios where query unanswerability is a concern.