Although Large Language Models (LLMs) have shown great potential in Natural Language Generation, it remains challenging to characterize the uncertainty of model generations, i.e., to determine when users can trust model outputs. Our research starts from the heuristic observation that, for auto-regressive LLMs, tokens contribute unequally to the meaning of a generation: some tokens are more relevant (or representative) than others, yet all tokens are valued equally when estimating uncertainty. This stems from linguistic redundancy, where a few keywords are usually sufficient to convey the meaning of a long sentence. We term these inequalities generative inequalities and investigate how they affect uncertainty estimation. Our results reveal that a considerable number of tokens and sentences containing limited semantics are weighted equally or even more heavily when estimating uncertainty. To tackle the biases posed by generative inequalities, we propose jointly Shifting Attention to more Relevant (SAR) components at both the token level and the sentence level while estimating uncertainty. We conduct experiments on popular "off-the-shelf" LLMs (e.g., OPT, LLaMA) with model sizes up to 30B and on powerful commercial LLMs (e.g., Davinci from OpenAI), across various free-form question-answering tasks. Experimental results and detailed demographic analysis indicate the superior performance of SAR. Code is available at https://github.com/jinhaoduan/shifting-attention-to-relevance.
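To make the token-level idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that per-token negative log-likelihoods and per-token relevance scores are already available. The function names, the example values, and the simple normalize-and-reweight rule are all hypothetical; the abstract does not specify how SAR computes relevance, so the linked repository is the reference for that.

```python
from typing import Sequence


def uniform_uncertainty(token_nll: Sequence[float]) -> float:
    """Length-normalized uncertainty: every token counts equally."""
    if not token_nll:
        raise ValueError("token_nll must be non-empty")
    return sum(token_nll) / len(token_nll)


def relevance_weighted_uncertainty(
    token_nll: Sequence[float], relevance: Sequence[float]
) -> float:
    """Uncertainty where each token's NLL is weighted by its normalized relevance.

    Assumption: `relevance` holds non-negative scores supplied by the caller;
    how such scores are obtained is outside the scope of this sketch.
    """
    if len(token_nll) != len(relevance):
        raise ValueError("token_nll and relevance must have the same length")
    total = sum(relevance)
    if total <= 0:
        raise ValueError("relevance scores must sum to a positive value")
    return sum(nll * r / total for nll, r in zip(token_nll, relevance))


# Hypothetical answer "The capital is Paris": the model is confident about the
# filler tokens and unsure about the one keyword that carries the meaning.
nll = [0.1, 0.2, 0.1, 2.0]
rel = [0.05, 0.3, 0.05, 0.6]

uniform = uniform_uncertainty(nll)
weighted = relevance_weighted_uncertainty(nll, rel)
print(f"uniform={uniform:.2f} weighted={weighted:.2f}")
```

In this toy case the uniform average is diluted by confident but uninformative tokens, whereas the relevance-weighted score is dominated by the uncertain keyword, which is the kind of bias the abstract attributes to generative inequalities.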