Large generative language models increasingly interact with humans, but their false responses raise concerns. To mitigate this hallucination effect, selective generation (selectively abstaining from answering) gives generators an effective way to control hallucination when they are uncertain about their answers. However, selective generators interact with users in adversarial environments and receive only partial feedback on the generations they select (e.g., a thumbs up or down on the selected answer). Learning methods for selective generation under such practical setups are therefore crucial, but they are currently missing. To address this limitation, we propose an online learning algorithm for selective generation with partial feedback under an adaptive adversary. In particular, we repurpose an adversarial bandit algorithm to design an online selective generation method with a controllable false discovery rate (FDR), which measures the rate of hallucination. The key building blocks are a novel lemma that converts the regret of any bandit algorithm into the FDR, and the exploitation of a unique structure of selective generation to reuse partial feedback, which we call feedback unlocking. We empirically evaluate the proposed online selective generation algorithm with partial feedback over diverse learning environments, demonstrating that it controls the FDR while maintaining reasonable selection efficiency (i.e., the ratio of non-abstaining answers) compared to baselines.
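The abstract names selective generation, the FDR, selection efficiency, and feedback unlocking only at a high level. The Python sketch that follows is one plausible reading, not the paper's method. It assumes a selective generator that answers iff a confidence score reaches a threshold τ drawn from a finite grid. It takes the FDR to be the fraction of selected (non-abstained) answers that are wrong. It treats a threshold as "unlocked" when its outcome can be inferred from a single round of partial feedback. All function names (`fdr_and_efficiency`, `unlocked_outcomes`, `observation_probability`) are hypothetical. The observation probabilities are the denominators that a graph-feedback bandit such as Exp3-SET would use for importance-weighted loss estimates; whether the paper's algorithm uses such an estimator is an assumption.

```python
# Hypothetical sketch: the threshold rule, FDR definition, and unlocking rule are assumptions.
from typing import Dict, List, Optional, Sequence, Tuple


def fdr_and_efficiency(scores: Sequence[float], correct: Sequence[bool],
                       tau: float) -> Tuple[float, float]:
    """Empirical FDR and selection efficiency of the rule 'answer iff score >= tau'."""
    selected = [c for s, c in zip(scores, correct) if s >= tau]
    efficiency = len(selected) / len(scores)  # ratio of non-abstaining answers
    # FDR: fraction of selected answers that are wrong; defined as 0 if nothing is selected.
    fdr = sum(not c for c in selected) / len(selected) if selected else 0.0
    return fdr, efficiency


def unlocked_outcomes(score: float, feedback: Optional[bool],
                      grid: Sequence[float]) -> Dict[float, str]:
    """Outcomes that are known for each threshold in `grid` after one round.

    `feedback` is the user's thumbs up (True) or down (False), or None when the
    played threshold abstained and therefore no feedback was received.
    """
    known: Dict[float, str] = {}
    for tau in grid:
        if score < tau:
            known[tau] = "abstain"  # abstaining needs no user feedback
        elif feedback is not None:
            # The answer is identical for every threshold; only whether it is shown differs.
            known[tau] = "correct" if feedback else "wrong"
    return known


def observation_probability(score: float, grid: Sequence[float],
                            probs: Sequence[float]) -> List[float]:
    """Probability that each threshold's outcome is observed when the played
    threshold is sampled from `probs` (one probability per grid point).

    An abstaining threshold is always observed; a selecting threshold is observed
    only if the played threshold also selects, so that feedback arrives.
    """
    p_select = sum(p for tau, p in zip(grid, probs) if score >= tau)
    return [1.0 if score < tau else p_select for tau in grid]


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
print(fdr_and_efficiency([0.9, 0.8, 0.4, 0.2], [True, False, True, False], 0.5))  # (0.5, 0.5)
print(unlocked_outcomes(0.6, None, grid))  # {0.75: 'abstain', 1.0: 'abstain'}
```

The structure being exploited is that the generated answer does not depend on τ; only the decision to show it does. A thumbs up or down on a shown answer therefore reveals the outcome for every threshold, while an abstention reveals the outcome only for thresholds above the score.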