Large Language Models (LLMs) are typically trained to predict in the forward direction of time. However, recent works have shown that prompting these models to look back and critique their own generations can produce useful feedback. Motivated by this, we explore whether LLMs can be empowered to think (predict and score) backwards, providing unsupervised feedback that complements forward LLMs. To this end, we introduce Time Reversed Language Models (TRLMs), which score and generate queries when conditioned on responses, effectively functioning in the reverse direction of time. Further, to infer effectively in the response-to-query direction, we pre-train and fine-tune a language model (TRLM-Ba) in the reverse token order from scratch. We show empirically (and theoretically, in a stylized setting) that time-reversed models can indeed complement forward model predictions when used to score the query given the response for re-ranking multiple forward generations. We obtain up to 5\% improvement on the widely used AlpacaEval Leaderboard over the strong baseline of best-of-N re-ranking using self log-perplexity scores. We further show that TRLM scoring outperforms conventional forward scoring of response given query, yielding significant gains in applications such as citation generation and passage retrieval. Finally, we leverage the generative ability of TRLMs to augment, or provide unsupervised feedback to, the input safety filters of LLMs, demonstrating a drastic reduction in false negative rate with negligible impact on false positive rate against several attacks published on the popular JailbreakBench leaderboard.
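The re-ranking mechanism described above can be sketched in a few lines: generate N candidate responses with a forward model, then pick the one a time-reversed scorer assigns the highest log-likelihood of the query given the response. The sketch below is illustrative only; `reverse_log_prob` is a hypothetical stand-in for a real TRLM scorer (here a toy token-overlap heuristic so the example runs end-to-end), not the paper's model.

```python
import math

def reverse_log_prob(query: str, response: str) -> float:
    """Hypothetical stand-in for a TRLM scoring log P(query | response).

    A real implementation would run a reverse-trained LM; this toy
    heuristic rewards token overlap with the query and lightly
    penalizes response length.
    """
    q = set(query.lower().split())
    r = set(response.lower().split())
    overlap = len(q & r)
    return math.log(1 + overlap) - 0.01 * len(r)

def rerank_best_of_n(query: str, candidates: list[str]) -> str:
    # Score each forward-generated candidate in the reverse
    # (response -> query) direction and return the best one.
    return max(candidates, key=lambda resp: reverse_log_prob(query, resp))

query = "how do transformers use attention"
candidates = [
    "The weather today is sunny.",
    "Transformers use attention to weight token interactions.",
    "I like cats.",
]
best = rerank_best_of_n(query, candidates)
```

In practice the candidates would come from sampling a forward LLM N times, and the reverse score can also be combined with the forward model's own log-perplexity rather than replacing it.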