Large language models (LLMs) have demonstrated an impressive ability to synthesize plausible and fluent text. However, they remain vulnerable to hallucinations, so in high-stakes applications their outputs generally require manual verification, which can be time-consuming and difficult. This paper proposes symbolically grounded generation (SymGen) as a simple approach for enabling easier validation of an LLM's output. SymGen prompts an LLM to interleave its regular output text with explicit symbolic references to fields present in some conditioning data (e.g., a table in JSON format). The references can be used to display the provenance of different spans of text in the generation, reducing the effort required for manual verification. Across data-to-text and question answering experiments, we find that LLMs are able to directly output text that makes use of symbolic references while maintaining fluency and accuracy.
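
To make the idea of symbolic references concrete, the following is a minimal sketch of how a generation containing references might be resolved against JSON-like conditioning data while recording which field each rendered span came from. The double-brace placeholder syntax, the dotted field paths, the helper names `lookup` and `render`, and the example data are all illustrative assumptions; the abstract does not specify them, and this is not the authors' implementation.

```python
import re
from typing import Any, List, Tuple

# Assumed (hypothetical) reference syntax: {{ dotted.path.to.field }}
REF_PATTERN = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def lookup(data: Any, path: str) -> Any:
    """Follow a dotted path through nested dicts and lists."""
    current = data
    for key in path.split("."):
        if isinstance(current, list):
            current = current[int(key)]  # numeric segment indexes a list
        else:
            current = current[key]
    return current


def render(template: str, data: Any) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Replace each reference with its value.

    Returns the rendered text and a list of (start, end, path) spans that
    say which field produced which characters of the rendered text.
    """
    parts: List[str] = []
    spans: List[Tuple[int, int, str]] = []
    cursor = 0  # position in the template
    length = 0  # length of the rendered text so far
    for match in REF_PATTERN.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        length += len(literal)
        value = str(lookup(data, match.group(1)))
        spans.append((length, length + len(value), match.group(1)))
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


# Hypothetical conditioning data and a hypothetical model output.
data = {"team": {"name": "Hawks", "points": 102}}
generation = "The {{ team.name }} scored {{ team.points }} points."
text, spans = render(generation, data)
```

In this sketch, a verification interface could use `spans` to highlight each rendered value and link it back to its source field, so a reader checks the reference rather than searching the data by hand.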