Large language models (LLMs) have demonstrated impressive capabilities in natural language generation. However, their output quality can be inconsistent, which poses a challenge for generating natural language from logical forms (LFs). This task requires the generated text to express the exact semantics of the LF, neither omitting any part of it nor introducing hallucinated content. In this work, we address this issue with a novel generate-and-rerank approach: we first generate a set of candidate outputs by prompting an LLM, then rerank them with a task-specific reranker model. In addition, we curate a manually collected dataset to evaluate how well different ranking metrics align with human judgements. The chosen ranking metrics are then used to improve the training and evaluation of the reranker model. Through extensive experiments on three diverse datasets, we demonstrate that the candidates selected by our reranker outperform those selected by baseline methods in semantic consistency and fluency, as measured by three comprehensive metrics. These findings provide strong evidence that our approach improves the quality of generated outputs.
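The generate-and-rerank pipeline described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: `toy_generate` is a hypothetical stand-in for prompting an LLM (it returns canned candidates so the sketch runs offline), `toy_score` is a simple heuristic substituted for the trained task-specific reranker, and the logical-form syntax and the `KNOWN_ENTITIES` vocabulary are invented for the example.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical toy vocabulary; a real system would not rely on a fixed list.
KNOWN_ENTITIES = {"river", "texas", "ohio"}


def toy_generate(logical_form: str, n: int) -> List[str]:
    """Hypothetical stand-in for sampling n candidates from an LLM."""
    canned = [
        "Which rivers are in Texas and Ohio?",  # adds "Ohio" (hallucination)
        "Which rivers flow through Texas?",     # faithful to the LF
        "Which rivers are there?",              # drops "Texas" (omission)
    ]
    return canned[:n]


def toy_score(logical_form: str, candidate: str) -> float:
    """Heuristic substitute for a trained reranker (an assumption).

    Counts known entities that the LF requires but the candidate omits,
    plus known entities the candidate mentions but the LF does not.
    The score is the negated count, so 0.0 is the best possible value.
    """
    lf_terms = set(re.findall(r"[a-z]+", logical_form.lower())) & KNOWN_ENTITIES
    text = candidate.lower()
    mentioned = {entity for entity in KNOWN_ENTITIES if entity in text}
    missing = lf_terms - mentioned
    hallucinated = mentioned - lf_terms
    return float(-(len(missing) + len(hallucinated)))


def rerank(
    logical_form: str,
    candidates: List[str],
    score: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every candidate against the LF and sort best-first."""
    scored = [(cand, score(logical_form, cand)) for cand in candidates]
    # sorted() is stable, so tied candidates keep their generation order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    lf = "answer(river(loc(texas)))"  # hypothetical LF syntax
    for cand, s in rerank(lf, toy_generate(lf, 3), toy_score):
        print(f"{s:+.1f}  {cand}")
```

With these canned candidates, the faithful paraphrase ranks first, while the candidate that adds an entity and the one that drops an entity are each penalised. In the approach the abstract describes, the scoring function would instead be a learned model trained with the chosen ranking metrics.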