Prior work has shown that the order in which concepts are presented to a commonsense generator plays an important role, affecting the quality of the generated sentence. However, it remains a challenge to determine the optimal ordering of a given set of concepts such that a pretrained generator can produce a natural sentence covering all of them. To understand the relationship between the ordering of the input concepts and the quality of the generated sentences, we conduct a systematic study considering multiple language models (LMs) and concept ordering strategies. We find that the BART-large model, when fine-tuned using the order in which concepts appear in the CommonGen training data, consistently outperforms all other LMs considered in this study across multiple evaluation metrics. Moreover, the larger GPT3-based large language model (LLM) variants do not necessarily outperform much smaller LMs on this task, even when fine-tuned on task-specific training data. Interestingly, human annotators significantly reorder input concept sets when manually writing sentences covering those concepts, and this ordering provides the best sentence generations independently of the LM used for the generation, outperforming a probabilistic concept ordering baseline.
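To make the notion of a probabilistic concept ordering concrete, the following is a minimal sketch of one way such a baseline could work. It is an illustrative assumption, not the method used in this study: it estimates how often one concept precedes another in a set of training orderings and then picks the permutation with the highest product of smoothed pairwise precedence probabilities. The function names (`precedence_counts`, `order_score`, `best_ordering`), the smoothing constant, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import permutations


def precedence_counts(training_orders):
    """Count how often concept a appears before concept b in the given orderings."""
    counts = Counter()
    for order in training_orders:
        for i, a in enumerate(order):
            for b in order[i + 1:]:
                counts[(a, b)] += 1
    return counts


def order_score(order, counts, smoothing=1.0):
    """Product of smoothed pairwise estimates of P(a before b) over all pairs in the order."""
    score = 1.0
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            forward = counts[(a, b)] + smoothing
            backward = counts[(b, a)] + smoothing
            score *= forward / (forward + backward)
    return score


def best_ordering(concepts, counts):
    """Exhaustive search over permutations; only feasible for small concept sets."""
    return max(permutations(sorted(concepts)),
               key=lambda order: order_score(order, counts))


# Toy orderings invented for illustration; they are not taken from CommonGen.
toy_orders = [
    ("dog", "catch", "frisbee"),
    ("dog", "frisbee", "catch"),
    ("dog", "catch", "frisbee"),
]
counts = precedence_counts(toy_orders)
ordering = best_ordering({"frisbee", "dog", "catch"}, counts)
prompt = " ".join(ordering)  # the string that would be passed to a generator
```

The exhaustive search grows factorially with the number of concepts, so a sketch like this is only practical for the small concept sets typical of this task; larger sets would need a greedy or beam-based search instead.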