Language models (LMs) have demonstrated remarkable abilities in understanding and generating both natural and formal language. Despite these advances, their integration with real-world environments such as large-scale knowledge bases (KBs) remains underdeveloped, which hampers applications such as semantic parsing and leaves LMs prone to producing "hallucinated" information. This paper presents an experimental investigation into the robustness challenges that LMs encounter in knowledge base question answering (KBQA). The investigation covers scenarios in which the data distribution differs between training and inference: generalization to unseen domains, adaptation to language variations, and transferability across datasets. Our comprehensive experiments reveal that, even when combined with our proposed data augmentation techniques, advanced small and large language models perform poorly along various dimensions. Although LMs are a promising technology, in their current form their robustness in complex environments is fragile and their practicality is limited because of the data distribution issue. This calls for future research on data collection and LM learning paradigms.