Recent advances in the capabilities of large language models (LLMs) have triggered a new surge in LLM evaluation. Most recent evaluation work tends to assess the comprehensive ability of LLMs over a series of tasks. However, their deep structural understanding of natural language is rarely explored. In this work, we examine the ability of LLMs to handle structured semantics on question answering tasks with the help of human-constructed formal languages. Specifically, we implement the inter-conversion of natural and formal language through in-context learning of LLMs to verify their ability to understand and generate structured logical forms. Extensive experiments with models of different sizes and with different formal languages show that today's state-of-the-art LLMs' understanding of logical forms can approach human level overall, but there is still plenty of room for improvement in generating correct logical forms, which suggests that it is more effective to use LLMs to generate more natural language training data to reinforce a small model than to answer questions directly with LLMs. Moreover, our results also indicate that models exhibit considerable sensitivity to different formal languages. In general, the lower the formalization level of a formal language, i.e., the more similar it is to natural language, the more LLM-friendly it is.