Large language models (LLMs) are trained and tested extensively on symbolic representations such as code and graphs, yet real-world user tasks are often specified in natural language. To what extent can LLMs generalize across these representations? Here, we approach this question by studying isomorphic tasks over procedures represented in code, graphs, and natural language (e.g., scheduling steps in planning). We find that training LLMs with popular post-training methods on graph or code data alone does not reliably generalize to the corresponding natural language tasks, while training solely on natural language yields only inefficient performance gains. To close this gap, we propose a two-stage data curriculum that trains first on symbolic data, then on natural language data. The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen model trained with our curriculum closely matches zero-shot GPT-4o on naturalistic planning. Finally, our analysis suggests that successful cross-representation generalization can be interpreted as a form of generative analogy, which our curriculum effectively encourages.