We introduce Lemur and Lemur-Chat, openly accessible language models optimized for both natural language and coding capabilities to serve as the backbone of versatile language agents. The evolution from language chat models to functional language agents demands that models not only master human interaction, reasoning, and planning but also remain grounded in the relevant environments. This calls for a harmonious blend of language and coding capabilities. Lemur and Lemur-Chat address this need, demonstrating balanced proficiency in both domains, unlike existing open-source models, which tend to specialize in one or the other. Through pre-training on a code-intensive corpus and instruction fine-tuning on text and code data, our models achieve state-of-the-art averaged performance across diverse text and coding benchmarks among open-source models. Comprehensive experiments demonstrate Lemur's superiority over existing open-source models and its proficiency across various agent tasks involving human communication, tool usage, and interaction under fully and partially observable environments. The harmonization between natural and programming languages enables Lemur-Chat to significantly narrow the gap with proprietary models on agent abilities, providing key insights into developing advanced open-source agents adept at reasoning, planning, and operating seamlessly across environments. Project repository: https://github.com/OpenLemur/Lemur