Large Language Models (LLMs) have shown promise in a variety of tasks, yet few benchmarks assess their capabilities in embedded system development. In this paper, we introduce EmbedAgent, a paradigm designed to simulate real-world roles in embedded system development, such as Embedded System Programmer, Architect, and Integrator. This paradigm enables LLMs to be tested on tasks that bridge the gap between digital and physical systems, allowing for a more comprehensive assessment of their capabilities. To evaluate LLMs on these tasks, we propose Embedbench, the first comprehensive benchmark for embedded system programming, circuit design, and cross-platform migration. Embedbench consists of 126 cases covering 9 electronic components across 3 hardware platforms. Through extensive experiments on 10 mainstream LLMs, we uncover several key findings. Surprisingly, despite the simplicity of the cases, DeepSeek-R1 achieves only a 55.6% pass@1 rate when provided with schematic information, and 50.0% when tasked with generating the schematic itself. In the cross-platform migration tasks, LLMs perform relatively well with MicroPython on the Raspberry Pi Pico (the top model achieves 73.8% pass@1), but poorly on ESP-IDF, where the best model reaches only 29.4% pass@1. Interestingly, we observe that general-purpose chat LLMs such as DeepSeek-V3 often fail to utilize relevant pre-trained knowledge in this domain, while reasoning LLMs tend to overthink and overlook useful knowledge acquired during pretraining. Based on these insights, we propose two strategies to enhance LLM performance: retrieval-augmented generation and compiler feedback. These strategies yield significant improvements: DeepSeek-R1 reaches a 65.1% pass@1 with correct schematics and 53.1% without, and accuracy on the Arduino-to-ESP32 migration task improves from 21.4% to 27.8%.