EmbedAgent: Benchmarking Large Language Models in Embedded System Development

Large Language Models (LLMs) have shown promise in various tasks, yet few benchmarks assess their capabilities in embedded system development. In this paper, we introduce EmbedAgent, a paradigm designed to simulate real-world roles in embedded system development, such as Embedded System Programmer, Architect, and Integrator. This paradigm enables LLMs to be tested on tasks that bridge the gap between digital and physical systems, allowing for a more comprehensive assessment of their capabilities. To evaluate LLMs on these tasks, we propose Embedbench, the first comprehensive benchmark for embedded system programming, circuit design, and cross-platform migration. Embedbench consists of 126 cases, covering 9 electronic components across 3 hardware platforms. Through extensive experiments on 10 mainstream LLMs, we uncover several key findings. Surprisingly, despite the simplicity of the cases, DeepSeek-R1 achieves only a 55.6% pass@1 rate when provided with schematic information, and 50.0% when tasked with generating the schematics itself. In the cross-platform migration tasks, LLMs show relatively strong performance with MicroPython on the Raspberry Pi Pico (with the top model achieving 73.8% pass@1), but perform poorly on ESP-IDF, where the best model reaches only 29.4% pass@1. Interestingly, we observe that general-purpose chat LLMs like DeepSeek-V3 often fail to utilize relevant pre-trained knowledge in this domain, while reasoning LLMs tend to overthink and overlook efficient knowledge acquired during pretraining. Based on these insights, we propose two strategies, retrieval-augmented generation and compiler feedback, to enhance LLM performance. These strategies yield significant improvements, with DeepSeek-R1 reaching 65.1% pass@1 with correct schematics and 53.1% without. Additionally, the accuracy of the Arduino-to-ESP32 migration task improves from 21.4% to 27.8%.
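Since every headline number above is a pass@1 rate, it may help to see how such a rate can be computed. The sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), averaged over tasks; whether Embedbench is scored in exactly this way is an assumption, not something the abstract states. The function names and the count of 70 passing cases are hypothetical, chosen only because 70 of 126 rounds to the reported 55.6%.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Mean pass@k over tasks; `results` is a list of (n, c) pairs, one per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run (not data from the source): 126 cases, one sample per case,
# 70 of which pass.
results = [(1, 1)] * 70 + [(1, 0)] * 56
score = benchmark_pass_at_k(results, k=1)
print(f"{score:.1%}")
```

With a single sample per case, pass@1 reduces to the plain fraction of cases that pass; the estimator only matters when several samples are drawn per case.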
