Large language models (LLMs) have shown remarkable abilities to generate code; however, their ability to develop software for embedded systems, which requires cross-domain knowledge of hardware and software, has not been studied. In this paper, we develop an extensible, open-source hardware-in-the-loop framework to systematically evaluate leading LLMs (GPT-3.5, GPT-4, PaLM 2) and assess their capabilities and limitations for embedded system development. Our evaluation platform verifies LLM-generated programs physically using sensor-actuator pairs. We observe that even when these tools fail to produce working code, they consistently generate helpful reasoning about embedded design tasks. We leverage this finding to study how human programmers interact with these tools and to develop a human-AI software engineering workflow for building embedded systems. We compare all three models across N=450 experiments and find, surprisingly, that GPT-4 in particular shows an exceptional level of cross-domain understanding and reasoning, in some cases generating fully correct programs from a single prompt. In N=50 trials, GPT-4 produces functional I2C interfaces 66% of the time. GPT-4 also produces register-level drivers, code for LoRa communication, and context-specific power optimizations for an nRF52 program, resulting in an over 740x current reduction to 12.2 µA. We also characterize the models' limitations in order to develop a generalizable human-AI workflow for using LLMs in embedded system development. We evaluate our workflow with 15 users, including novice and expert programmers. We find that our workflow improves productivity for all users and increases the success rate for building a LoRa environmental sensor from 25% to 100%, including for users with zero hardware or C/C++ experience.
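As a quick sanity check on the figures quoted above, the reported numbers can be turned into the quantities they imply. The sketch below is only back-of-the-envelope arithmetic. It treats "over 740x" as exactly 740x, so the baseline current it prints is a lower bound rather than a measured value, and the helper names `implied_baseline_current` and `successes_from_rate` are hypothetical, introduced here for illustration.

```python
# Back-of-the-envelope check of figures quoted in the abstract.
# Assumption: "over 740x" is treated as exactly 740x, so the implied
# baseline current is a lower bound, not a measured value.


def implied_baseline_current(optimized_amps: float, reduction_factor: float) -> float:
    """Return the pre-optimization current implied by a reduction factor."""
    return optimized_amps * reduction_factor


def successes_from_rate(trials: int, rate: float) -> int:
    """Return the number of successful trials implied by a success rate."""
    return round(trials * rate)


optimized = 12.2e-6  # 12.2 microamps, as reported for the nRF52 program
baseline = implied_baseline_current(optimized, 740)
i2c_successes = successes_from_rate(50, 0.66)

print(f"implied baseline current: at least {baseline * 1e3:.2f} mA")
print(f"functional I2C interfaces: {i2c_successes} of 50 trials")
```

Under these assumptions, the unoptimized nRF52 program would have drawn roughly 9 mA, and the 66% I2C result corresponds to 33 of the 50 trials.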