Large language models (LLMs) have shown remarkable abilities to generate code; however, their ability to develop software for embedded systems, which requires cross-domain knowledge of hardware and software, has not been studied. In this paper, we systematically evaluate leading LLMs (GPT-3.5, GPT-4, PaLM 2) to assess their performance in embedded system development, study how human programmers interact with these tools, and develop an AI-based software engineering workflow for building embedded systems. We develop an end-to-end hardware-in-the-loop evaluation platform for verifying LLM-generated programs using sensor-actuator pairs. We compare all three models with N=450 experiments and find, surprisingly, that GPT-4 in particular shows an exceptional level of cross-domain understanding and reasoning, in some cases generating fully correct programs from a single prompt. In N=50 trials, GPT-4 produces functional I2C interfaces 66% of the time. GPT-4 also produces register-level drivers, code for LoRa communication, and context-specific power optimizations for an nRF52 program, resulting in over 740x current reduction to 12.2 µA. We also characterize the models' limitations to develop a generalizable workflow for using LLMs in embedded system development. We evaluate the workflow with 15 users, including novice and expert programmers. We find that our workflow improves productivity for all users and increases the success rate for building a LoRa environmental sensor from 25% to 100%, including for users with zero hardware or C/C++ experience.