Testing Large Language Models on Driving Theory Knowledge and Skills for Connected Autonomous Vehicles

Handling long-tail corner cases is a major challenge for autonomous vehicles (AVs). Large language models (LLMs) hold great potential for handling such corner cases thanks to their strong generalization and explanation capabilities, and they have received increasing research interest for autonomous driving applications. However, technical barriers remain, such as the strict performance requirements placed on the models and the huge computing resources that LLMs demand. In this paper, we investigate a new approach that applies remote or edge LLMs to support autonomous driving. A key issue for such an LLM-assisted driving system is assessing how well LLMs understand driving theory and skills, to ensure they are qualified to undertake safety-critical driving assistance tasks for connected autonomous vehicles (CAVs). We design and run driving theory tests for several proprietary LLMs (OpenAI GPT models, Baidu Ernie and Ali QWen) and open-source LLMs (Tsinghua MiniCPM-2B and MiniCPM-Llama3-V2.5) using more than 500 multiple-choice theory test questions. Model accuracy, cost and processing latency are measured in the experiments. The results show that GPT-4 passes the test with improved domain knowledge and Ernie reaches an accuracy of 85% (just below the 86% passing threshold), while the other LLMs, including GPT-3.5, fail the test. For the test questions with images, the multimodal model GPT-4o achieves an excellent accuracy of 96%, and MiniCPM-Llama3-V2.5 achieves 76%. Although GPT-4 holds stronger potential for CAV driving assistance applications, it is much more expensive to use, at almost 50 times the cost of GPT-3.5. The results can inform decisions on the use of existing LLMs for CAV applications and on balancing model performance against cost.
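As a rough illustration of the pass/fail criterion mentioned above, the following Python sketch scores a set of multiple-choice answers and compares the accuracy with the 86% passing threshold. It is not the authors' evaluation code: the `Question` class, the `score` and `passes` helpers, and the sample questions are hypothetical names and data introduced only for this example, and only the 86% threshold comes from the abstract.

```python
from dataclasses import dataclass

# Passing threshold quoted in the abstract (86%).
PASS_THRESHOLD = 0.86


@dataclass
class Question:
    """Hypothetical container for one multiple-choice theory question."""
    prompt: str
    options: dict  # option key -> option text, e.g. {"A": "...", "B": "..."}
    answer: str    # key of the correct option


def score(questions, predictions):
    """Return the fraction of questions whose predicted key matches the answer."""
    if not questions:
        raise ValueError("at least one question is required")
    if len(questions) != len(predictions):
        raise ValueError("one prediction is required per question")
    correct = sum(
        q.answer == p.strip().upper() for q, p in zip(questions, predictions)
    )
    return correct / len(questions)


def passes(accuracy, threshold=PASS_THRESHOLD):
    """A model passes when its accuracy reaches the threshold."""
    return accuracy >= threshold


# Made-up sample data, for illustration only.
sample = [
    Question("Sample question 1", {"A": "first", "B": "second"}, "A"),
    Question("Sample question 2", {"A": "first", "B": "second"}, "B"),
    Question("Sample question 3", {"A": "first", "B": "second"}, "A"),
]
sample_accuracy = score(sample, ["a", "B ", "B"])  # two of three correct
```

Under this criterion, an accuracy of 0.85 (the figure reported for Ernie) gives `passes(0.85) == False`, while 0.96 (reported for GPT-4o on image questions) gives `True`. How the paper maps raw model output to an option key is not stated in the abstract, so the simple strip-and-uppercase normalization here is an assumption.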

