Handling long-tail corner cases is a major challenge for autonomous vehicles (AVs). Large language models (LLMs) hold great potential for handling such corner cases thanks to their excellent generalization and explanation capabilities, and their application to autonomous driving is attracting increasing research interest. However, technical barriers remain, such as strict model performance requirements and the huge computing resources that LLMs demand. In this paper, we investigate a new approach that applies remote or edge LLMs to support autonomous driving. A key issue for such an LLM-assisted driving system is assessing the LLMs' understanding of driving theory and skills, to ensure they are qualified to undertake safety-critical driving assistance tasks for connected and autonomous vehicles (CAVs). We design and run driving theory tests for several proprietary LLMs (OpenAI GPT models, Baidu Ernie and Ali QWen) and open-source LLMs (Tsinghua MiniCPM-2B and MiniCPM-Llama3-V2.5) using more than 500 multiple-choice theory test questions. Model accuracy, cost and processing latency are measured in the experiments. The results show that GPT-4 passes the test with improved domain knowledge and Ernie reaches an accuracy of 85% (just below the 86% passing threshold), while the other LLMs, including GPT-3.5, fail the test. For the test questions with images, the multimodal model GPT-4o achieves an excellent accuracy of 96%, and MiniCPM-Llama3-V2.5 achieves 76%. While GPT-4 holds stronger potential for CAV driving assistance applications, it is much more expensive to use, costing almost 50 times as much as GPT-3.5. The results can inform decisions on the use of existing LLMs for CAV applications and on balancing model performance against cost.
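As an illustration of the pass/fail criterion mentioned above, the following is a minimal sketch of how a multiple-choice theory test could be scored against the 86% threshold. The function names, the answer format (single option letters), and the treatment of the threshold as inclusive are assumptions made for this example; they are not taken from the paper's evaluation code, and the answers shown are made-up toy data rather than experimental results.

```python
from typing import Sequence

# Passing threshold quoted in the abstract, as a whole-number percentage.
PASS_THRESHOLD_PERCENT = 86


def count_correct(predicted: Sequence[str], answer_key: Sequence[str]) -> int:
    """Count questions where the model's option letter matches the key.

    Hypothetical helper: assumes one option letter per question.
    """
    if len(predicted) != len(answer_key):
        raise ValueError("predicted and answer_key must have the same length")
    return sum(
        p.strip().upper() == k.strip().upper()
        for p, k in zip(predicted, answer_key)
    )


def passes(correct: int, total: int,
           threshold_percent: int = PASS_THRESHOLD_PERCENT) -> bool:
    """Return True if correct/total meets the threshold.

    Assumption: the threshold is inclusive (exactly 86% passes).
    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return correct * 100 >= threshold_percent * total


if __name__ == "__main__":
    # Toy data for illustration only, not results from the paper.
    key = ["A", "C", "B", "D", "A"]
    model_answers = ["a", "C", "B", "B", "A"]
    n_correct = count_correct(model_answers, key)
    print(f"{n_correct}/{len(key)} correct, pass={passes(n_correct, len(key))}")
```

Under these assumptions, a model at 85% accuracy fails and one at exactly 86% passes, which matches the abstract's description of Ernie as just below the threshold.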