Large Language Models (LLMs) have demonstrated outstanding potential for tasks such as text generation, summarization, and classification. Given that such models are trained on a vast amount of online knowledge, we hypothesize that LLMs can assess whether driving scenarios generated by autonomous driving testing techniques are realistic, i.e., aligned with real-world driving conditions. To test this hypothesis, we conducted an empirical evaluation of whether LLMs are effective and robust in performing this task. This reality check is an important step towards devising LLM-based autonomous driving testing techniques. For our empirical evaluation, we selected 64 realistic scenarios from \deepscenario, an open driving scenario dataset. Next, by introducing minor changes to them, we created 512 additional realistic scenarios, forming an overall dataset of 576 scenarios. With this dataset, we evaluated three LLMs (\gpt, \llama, and \mistral) for their robustness in assessing the realism of driving scenarios. Our results show that: (1) overall, \gpt achieved the highest robustness compared to \llama and \mistral, consistently across almost all scenarios, roads, and weather conditions; (2) \mistral consistently performed the worst; (3) \llama achieved good results under certain conditions; and (4) roads and weather conditions do influence the robustness of the LLMs.