Since their advent, Multimodal Large Language Models (MLLMs) have made a significant impact across a wide range of real-world applications, particularly in Autonomous Driving (AD). Their ability to process complex visual data and reason about intricate driving scenarios has paved the way for a new paradigm in end-to-end AD systems. However, progress on end-to-end models for AD has been slow, as existing fine-tuning methods demand substantial resources, including extensive computational power, large-scale datasets, and significant funding. Drawing inspiration from recent advances in inference-time computation, we propose OpenEMMA, an open-source end-to-end framework based on MLLMs. By incorporating the Chain-of-Thought reasoning process, OpenEMMA achieves significant improvements over the baseline when leveraging a diverse range of MLLMs. Furthermore, OpenEMMA demonstrates effectiveness, generalizability, and robustness across a variety of challenging driving scenarios, offering a more efficient and effective approach to autonomous driving. We release all code at https://github.com/taco-group/OpenEMMA.