In real-world domains such as self-driving, generalization to rare scenarios remains a fundamental challenge. To address this, we introduce a new dataset for end-to-end driving that focuses on long-tail driving events. We provide multi-view video data, trajectories, high-level instructions, and detailed reasoning traces, facilitating in-context learning and few-shot generalization. The resulting benchmark for multimodal models, such as vision-language models (VLMs) and vision-language-action models (VLAs), goes beyond safety and comfort metrics by also evaluating instruction following and semantic coherence between model outputs. The multilingual reasoning traces, in English, Spanish, and Chinese, come from domain experts with diverse cultural backgrounds. Our dataset is therefore a unique resource for studying how different forms of reasoning affect driving competence. It is available at: https://hf.co/datasets/kit-mrt/kitscenes-longtail