The widespread adoption of commercial autonomous vehicles (AVs) and advanced driver assistance systems (ADAS) may largely depend on their acceptance by society, for which their perceived trustworthiness and interpretability to riders are crucial. Achieving this is challenging because modern autonomous systems software relies heavily on black-box artificial intelligence models. Toward this goal, this paper introduces Rank2Tell, a novel multi-modal ego-centric dataset for Ranking the importance level of objects and Telling the reason for that importance. Using both closed- and open-ended visual question answering, the dataset provides dense annotations of the semantic, spatial, temporal, and relational attributes of important objects in complex traffic scenarios. These dense annotations and unique attributes make the dataset a valuable resource for researchers working on visual scene understanding and related fields. Furthermore, we introduce a model that jointly performs importance level ranking and natural language caption generation to benchmark our dataset, and we demonstrate its performance with quantitative evaluations.