The widespread adoption of commercial autonomous vehicles (AVs) and advanced driver assistance systems (ADAS) may largely depend on their acceptance by society, for which their perceived trustworthiness and interpretability to riders are crucial. Achieving such interpretability is challenging because modern autonomous systems software relies heavily on black-box artificial intelligence models. Toward this goal, this paper introduces Rank2Tell, a novel multi-modal ego-centric dataset for Ranking the importance level and Telling the reason for the importance. Using a variety of closed- and open-ended visual question answering, the dataset provides dense annotations of the semantic, spatial, temporal, and relational attributes of important objects in complex traffic scenarios. These dense annotations and unique attributes make the dataset a valuable resource for researchers working on visual scene understanding and related fields. Further, we introduce a model that jointly performs importance level ranking and natural language caption generation to benchmark our dataset, and we demonstrate its performance with quantitative evaluations.