This paper addresses the problem of on-road object importance estimation, which takes video sequences captured from the driver's perspective as input. Although this problem is significant for safer and smarter driving systems, its exploration remains limited. On one hand, publicly available large-scale datasets are scarce in the community. To address this dilemma, this paper contributes a new large-scale dataset named Traffic Object Importance (TOI). On the other hand, existing methods often consider only bottom-up features or single-fold guidance, leading to limitations in handling highly dynamic and diverse traffic scenarios. Different from existing methods, this paper proposes a model that integrates multi-fold top-down guidance with bottom-up features. Specifically, three kinds of top-down guidance factors (i.e., driver intention, semantic context, and traffic rules) are integrated into our model. These factors are important for object importance estimation, yet no existing method considers them simultaneously. To our knowledge, this paper proposes the first on-road object importance estimation model that fuses multi-fold top-down guidance factors with bottom-up features. Extensive experiments demonstrate that our model outperforms state-of-the-art methods by large margins, achieving a 23.1% Average Precision (AP) improvement over the recently proposed model (i.e., Goal).