Recognizing a traffic accident is an essential part of any autonomous driving or road monitoring system. An accident can appear in a wide variety of forms, and understanding what type of accident is taking place may help prevent it from recurring. This work focuses on classifying traffic scenes into specific accident types. We approach the problem by representing a traffic scene as a graph, where objects such as cars are represented as nodes, and the relative distances and directions between them as edges. This representation of a traffic scene is referred to as a scene graph, and can be used as input for an accident classifier. Better results are obtained with a classifier that fuses the scene graph input with visual and textual representations. This work introduces a multi-stage, multimodal pipeline that pre-processes videos of traffic accidents, encodes them as scene graphs, and aligns this representation with the vision and language modalities before performing classification. When trained on four classes, our method achieves a balanced accuracy of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, an increase of nearly 5 percentage points over the case where scene graph information is not taken into account.
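The scene-graph encoding described above can be illustrated with a minimal sketch: detected objects become nodes, and every ordered pair of objects gets an edge annotated with their relative distance and direction. This is only an illustration under assumed conventions (the `SceneObject` fields and edge attribute names are hypothetical, not the paper's actual encoding):

```python
import math
from dataclasses import dataclass

@dataclass
class SceneObject:
    obj_id: int
    label: str   # e.g. "car", "pedestrian" (assumed label set)
    x: float     # position in some common frame (image or world)
    y: float

def build_scene_graph(objects):
    """Encode a traffic scene as a graph: objects are nodes,
    pairwise relative distance and direction are edge attributes."""
    nodes = {o.obj_id: o.label for o in objects}
    edges = {}
    for a in objects:
        for b in objects:
            if a.obj_id == b.obj_id:
                continue  # no self-loops
            dx, dy = b.x - a.x, b.y - a.y
            edges[(a.obj_id, b.obj_id)] = {
                "distance": math.hypot(dx, dy),
                "direction": math.atan2(dy, dx),  # angle from a to b, radians
            }
    return nodes, edges

# Example: two cars 5 units apart
objs = [SceneObject(0, "car", 0.0, 0.0), SceneObject(1, "car", 3.0, 4.0)]
nodes, edges = build_scene_graph(objs)
```

A graph in this form can then be fed to a graph encoder and fused with the vision and language representations before the final classification head.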