In autonomous driving tasks, scene understanding is the first step towards predicting the future behavior of the surrounding traffic participants. Yet, how to represent a given scene and extract its features are still open research questions. In this study, we propose a novel text-based representation of traffic scenes and process it with a pre-trained language encoder. First, we show that text-based representations, combined with classical rasterized image representations, lead to descriptive scene embeddings. Second, we benchmark our predictions on the nuScenes dataset and show significant improvements compared to baselines. Third, we show in an ablation study that a joint encoder of text and rasterized images outperforms the individual encoders confirming that both representations have their complementary strengths.
翻译:在自动驾驶任务中,场景理解是预测周围交通参与者未来行为的第一步。然而,如何表示给定场景并提取其特征仍是待解决的研究问题。本研究提出一种基于文本的新型交通场景表示方法,并利用预训练语言编码器进行处理。首先,我们证明基于文本的表示与经典栅格化图像表示相结合,能够生成描述性的场景嵌入。其次,我们在nuScenes数据集上对预测结果进行基准测试,结果显示相较于基线模型有显著提升。第三,通过消融研究发现,文本与栅格化图像的联合编码器优于单一编码器,证实这两种表示具有互补优势。