Video Question Answering (VidQA) has strong potential to support advanced machine reasoning in Intelligent Traffic Monitoring and Intelligent Transportation Systems. However, integrating urban traffic scene knowledge into VidQA systems has received little attention in prior work. We present Traffic-domain Video Question Answering with Automatic Captioning (TRIVIA), a weak-supervision technique for infusing traffic-domain knowledge into large video-language models. On the SUTD-TrafficQA task, TRIVIA raises the accuracy of representative video-language models by 6.5 points (19.88%) over baseline settings. These results suggest that TRIVIA is a promising route for applying emerging video-language models to traffic-related applications.
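The reported gain combines an absolute figure (6.5 points) with a percentage (19.88%). As a rough sanity check, and assuming the percentage is the relative improvement over the baseline accuracy (an interpretation, not something the abstract states outright), the two numbers together imply a baseline of roughly 32.7%. The function name below is hypothetical and used only for illustration.

```python
def implied_baseline(abs_gain_points: float, rel_gain_percent: float) -> float:
    """Back out the baseline accuracy implied by an absolute and a relative gain.

    Assumes rel_gain_percent = 100 * abs_gain_points / baseline,
    which is our reading of the abstract, not a stated definition.
    """
    return abs_gain_points / (rel_gain_percent / 100.0)


abs_gain = 6.5    # accuracy points, from the abstract
rel_gain = 19.88  # percent, from the abstract

baseline = implied_baseline(abs_gain, rel_gain)
improved = baseline + abs_gain

print(f"implied baseline accuracy: {baseline:.2f}%")
print(f"implied improved accuracy: {improved:.2f}%")
```

These implied values are derived only from the two numbers quoted above; the paper's result tables remain the authoritative source for per-model accuracies.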