Anomaly recognition plays a vital role in surveillance, transportation, healthcare, and public safety. However, most existing approaches rely solely on visual data, making them unreliable under challenging conditions such as occlusion, low illumination, and adverse weather. Moreover, the absence of large-scale synchronized audio-visual datasets has hindered progress in multimodal anomaly recognition. To address these limitations, this study presents AVAR-Net, a lightweight and efficient audio-visual anomaly recognition framework designed for real-world environments. AVAR-Net consists of four main modules: an audio feature extractor, a video feature extractor, a fusion strategy, and a sequential pattern learning network that models cross-modal relationships for anomaly recognition. Specifically, the Wav2Vec2 model extracts robust temporal features from raw audio, while MobileViT captures both local and global visual representations from video frames. An early fusion mechanism combines these modalities, and a Multi-Stage Temporal Convolutional Network (MTCN) then learns long-range temporal dependencies within the fused representation, enabling robust spatiotemporal reasoning. A novel Visual-Audio Anomaly Recognition (VAAR) dataset is also introduced, serving as a medium-scale benchmark of 3,000 real-world videos with synchronized audio spanning ten diverse anomaly classes. Experimental evaluations demonstrate that AVAR-Net achieves 89.29% accuracy on VAAR and 88.56% Average Precision on the XD-Violence dataset, improving Average Precision by 2.8% over existing state-of-the-art methods. These results highlight the effectiveness, efficiency, and generalization capability of the proposed framework, as well as the utility of VAAR as a benchmark for advancing multimodal anomaly recognition research.
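To make the described pipeline concrete, the following is a minimal PyTorch sketch of the flow outlined above, not the authors' implementation: the checkpoint names (facebook/wav2vec2-base, apple/mobilevit-small), the interpolation-based audio-to-frame alignment, and the MTCN stage and layer sizes are all illustrative assumptions; the paper's actual fusion and MTCN details may differ.

```python
# Minimal sketch of the AVAR-Net pipeline described in the abstract.
# Assumptions (not from the paper): checkpoint names, feature dimensions,
# the audio/video alignment scheme, and the MTCN layer sizes are illustrative.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, MobileViTModel


class MultiStageTCN(nn.Module):
    """Generic multi-stage temporal convolutional network (a stand-in for the
    paper's MTCN): each stage refines the previous stage's per-frame
    predictions with dilated 1-D convolutions over the fused sequence."""
    def __init__(self, in_dim, hidden=64, num_classes=10, stages=3, layers=5):
        super().__init__()

        def make_stage(d_in):
            blocks = [nn.Conv1d(d_in, hidden, kernel_size=1)]
            for l in range(layers):  # exponentially growing receptive field
                blocks += [nn.Conv1d(hidden, hidden, kernel_size=3,
                                     padding=2 ** l, dilation=2 ** l),
                           nn.ReLU()]
            blocks.append(nn.Conv1d(hidden, num_classes, kernel_size=1))
            return nn.Sequential(*blocks)

        self.stages = nn.ModuleList(
            [make_stage(in_dim)] +
            [make_stage(num_classes) for _ in range(stages - 1)])

    def forward(self, x):              # x: (batch, in_dim, T)
        out = self.stages[0](x)
        for stage in self.stages[1:]:  # later stages refine earlier outputs
            out = stage(torch.softmax(out, dim=1))
        return out                     # (batch, num_classes, T)


class AVARNetSketch(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.audio = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.video = MobileViTModel.from_pretrained("apple/mobilevit-small")
        a_dim = self.audio.config.hidden_size            # 768
        v_dim = self.video.config.neck_hidden_sizes[-1]  # 640 for mobilevit-small
        self.head = MultiStageTCN(a_dim + v_dim, num_classes=num_classes)

    def forward(self, waveform, frames):
        # waveform: (batch, samples) raw 16 kHz audio (normalized, e.g. with
        #           Wav2Vec2FeatureExtractor); frames: (batch, T, 3, 256, 256)
        b, t = frames.shape[:2]
        a = self.audio(waveform).last_hidden_state       # (b, T_audio, 768)
        # Align the audio sequence to the T video frames by linear
        # interpolation over time (an assumption, not the paper's method).
        a = torch.nn.functional.interpolate(
            a.transpose(1, 2), size=t, mode="linear")    # (b, 768, T)
        v = self.video(frames.flatten(0, 1)).last_hidden_state  # (b*T, 640, h, w)
        v = v.mean(dim=(2, 3)).view(b, t, -1).transpose(1, 2)   # (b, 640, T)
        fused = torch.cat([a, v], dim=1)                 # early fusion: concat
        return self.head(fused)                          # per-frame class logits
```

In this sketch, early fusion simply concatenates the per-frame audio and visual feature channels before any temporal modeling, so the dilated convolutions can attend to cross-modal patterns jointly; the multi-stage refinement (each stage re-predicting from the previous stage's softmaxed output) follows the standard MS-TCN design.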