AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset

Anomaly recognition plays a vital role in surveillance, transportation, healthcare, and public safety. However, most existing approaches rely solely on visual data, which makes them unreliable under challenging conditions such as occlusion, low illumination, and adverse weather. Moreover, the absence of large-scale synchronized audio-visual datasets has hindered progress in multimodal anomaly recognition. To address these limitations, this study presents AVAR-Net, a lightweight and efficient audio-visual anomaly recognition framework designed for real-world environments. AVAR-Net consists of four main modules: an audio feature extractor, a video feature extractor, a fusion strategy, and a sequential pattern learning network that models cross-modal relationships for anomaly recognition. Specifically, the Wav2Vec2 model extracts robust temporal features from raw audio, while MobileViT captures both local and global visual representations from video frames. An early fusion mechanism combines these modalities, and a Multi-Stage Temporal Convolutional Network (MTCN) learns long-range temporal dependencies within the fused representation, enabling robust spatiotemporal reasoning. A novel Visual-Audio Anomaly Recognition (VAAR) dataset is also introduced, serving as a medium-scale benchmark containing 3,000 real-world videos with synchronized audio across ten diverse anomaly classes. Experimental evaluations demonstrate that AVAR-Net achieves 89.29% accuracy on VAAR and 88.56% Average Precision on the XD-Violence dataset, improving Average Precision by 2.8% over existing state-of-the-art methods. These results highlight the effectiveness, efficiency, and generalization capability of the proposed framework, as well as the utility of VAAR as a benchmark for advancing multimodal anomaly recognition research.
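
To make the fusion and temporal-modeling steps concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "early fusion" means concatenating time-aligned audio and video feature vectors at each step, and it stands in for the MTCN with a generic stack of causal dilated temporal convolutions, a common TCN building block; the abstract does not specify the MTCN's actual design. The function names, the shared kernel weights, and the dilation schedule are all hypothetical, and the toy vectors stand in for Wav2Vec2 and MobileViT outputs.

```python
from typing import List

Vector = List[float]
Sequence = List[Vector]


def early_fusion(audio: Sequence, video: Sequence) -> Sequence:
    """Concatenate time-aligned audio and video features at every step.

    Assumption: both streams were already resampled to the same number
    of time steps; the abstract does not describe the alignment.
    """
    if len(audio) != len(video):
        raise ValueError("audio and video need the same number of time steps")
    return [a + v for a, v in zip(audio, video)]


def dilated_temporal_conv(seq: Sequence, kernel: List[float], dilation: int) -> Sequence:
    """Causal 1D convolution over time, applied per feature channel.

    kernel[k] weights the frame k * dilation steps in the past; frames
    before the start of the sequence count as zeros. The same
    (hypothetical) kernel is shared by all channels.
    """
    dim = len(seq[0])
    out: Sequence = []
    for t in range(len(seq)):
        step = [0.0] * dim
        for k, weight in enumerate(kernel):
            src = t - k * dilation
            if src >= 0:
                for c in range(dim):
                    step[c] += weight * seq[src][c]
        out.append(step)
    return out


def stacked_temporal_stages(seq: Sequence, kernel: List[float], dilations: List[int]) -> Sequence:
    """Apply one dilated convolution per stage; larger dilations widen
    the temporal context each output frame can see."""
    for dilation in dilations:
        seq = dilated_temporal_conv(seq, kernel, dilation)
    return seq


# Toy features: 4 time steps, 1 audio channel and 2 video channels.
audio_feats = [[1.0], [2.0], [3.0], [4.0]]
video_feats = [[0.5, 0.0], [0.5, 1.0], [0.5, 0.0], [0.5, 1.0]]

fused = early_fusion(audio_feats, video_feats)
context = stacked_temporal_stages(fused, kernel=[1.0, 1.0], dilations=[1, 2])
```

With a kernel of length 2 and dilations 1 and 2, each output frame in `context` depends on the current frame and the three preceding ones, which illustrates how stacking stages extends temporal reach without a longer kernel. A real model would use learned per-channel weights, nonlinearities, and a classification head on top.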
