教师引导的伪监督与跨模态对齐用于视听视频解析 (Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing) - 专知论文

会员服务 ·

0

监督 · contrastive · Learning · 移动平均 · 阈值 ·

2025 年 9 月 17 日

Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing

翻译：教师引导的伪监督与跨模态对齐用于视听视频解析

Yaru Chen,Ruohao Guo,Liting Gao,Yang Xiang,Qingyu Luo,Zhenbo Li,Wenwu Wang

Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but neglected stable segment-level supervision and class-aware cross-modal alignment. To address this, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks via adaptive thresholds or top-k selection, offering stable temporal guidance beyond video-level labels; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs, ensuring consistency across modalities while preserving temporal structure. Evaluations on LLP and UnAV-100 datasets shows that our method achieves state-of-the-art (SOTA) performance across multiple metrics.

翻译：弱监督视听视频解析（AVVP）旨在无需时序标注的情况下检测可听、可见及视听事件。先前研究侧重于通过对比学习或协作学习优化全局预测，但忽视了稳定的片段级监督和类感知的跨模态对齐。为此，我们提出两种策略：（1）基于指数移动平均（EMA）引导的伪监督框架，通过自适应阈值或top-k选择生成可靠的片段级掩码，提供超越视频级标签的稳定时序指导；（2）类感知跨模态一致性（CMA）损失，在可靠的片段-类别对上对齐音频与视觉嵌入，确保跨模态一致性的同时保持时序结构。在LLP和UnAV-100数据集上的评估表明，我们的方法在多项指标上均达到了最先进（SOTA）性能。

0

相关内容

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

STRCF for Visual Object Tracking

STRCF for Visual Object Tracking

统计学习与视觉计算组

15+阅读 · 2018年5月29日

Focal Loss for Dense Object Detection

Focal Loss for Dense Object Detection

统计学习与视觉计算组

12+阅读 · 2018年3月15日

IJCAI | Cascade Dynamics Modeling with Attention-based RNN

IJCAI | Cascade Dynamics Modeling with Attention-based RNN

KingsGarden

13+阅读 · 2017年7月16日

Musielak-Orlicz-Sobolev 空间中的迹嵌入及其应用

国家自然科学基金

2+阅读 · 2015年12月31日

基于DASH的交互式三维视频系统建模

国家自然科学基金

1+阅读 · 2015年12月31日

Schr？dinger-Poisson方程守恒DDG方法研究

国家自然科学基金

2+阅读 · 2015年12月31日

动态Gr？bner 基与GVW算法

国家自然科学基金

0+阅读 · 2014年12月31日

海量Web用户生成内容物化关键技术

国家自然科学基金

2+阅读 · 2014年12月31日

Coupled Diffusion Sampling for Training-Free Multi-View Image Editing

Arxiv

0+阅读 · 2025年10月16日

ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization

Arxiv

0+阅读 · 2025年10月15日

A Fast and Lightweight Model for Causal Audio-Visual Speech Separation

Arxiv

0+阅读 · 2025年10月14日

Meta-Transfer Learning for Zero-Shot Super-Resolution

Meta-Transfer Learning for Zero-Shot Super-Resolution

Arxiv

43+阅读 · 2020年2月27日

Generative Adversarial Networks and Probabilistic Graph Models for Hyperspectral Image Classification

Arxiv

11+阅读 · 2018年2月10日

VIP会员

文章信息

相关主题

相关VIP内容

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

具身智能中的心理世界建模：深度综述

《无人机是机动力量吗？无人系统在战场上不断演变的角色》最新报告

【伯克利博士论文】协同语言智能体

美军在联盟作战管理演练中测试人工智能赋能人机协同

相关资讯

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

STRCF for Visual Object Tracking

STRCF for Visual Object Tracking

统计学习与视觉计算组

15+阅读 · 2018年5月29日

Focal Loss for Dense Object Detection

Focal Loss for Dense Object Detection

统计学习与视觉计算组

12+阅读 · 2018年3月15日

IJCAI | Cascade Dynamics Modeling with Attention-based RNN

IJCAI | Cascade Dynamics Modeling with Attention-based RNN

KingsGarden

13+阅读 · 2017年7月16日

相关论文

Coupled Diffusion Sampling for Training-Free Multi-View Image Editing

Arxiv

0+阅读 · 2025年10月16日

ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization

Arxiv

0+阅读 · 2025年10月15日

A Fast and Lightweight Model for Causal Audio-Visual Speech Separation

Arxiv

0+阅读 · 2025年10月14日

Meta-Transfer Learning for Zero-Shot Super-Resolution

Meta-Transfer Learning for Zero-Shot Super-Resolution

Arxiv

43+阅读 · 2020年2月27日

Generative Adversarial Networks and Probabilistic Graph Models for Hyperspectral Image Classification

Arxiv

11+阅读 · 2018年2月10日

相关基金

Musielak-Orlicz-Sobolev 空间中的迹嵌入及其应用

国家自然科学基金

2+阅读 · 2015年12月31日

基于DASH的交互式三维视频系统建模

国家自然科学基金

1+阅读 · 2015年12月31日

Schr？dinger-Poisson方程守恒DDG方法研究

国家自然科学基金

2+阅读 · 2015年12月31日

动态Gr？bner 基与GVW算法

国家自然科学基金

0+阅读 · 2014年12月31日

海量Web用户生成内容物化关键技术

国家自然科学基金

2+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员