Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval


Xiang Fang, Daizong Liu, Pan Zhou, Yuchong Hu

From arXiv; accepted by IEEE Transactions on Multimedia.

As an increasingly popular task in multimedia information retrieval, video moment retrieval (VMR) aims to localize the target moment in an untrimmed video according to a given language query. Most previous methods depend heavily on large numbers of manual annotations (i.e., moment boundaries), which are extremely expensive to acquire in practice. In addition, because of the domain gap between datasets, directly applying these pre-trained models to an unseen domain leads to a significant performance drop. In this paper, we focus on a novel task: cross-domain VMR, where fully annotated datasets are available in one domain (the "source domain"), but the domain of interest (the "target domain") contains only unannotated datasets. As far as we know, this is the first study on cross-domain VMR. To address this new task, we propose a novel Multi-Modal Cross-Domain Alignment (MMCDA) network that transfers annotation knowledge from the source domain to the target domain. However, because of the domain discrepancy between the source and target domains and the semantic gap between videos and queries, directly applying trained models to the target domain generally leads to a performance drop. To solve this problem, we develop three novel modules: (i) a domain alignment module that aligns the feature distributions of each modality across domains; (ii) a cross-modal alignment module that maps both video and query features into a joint embedding space and aligns the feature distributions between modalities in the target domain; (iii) a specific alignment module that obtains the fine-grained similarity between a specific frame and the given query for optimal localization. By jointly training these three modules, MMCDA can learn domain-invariant and semantically aligned cross-modal representations.
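To make two of the ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how feature distributions are aligned or how frame-query similarity is computed, so this example assumes a linear-kernel maximum mean discrepancy (the squared distance between mean features) as a stand-in for a domain alignment loss, and cosine similarity as a stand-in for the fine-grained frame-query score. All function names (`linear_mmd`, `cosine`, `best_frame`) and the toy vectors are hypothetical.

```python
import math


def mean_vector(features):
    """Average a list of equal-length feature vectors."""
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def linear_mmd(source, target):
    """Squared distance between the mean source and mean target features.

    Assumption: a linear-kernel maximum mean discrepancy stands in for the
    domain alignment objective; the abstract does not name a specific loss.
    A value of 0 means the two domains share the same mean feature.
    """
    mu_s, mu_t = mean_vector(source), mean_vector(target)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def best_frame(frames, query):
    """Index of the frame whose feature is most similar to the query feature.

    Assumption: frame and query features already live in a joint embedding
    space, as the cross-modal alignment module is described as providing.
    """
    scores = [cosine(frame, query) for frame in frames]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy features (hypothetical values, two dimensions for readability).
source_video = [[1.0, 0.0], [3.0, 2.0]]   # annotated source-domain features
target_video = [[0.0, 1.0], [2.0, 3.0]]   # unannotated target-domain features
print(linear_mmd(source_video, target_video))

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 2.0]
print(best_frame(frames, query))
```

In a trainable version of this sketch, a quantity like `linear_mmd` would be minimized alongside the localization loss so that source and target features become harder to tell apart, while the per-frame scores would guide which frames anchor the predicted moment. How MMCDA actually weights and combines its three modules is not specified in the abstract.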
