Self-supervised speech pre-training methods have developed rapidly in recent years and have proven highly effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing still suffers from the scarcity of labeled multichannel data and from complex ambient noise. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose AV-wav2vec2, a multichannel multi-modal speech self-supervised learning framework that takes video and multichannel audio data as inputs. First, we propose a multi-path structure that processes multichannel audio streams and a visual stream in parallel, with intra- and inter-channel contrastive losses as training targets to fully exploit the spatiotemporal information in multichannel speech data. Second, building on contrastive learning, we jointly train with additional single-channel audio data to improve the quality of the learned speech representations. Finally, we validate the effectiveness of the proposed method on a Chinese multichannel multi-modal dataset recorded in real-world scenarios, covering audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR) and audio-visual speaker diarization (AVSD) tasks.
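To make the intra- and inter-channel contrastive objectives concrete, the following is a minimal NumPy sketch of an InfoNCE-style loss over multichannel features. It is an illustration only: the helper `info_nce`, the random features, and the negative-sampling scheme (other time steps of the same channel) are assumptions for this sketch, not the paper's actual implementation, which would operate on wav2vec 2.0-style contextual outputs and quantized targets.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor frame (sketch)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    # Similarity of anchor to the positive and to each negative target.
    logits = np.array([cos(anchor, positive)]
                      + [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0] + 1e-12)

# Toy stand-ins for contextual features and quantized targets:
# shape (channels, time, dim). Real features would come from the encoder.
rng = np.random.default_rng(0)
C, T, D = 4, 10, 8
context = rng.normal(size=(C, T, D))
targets = rng.normal(size=(C, T, D))

c, t = 0, 3
# Intra-channel: positive is the target of the SAME channel at the same
# time step; negatives are targets at other time steps of that channel.
neg_intra = [targets[c, j] for j in range(T) if j != t]
loss_intra = info_nce(context[c, t], targets[c, t], neg_intra)

# Inter-channel: positive is the target of ANOTHER channel at the same
# time step, encouraging cross-channel consistency; negatives again come
# from other time steps.
c2 = 1
neg_inter = [targets[c2, j] for j in range(T) if j != t]
loss_inter = info_nce(context[c, t], targets[c2, t], neg_inter)

total = loss_intra + loss_inter
```

Summing the two terms over all masked frames and channel pairs would give one plausible form of the combined training target; the actual weighting between the two losses is a design choice not specified here.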