We present ViP-VL, an efficient Vietnamese Self-supervised speech Pretraining model leveraging Vector-quantization Learning. To bridge the gap between high-resolution audio input and efficient processing, ViP-VL incorporates Acoustic Stacking and Receptive Field Alignment, which together enable a synchronized 8x subsampling rate within the ChunkFormer architecture. A specialized Mask Selection Strategy, applied during pretraining with the BEST-RQ framework, further improves the robustness of the learned representations. Pretrained on 17,000 hours of unlabeled Vietnamese speech, our model establishes new state-of-the-art results across four major downstream tasks: Automatic Speech Recognition, Speech Emotion Recognition, Dialect Classification, and Speaker Verification. To facilitate future research and the development of high-performance Vietnamese speech technologies, we publicly release our pretrained weights and implementation at github.com/khanld/chunkformer.
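As a rough illustration of the two ingredients named above, the sketch below stacks consecutive acoustic frames by a factor of 8 and derives discrete pretraining targets with a frozen random-projection quantizer in the style of BEST-RQ. This is a minimal sketch and not the released ViP-VL implementation: the names `stack_frames` and `RandomProjectionQuantizer`, the 80-dimensional input features, the codebook size, and the code dimension are all assumptions chosen for illustration, and treating the stacking factor as equal to the encoder's 8x subsampling rate is one plausible reading of the abstract.

```python
import numpy as np


def stack_frames(features, factor=8):
    """Concatenate every `factor` consecutive frames: (T, D) -> (T // factor, factor * D)."""
    usable = (features.shape[0] // factor) * factor  # drop trailing frames that do not fill a group
    return features[:usable].reshape(usable // factor, factor * features.shape[1])


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


class RandomProjectionQuantizer:
    """Frozen random projection plus frozen random codebook; nothing here is trained."""

    def __init__(self, input_dim, code_dim=16, codebook_size=8192, seed=0):
        # Illustrative sizes (assumptions), fixed by the seed so targets are reproducible.
        rng = np.random.default_rng(seed)
        self.projection = rng.standard_normal((input_dim, code_dim))
        self.codebook = l2_normalize(rng.standard_normal((codebook_size, code_dim)))

    def __call__(self, stacked):
        projected = l2_normalize(stacked @ self.projection)
        # For unit-length vectors, the highest cosine similarity is the nearest codebook entry.
        return np.argmax(projected @ self.codebook.T, axis=-1)


# Hypothetical input: 100 frames of 80-dimensional features.
features = np.random.default_rng(1).standard_normal((100, 80))
stacked = stack_frames(features, factor=8)
quantizer = RandomProjectionQuantizer(input_dim=stacked.shape[1])
targets = quantizer(stacked)  # one integer label per stacked frame
```

During pretraining, an encoder operating at the same reduced frame rate would be asked to predict these labels at masked positions; which positions are masked is what a mask selection strategy decides, and that part is not sketched here.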