Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistical-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective at distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasingly important role, especially on short-video platforms, yet most pre-trained multimodal architectures primarily focus on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pre-trained visual-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into VL models. This system has been deployed in production, leading to significant business gains.
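To illustrate the intuition behind kNN-based Latent Space Broadening, the sketch below shows one plausible realization: an uncertainty-based acquisition step followed by a kNN expansion in the model's latent (embedding) space, so that semantically similar neighbors of uncertain seeds, including potential overconfident misclassifications, are also surfaced for annotation. The function name `lsb_select` and the parameters `seed_budget` and `k` are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of kNN-based latent space broadening (LSB) for active learning.
# Assumptions (illustrative, not from the abstract): embeddings come from the
# model's penultimate layer; `seed_budget` and `k` are hypothetical knobs.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def lsb_select(embeddings: np.ndarray,
               uncertainty: np.ndarray,
               seed_budget: int = 100,
               k: int = 5) -> np.ndarray:
    """Return indices of unlabeled samples to send for annotation.

    1. Pick the `seed_budget` most uncertain samples (standard AL acquisition).
    2. Broaden the pool with each seed's k nearest neighbors in latent space.
    """
    # Step 1: seed set from an uncertainty-based acquisition function.
    seeds = np.argsort(-uncertainty)[:seed_budget]

    # Step 2: kNN expansion in the latent (embedding) space.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, neighbor_idx = nn.kneighbors(embeddings[seeds])  # (seed_budget, k+1)

    # Union of seeds and their neighbors; np.unique also drops self-matches.
    return np.unique(np.concatenate([seeds, neighbor_idx.ravel()]))


# Example usage with hypothetical model outputs:
# broadened = lsb_select(latent_embs, pred_entropy, seed_budget=200, k=10)
```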