Training on multiple modalities of input can augment the capabilities of a language model. Here, we ask whether such a training regime can also improve the quality and efficiency of these systems. We focus on text--audio and introduce WhisBERT, which is inspired by the text--image approach of FLAVA \citep{singh_flava_2022}. In accordance with the BabyLM \citep{warstadt2023papers} guidelines, we pretrain WhisBERT on a dataset comprising only 100 million words plus their corresponding speech from the word-aligned version of the People's Speech dataset \citep{galvez_peoples_2021}. To assess the impact of multimodality, we compare a version of the model trained on text only with a version trained on audio and text simultaneously. We find that while WhisBERT performs well on multimodal masked modeling and surpasses the BabyLM baselines on most benchmark tasks, it struggles to optimize its complex objective and to outperform its text-only WhisBERT baseline.