This paper presents a simple method for enhancing text-based pre-trained large language models with speech information when they are fine-tuned for a specific classification task. A classic issue when fusing audio embeddings with text is that the audio sequence is much longer than the text sequence. Our method relies on an existing speech tokenizer trained for Automatic Speech Recognition, which outputs long sequences of tokens from a large vocabulary, making it difficult to integrate at low cost into a large language model. By applying a simple lasso-based feature selection to a multimodal Bag-of-Words representation, we retain only the audio tokens that are most important for the task, and adapt the language model to them with a self-supervised language modeling objective before fine-tuning it on the downstream task. We show that this improves performance compared to a unimodal model, to a larger SpeechLM, or to integrating audio via a learned representation. We demonstrate its effectiveness on Argumentative Fallacy Detection and Classification tasks, where audio was previously believed to be counterproductive, and on affective computing tasks on a widely used dataset. We also provide an in-depth analysis of the method, showing that even a random selection of audio tokens helps enhance the unimodal model. Our code is available [online](https://github.com/salocinc/EACL26SpeechTokFallacy/).
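The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that) and rests on several assumptions: the lasso is solved as an L1-penalized least-squares regression on centered labels using plain coordinate descent, text and audio token counts are simply concatenated into one Bag-of-Words vector, and all function names, the toy data, and the penalty `alpha` are hypothetical.

```python
from collections import Counter


def bag_of_words(token_ids, vocab_size):
    """Count how often each token id occurs in one sequence."""
    counts = Counter(token_ids)
    return [float(counts.get(i, 0)) for i in range(vocab_size)]


def center(values):
    """Subtract the mean, which stands in for fitting an intercept."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def soft_threshold(value, threshold):
    """Shrink toward zero; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(columns, y, alpha, n_sweeps=100):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + alpha*||w||_1.

    `columns` holds the centered feature columns of X; `y` is centered too.
    """
    n = len(y)
    w = [0.0] * len(columns)
    residual = list(y)  # equals y - Xw, since w starts at zero
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            col_sq = sum(x * x for x in col) / n
            if col_sq < 1e-12:  # constant feature: carries no information
                continue
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(x * (r + x * w[j]) for x, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha) / col_sq
            delta = new_w - w[j]
            residual = [r - x * delta for x, r in zip(col, residual)]
            w[j] = new_w
    return w


def select_audio_tokens(text_seqs, audio_seqs, labels,
                        text_vocab, audio_vocab, alpha):
    """Return the audio token ids whose lasso weight is non-zero."""
    # Multimodal Bag-of-Words (assumed layout): text counts, then audio counts.
    rows = [bag_of_words(t, text_vocab) + bag_of_words(a, audio_vocab)
            for t, a in zip(text_seqs, audio_seqs)]
    n_features = text_vocab + audio_vocab
    columns = [center([row[j] for row in rows]) for j in range(n_features)]
    w = lasso(columns, center([float(label) for label in labels]), alpha)
    return [j - text_vocab for j in range(text_vocab, n_features)
            if w[j] != 0.0]


def keep_selected(audio_seq, selected):
    """Drop every audio token that the lasso did not retain."""
    kept = set(selected)
    return [tok for tok in audio_seq if tok in kept]


# Toy data (hypothetical): audio token 2 occurs only in the positive class.
text_seqs = [[0, 1], [1, 1], [0, 1], [1, 1]]
audio_seqs = [[0, 2, 2, 1], [0, 2, 1, 2], [0, 1, 3], [0, 1, 1]]
labels = [1, 1, 0, 0]

selected = select_audio_tokens(text_seqs, audio_seqs, labels,
                               text_vocab=2, audio_vocab=4, alpha=0.2)
print(selected)                                # [2]
print(keep_selected(audio_seqs[0], selected))  # [2, 2]
```

In this toy setting the audio sequence shrinks from four tokens to two, which is the point of the selection: only the retained audio tokens would need new entries in the language model's vocabulary before the adaptation and fine-tuning stages. A larger `alpha` keeps fewer tokens, so it acts as the knob that trades sequence length against audio information.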