Integration of audio perception into large language models (LLMs) is an emerging research area for enabling machine listening applications, yet efficient transfer of rich audio semantics from audio encoders to LLMs remains underexplored. The most widely used integration paradigm projects audio-encoder output tokens into the LLM input space (e.g., via an MLP or a Q-Former) and then prepends or inserts them into the text token sequence. We refer to this generic scheme as Prepend to the LLM's input token space (PLITS) integration. We propose an efficient alternative, Lightweight Audio LLM Integration (LAL). LAL injects audio representations solely through the attention mechanism at selected LLM layers, bypassing the feed-forward module. It encodes rich audio semantics at an appropriate level of abstraction for integration into different transformer blocks, substantially reducing computational overhead compared to existing approaches. We further introduce PAL, a hybrid integration approach for efficiently Probing Audio encoders via LLM. PAL applies PLITS only to a compact set of summary tokens while integrating the full audio token sequence via LAL. Under an identical training curriculum, LAL consistently matches or outperforms existing integration approaches across multiple base LLMs and tasks, with improvements of up to 30% over a strong PLITS baseline, while reducing memory usage by about 60% and increasing throughput by about 190%. Moreover, PAL matches or exceeds PLITS performance while offering substantially better computational and memory efficiency.
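The LAL description above can be read as a single transformer block in which audio tokens supply extra keys and values for attention but never act as queries and never pass through the feed-forward module. The following is a minimal NumPy sketch of that reading, not the authors' implementation. The function name `lal_block`, the parameter names, the single attention head, the omission of layer normalization, and the assumption that audio features are already projected to the LLM width are all illustrative choices that the abstract does not specify.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lal_block(text, audio, params):
    """Hypothetical LAL-style block (illustrative reading of the abstract).

    text:  (T, d) text hidden states
    audio: (A, d) audio features, assumed already projected to width d
    Returns updated text hidden states of shape (T, d).
    """
    T, A = text.shape[0], audio.shape[0]

    # Queries come from text tokens only.
    q = text @ params["Wq"]

    # Keys and values come from audio and text together.
    kv_in = np.concatenate([audio, text], axis=0)
    k = kv_in @ params["Wk"]
    v = kv_in @ params["Wv"]

    scores = q @ k.T / np.sqrt(q.shape[-1])  # shape (T, A + T)

    # Assumed masking: every text token sees all audio tokens,
    # and text-to-text attention stays causal.
    mask = np.concatenate(
        [np.ones((T, A), dtype=bool), np.tril(np.ones((T, T), dtype=bool))],
        axis=1,
    )
    scores = np.where(mask, scores, -np.inf)

    # Residual attention update for the text stream.
    h = text + (softmax(scores) @ v) @ params["Wo"]

    # The feed-forward module runs on the T text tokens only;
    # audio tokens bypass it entirely.
    return h + np.maximum(h @ params["W1"], 0.0) @ params["W2"]

# Toy dimensions chosen only for illustration.
rng = np.random.default_rng(0)
d, hidden, T, A = 8, 16, 4, 6
params = {
    "Wq": rng.normal(scale=0.1, size=(d, d)),
    "Wk": rng.normal(scale=0.1, size=(d, d)),
    "Wv": rng.normal(scale=0.1, size=(d, d)),
    "Wo": rng.normal(scale=0.1, size=(d, d)),
    "W1": rng.normal(scale=0.1, size=(d, hidden)),
    "W2": rng.normal(scale=0.1, size=(hidden, d)),
}
text = rng.normal(size=(T, d))
audio = rng.normal(size=(A, d))
out = lal_block(text, audio, params)
print(out.shape)
```

In this sketch the block's output has T rows, whereas a PLITS-style block would carry all T + A tokens through both attention and the feed-forward module. That difference in processed sequence length is one plausible source of the memory and throughput gains the abstract reports, though the abstract itself does not break the savings down.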