Scaling Multimodal Large Language Models (MLLMs) to long-form speech is bottlenecked by the explosive growth of input tokens. Unlike images or videos, audio lacks overlapping information, making extreme 1-token compression highly susceptible to losing fine-grained acoustic cues. To overcome this, we propose FastSLM, a token-efficient architecture featuring the Hierarchical Temporal Abstractor (HTA). HTA progressively distills non-overlapping acoustic features across multiple temporal scales, achieving an extreme compression rate of 1.67 tokens per second (a 97% reduction) without losing critical context. Experimental results show that FastSLM achieves performance competitive with state-of-the-art models on long-form benchmarks while using significantly fewer FLOPs and parameters. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3.
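To make the token budget concrete, the following is a minimal arithmetic sketch, not part of the FastSLM code. Only the 1.67 tokens-per-second rate and the 97% figure come from the text above. The baseline rate of 50 tokens per second is an assumption, chosen because it is consistent with the stated reduction, and the helper names are hypothetical.

```python
# Back-of-the-envelope token budget for long-form speech.
# COMPRESSED_RATE is the rate stated in the abstract.
# BASELINE_RATE is an ASSUMPTION (not given in the abstract), picked
# because it is consistent with the stated ~97% reduction.

COMPRESSED_RATE = 1.67   # tokens per second (from the abstract)
BASELINE_RATE = 50.0     # tokens per second (assumed, hypothetical)


def tokens_for_audio(duration_s: float, tokens_per_s: float) -> int:
    """Approximate number of audio tokens for a clip of the given length."""
    return round(duration_s * tokens_per_s)


def reduction_ratio(compressed_rate: float, baseline_rate: float) -> float:
    """Fraction of tokens removed relative to the baseline rate."""
    return 1.0 - compressed_rate / baseline_rate


one_hour_s = 3600
baseline_tokens = tokens_for_audio(one_hour_s, BASELINE_RATE)
compressed_tokens = tokens_for_audio(one_hour_s, COMPRESSED_RATE)
ratio = reduction_ratio(COMPRESSED_RATE, BASELINE_RATE)

print(f"baseline tokens for 1 h:   {baseline_tokens}")
print(f"compressed tokens for 1 h: {compressed_tokens}")
print(f"reduction: {ratio:.1%}")
```

Under this assumed baseline, one hour of audio drops from roughly 180,000 tokens to roughly 6,000, which illustrates why the compression rate matters for fitting long recordings into an LLM context window.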