Recent advances in multimodal learning have motivated the integration of auxiliary modalities, such as text or vision, into time series (TS) forecasting. However, most existing methods provide limited gains, often improving performance only on specific datasets or relying on architecture-specific designs that limit generalization. In this paper, we show that multimodal models with naive fusion strategies (e.g., simple addition or concatenation) often underperform unimodal TS models, which we attribute to the uncontrolled integration of auxiliary modalities that may introduce irrelevant information. Motivated by this observation, we explore various constrained fusion methods designed to control such integration and find that they consistently outperform naive fusion methods. Furthermore, we propose the Controlled Fusion Adapter (CFA), a simple plug-in method that enables controlled cross-modal interactions without modifying the TS backbone, integrating only relevant textual information aligned with TS dynamics. CFA employs low-rank adapters to filter irrelevant textual information before fusing it into temporal representations. We conduct over 20K experiments across various datasets and TS/text models, demonstrating the effectiveness of the constrained fusion methods. Code is available at: https://github.com/seunghan96/cfa.
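To make the contrast between naive and controlled fusion concrete, the following is a minimal NumPy sketch and not the released CFA implementation. It assumes that TS and text representations share a hidden size `d`, that the adapter is a pair of projections `down` (d × r) and `up` (r × d) with r much smaller than d, and that `up` is zero-initialized as is conventional for low-rank adapters. All function and variable names here are hypothetical.

```python
import numpy as np


def naive_additive_fusion(ts_repr, text_repr):
    """Naive fusion: add the text representation to the TS one, unfiltered."""
    return ts_repr + text_repr


def low_rank_filtered_fusion(ts_repr, text_repr, down, up):
    """Illustrative controlled fusion (an assumption, not the paper's code).

    The text representation is squeezed through a rank-r bottleneck
    (down: d x r, up: r x d) before being added, so at most r directions
    of textual information can reach the temporal representation.
    """
    filtered_text = text_repr @ down @ up
    return ts_repr + filtered_text


rng = np.random.default_rng(0)
seq, d, r = 8, 16, 2  # hypothetical sizes: sequence length, hidden size, rank

ts_repr = rng.standard_normal((seq, d))
text_repr = rng.standard_normal((seq, d))

down = rng.standard_normal((d, r)) * 0.1
up = np.zeros((r, d))  # zero init (assumed): fusion starts as a no-op

fused = low_rank_filtered_fusion(ts_repr, text_repr, down, up)
```

With `up` at zero, the fused output equals the TS representation, so an unmodified backbone would start from its unimodal behavior and admit textual information only as the adapter weights are trained. This property comes from the zero initialization assumed in this sketch, not from anything stated in the abstract.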