Advances in multi-modal large language models (MLLMs) have inspired time series understanding and reasoning tasks that enable natural language querying over time series and produce textual analyses of complex temporal dynamics. Recent attempts hybridize numerical time series with their visualized plots, enabling MLLMs to combine precise value reasoning with visual structure comprehension for comprehensive time series understanding. However, effective numerical-visual modality integration remains challenging due to fine-grained temporal misalignment across modalities and severe entanglement between shared and modality-specific semantics, both of which hinder localized interpretation and complementary reasoning. To address these issues, we propose MADI, a multi-modal LLM enhanced with fine-grained alignment and disentangled interaction, featuring (1) Patch-level Alignment, which enforces physically grounded fine-grained correspondence across heterogeneous modalities; (2) Discrete Disentangled Interaction, which separates modality-common semantics into compact discrete latents and adaptively synergizes the purified modality-unique information; and (3) Critical-token Highlighting, which emphasizes informative, query-relevant signals for robust reasoning. Experiments on synthetic and real-world benchmarks show that MADI consistently outperforms both general-purpose LLMs and time-series-specialized MLLMs.
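To give intuition for the Patch-level Alignment component, a common way to enforce fine-grained cross-modal correspondence is a contrastive (InfoNCE-style) objective that pulls each numerical patch embedding toward its co-located visual patch embedding and pushes it away from other patches. The abstract does not specify MADI's exact loss; the sketch below is an illustrative assumption, with `patch_alignment_loss`, the embedding dimensions, and the temperature `tau` all hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def patch_alignment_loss(ts_patches, vis_patches, tau=0.1):
    """Illustrative InfoNCE-style patch alignment (not MADI's exact loss).

    ts_patches[i] and vis_patches[i] are assumed to describe the same
    time span, so index i is the positive pair; all other visual
    patches act as negatives.
    """
    loss = 0.0
    for i, t in enumerate(ts_patches):
        sims = [cosine(t, v) / tau for v in vis_patches]
        # Numerically stable log-sum-exp over all candidate patches.
        m = max(sims)
        lse = m + math.log(sum(math.exp(s - m) for s in sims))
        # Cross-entropy term: penalize when the positive pair
        # (index i) is not the most similar candidate.
        loss += lse - sims[i]
    return loss / len(ts_patches)
```

With well-aligned embeddings the positive pair dominates the softmax and the loss approaches zero; temporally shifted (misaligned) pairs yield a much larger loss, which is the pressure that drives fine-grained correspondence.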