Time-series anomaly detection (TSAD) requires identifying both immediate point anomalies and long-range context anomalies. However, existing foundation models face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack a global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks caused by missing temporal alignment and coarse-grained pointwise detection. To resolve this dilemma, we propose VETime, the first TSAD framework that unifies the temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. VETime introduces a Reversible Image Conversion and a Patch-Level Temporal Alignment module to establish a shared visual-temporal timeline, preserving discriminative details while maintaining temporal sensitivity. Furthermore, we design an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion to adaptively integrate the complementary perceptual strengths of both modalities. Extensive experiments demonstrate that VETime significantly outperforms state-of-the-art models in zero-shot scenarios, achieving superior localization precision with lower computational overhead than current vision-based approaches. Code is available at: https://github.com/yyyangcoder/VETime.
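The abstract does not detail how the Reversible Image Conversion works, but its defining property is a lossless round trip between a 1D series and its 2D form, so that every time step keeps a recoverable position on the shared visual-temporal timeline. A minimal sketch of such an invertible mapping, assuming a simple period-based folding with tail padding (the period choice and padding scheme here are illustrative assumptions, not the paper's method):

```python
# Illustrative sketch only: fold a 1D series into a (rows, period) grid so the
# mapping is exactly invertible. VETime's actual conversion is not specified
# in the abstract; this only demonstrates the "reversible" property.
import numpy as np

def series_to_image(x: np.ndarray, period: int):
    """Fold a 1D series row-major into a (rows, period) grid.

    Returns the grid and the original length so the inverse can drop padding.
    """
    n = len(x)
    rows = -(-n // period)                    # ceil division
    padded = np.full(rows * period, np.nan)   # pad the tail with NaN
    padded[:n] = x
    return padded.reshape(rows, period), n

def image_to_series(img: np.ndarray, n: int) -> np.ndarray:
    """Invert the folding exactly: flatten row-major and discard the padding."""
    return img.reshape(-1)[:n]

x = np.arange(10, dtype=float)
img, n = series_to_image(x, period=4)             # 3x4 grid, 2 padded cells
assert np.array_equal(image_to_series(img, n), x)  # round trip is lossless
```

Because no information is discarded in either direction, pointwise anomaly evidence found in the 2D view can be mapped back to exact time steps, which is the alignment property the abstract emphasizes.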