Normalization and scaling are fundamental preprocessing steps in time series modeling, yet their role in Transformer-based models remains underexplored from a theoretical perspective. In this work, we present the first formal analysis of how different normalization strategies, specifically instance-based and global scaling, impact the expressivity of Transformer-based architectures for time series representation learning. We propose a novel expressivity framework tailored to time series, which quantifies a model's ability to distinguish between similar and dissimilar inputs in the representation space. Using this framework, we derive theoretical bounds for two widely used normalization methods: Standard and Min-Max scaling. Our analysis reveals that the choice of normalization strategy can significantly influence the model's representational capacity, depending on the task and data characteristics. We complement our theory with empirical validation on classification and forecasting benchmarks using multiple Transformer-based models. Our results show that no single normalization method consistently outperforms others, and in some cases, omitting normalization entirely leads to superior performance. These findings highlight the critical role of preprocessing in time series learning and motivate the need for more principled normalization strategies tailored to specific tasks and datasets.
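To make the distinction concrete, here is a minimal sketch (not from the paper, using NumPy and illustrative function names) of the two scaling methods analyzed, applied either per instance or globally across the dataset:

```python
import numpy as np

def standard_scale(x, axis=None):
    """Standard (z-score) scaling: subtract the mean, divide by the std."""
    mu = x.mean(axis=axis, keepdims=True)
    sigma = x.std(axis=axis, keepdims=True)
    return (x - mu) / (sigma + 1e-8)  # small epsilon guards against zero variance

def minmax_scale(x, axis=None):
    """Min-Max scaling: map values into (approximately) [0, 1]."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    return (x - lo) / (hi - lo + 1e-8)

# Toy batch: 3 univariate series of length 4.
batch = np.array([[ 1.,  2.,  3.,  4.],
                  [10., 20., 30., 40.],
                  [ 0.,  0.,  1.,  1.]])

# Instance-based: statistics computed per series (axis=1),
# so each row is centered/rescaled independently.
instance_std = standard_scale(batch, axis=1)

# Global: a single set of statistics over the whole dataset (axis=None),
# preserving relative magnitudes between series.
global_minmax = minmax_scale(batch, axis=None)
```

Instance-based scaling discards each series' absolute level and scale (which can help forecasting under distribution shift), while global scaling retains cross-series magnitude information; which property aids expressivity depends on the task, mirroring the paper's finding that no single strategy dominates.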