Analyzing high-dimensional timeline data and identifying outliers and anomalies are critical tasks across diverse domains, including sensor readings, biological and medical data, historical records, and global statistics. Conventional analysis techniques, however, often struggle with high dimensionality, complex distributions, and sparsity, which makes it difficult to extract meaningful insights from complex temporal datasets and to identify trending features, outliers, and anomalies effectively. Inspired by surprisability, a cognitive science concept describing how humans instinctively focus on unexpected deviations, we propose Learning via Surprisability (LvS), a novel approach for transforming high-dimensional timeline data. LvS quantifies and prioritizes anomalies in time-series data by formalizing deviations from expected behavior. By bridging cognitive theories of attention with computational methods, it enables the detection of anomalies and shifts while preserving critical context, offering a new lens for interpreting complex datasets. We demonstrate LvS on three high-dimensional timeline use cases: a time series of sensor data, a global dataset of mortality causes over multiple years, and a textual corpus containing over two centuries of State of the Union Addresses by U.S. presidents. Our results show that the LvS transformation enables efficient and interpretable identification of outliers, anomalies, and the most variable features along the timeline.
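To make the idea of "formalizing deviations from expected behavior" concrete, the following is a minimal illustrative sketch and not the LvS algorithm itself, whose details the abstract does not specify. It assumes that each time step is a non-negative feature vector, that "expected behavior" can be approximated by the mean feature distribution over the whole timeline, and that deviation is measured by pointwise Kullback-Leibler terms. The function names `surprisal_scores` and `rank_time_steps` are hypothetical.

```python
import numpy as np


def surprisal_scores(X, eps=1e-12):
    """Per-feature deviation of each time step from an assumed baseline.

    X has shape (T, D): T time steps, D non-negative features.
    Returns a (T, D) array of pointwise KL terms p * log(p / q).
    """
    X = np.asarray(X, dtype=float)
    # Normalize every time step into a distribution over features.
    P = X / (X.sum(axis=1, keepdims=True) + eps)
    # Assumption: the expected profile q is the mean distribution
    # over the timeline (other baselines are equally plausible).
    q = P.mean(axis=0)
    return P * np.log((P + eps) / (q + eps))


def rank_time_steps(X):
    """Order time steps from most to least deviating.

    Returns (order, step_scores); step_scores[t] is the KL divergence
    of time step t from the assumed baseline.
    """
    S = surprisal_scores(X)
    step_scores = S.sum(axis=1)
    order = np.argsort(step_scores)[::-1]
    return order, step_scores


# Toy timeline: five identical steps and one step with a spike
# in the third feature.
X_demo = np.array([[10, 10, 10]] * 5 + [[10, 10, 40]])
order_demo, scores_demo = rank_time_steps(X_demo)
```

In this toy timeline the spiked step receives the largest score, and within that step the largest term in `surprisal_scores` belongs to the spiked feature. This is the kind of per-feature, per-time-step ranking that the abstract describes only at a high level.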