Tensor data analysis allows researchers to uncover novel patterns and relationships that cannot be obtained from matrix data alone. The information inferred from these patterns provides valuable insights into disease progression, bioproduction processes, weather fluctuations, and group dynamics. However, spurious and redundant patterns hamper this process. This work proposes a statistical framework for assessing the probability that patterns in tensor data deviate from null expectations, extending well-established principles for assessing the statistical significance of patterns in matrix data. Binomial testing for false positive discoveries is discussed comprehensively in light of variable dependencies, temporal dependencies and misalignments, and \textit{p}-value corrections under the Benjamini-Hochberg procedure. Results from applying state-of-the-art triclustering algorithms to distinct real-world case studies in the biochemical and biotechnological domains support the validity of the proposed framework while revealing vulnerabilities of some triclustering searches. The proposed assessment can be incorporated into existing triclustering algorithms to mitigate false positive (spurious) discoveries and to further prune the search space, reducing computational complexity. Availability: The code is freely available at https://github.com/JupitersMight/TriSig under the MIT license.
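To make the binomial test and the Benjamini-Hochberg correction named in the abstract concrete, the following is a minimal sketch and not the TriSig implementation. It assumes the simplest null model, in which the items of a pattern occur independently, so the chance that one observation supports the pattern is the product of the item probabilities; the dependency-aware variants discussed in the paper are not modeled here. The function names (`binomial_tail`, `pattern_pvalue`, `benjamini_hochberg`) and all inputs are hypothetical and chosen for illustration only.

```python
from math import comb, prod


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def pattern_pvalue(n_obs, support, item_probs):
    """p-value of a pattern supported by `support` out of `n_obs` observations.

    ASSUMPTION: pattern items are independent, so the probability that a
    single observation exhibits the whole pattern is the product of the
    per-item probabilities.
    """
    p_pattern = prod(item_probs)
    return binomial_tail(n_obs, support, p_pattern)


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank r whose sorted p-value satisfies p_(r) <= (r / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected
```

In this sketch, each candidate tricluster would contribute one call to `pattern_pvalue`, and the resulting list would be passed to `benjamini_hochberg`, which keeps only the patterns that remain significant after correction. A larger support for the same pattern yields a smaller p-value, which is the property that lets such a test prune unpromising candidates during the search.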