Tensor data analysis allows researchers to uncover novel patterns and relationships that cannot be obtained from matrix data alone. The information inferred from these patterns provides valuable insights into disease progression, bioproduction processes, weather fluctuations, and group dynamics. However, spurious and redundant patterns hamper this process. This work proposes a statistical framework to assess the probability that patterns in tensor data deviate from null expectations, extending well-established principles for assessing the statistical significance of patterns in matrix data. A comprehensive discussion of binomial testing for false positive discoveries is provided in light of variable dependencies, temporal dependencies and misalignments, and \textit{p}-value corrections under the Benjamini-Hochberg procedure. Results gathered from the application of state-of-the-art triclustering algorithms to distinct real-world case studies in biochemical and biotechnological domains support the validity of the proposed statistical framework while revealing vulnerabilities of some triclustering searches. The proposed assessment can be incorporated into existing triclustering algorithms to mitigate false positive (spurious) discoveries and to further prune the search space, reducing their computational complexity. Availability: The code is freely available at https://github.com/JupitersMight/TriSig under the MIT license.
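To make the two statistical ingredients named above concrete, the following is a minimal sketch of a binomial tail test followed by Benjamini-Hochberg correction. It is not the TriSig implementation: the function names, the example patterns, and the per-observation null probabilities are hypothetical. It also assumes that the null probability of each pattern is already known and that observations are independent, so it ignores the variable and temporal dependencies the work discusses.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical patterns, each given as (k, n, p):
#   k = observations supporting the pattern,
#   n = total observations,
#   p = assumed null probability that the pattern occurs in one observation.
patterns = [(12, 100, 0.01), (3, 100, 0.02), (30, 100, 0.25)]

pvals = [binomial_tail(k, n, p) for k, n, p in patterns]
adjusted = benjamini_hochberg(pvals)
significant = [a <= 0.05 for a in adjusted]
```

A pattern is flagged as significant here when its adjusted p-value is at most 0.05. The threshold is an illustrative choice and is not taken from the source.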