Tensor decomposition methods are popular tools for analyzing multi-way datasets from social media, healthcare, spatio-temporal, and other domains. Widely adopted models such as Tucker and canonical polyadic decomposition (CPD) follow a data-driven philosophy: they decompose a tensor into factors that approximate the observed data well. In some cases, side information about the tensor modes is also available. For example, for a temporal user-item purchase tensor, a user influence graph, an item similarity graph, and knowledge about seasonality or trends in the temporal mode may be available. Such side information may enable more succinct and interpretable tensor decomposition models and improve quality in downstream tasks. We propose a framework for Multi-Dictionary Tensor Decomposition (MDTD), which takes advantage of prior structural information about tensor modes, given in the form of coding dictionaries, to obtain sparsely encoded tensor factors. We derive a general optimization algorithm for MDTD that handles both complete input and input with missing values. Our framework handles the large sparse tensors typical of many real-world application domains. We demonstrate MDTD's utility via experiments on both synthetic and real-world datasets. It learns more concise models than dictionary-free counterparts and improves (i) reconstruction quality ($60\%$ fewer non-zero coefficients coupled with smaller error); (ii) missing-value imputation quality (a two-fold MSE reduction with time savings of up to orders of magnitude); and (iii) the estimation of the tensor rank. MDTD's quality improvements do not come with a running-time premium: it can decompose $19$ GB datasets in less than a minute. It can also impute missing values in sparse billion-entry tensors more accurately and with better scalability than state-of-the-art competitors.
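To make the idea of sparsely encoded tensor factors concrete, the following is a minimal NumPy sketch, not the MDTD algorithm itself. It assumes a CPD-style model in which each mode's factor matrix is the product of a known dictionary and a sparse coefficient matrix; this exact form, the random orthonormal dictionaries, and all names (`cp_reconstruct`, `dictionaries`, `codes`) are illustrative assumptions rather than details taken from the abstract. The sketch only builds a tensor from such factors and compares the number of stored coefficients; it does not fit anything to data.

```python
import numpy as np


def cp_reconstruct(factors):
    """Rebuild a 3-way tensor from CP factors of shapes (I, R), (J, R), (K, R)."""
    A, B, C = factors
    return np.einsum("ir,jr,kr->ijk", A, B, C)


rng = np.random.default_rng(0)
shape, rank = (30, 20, 10), 3

# Hypothetical per-mode dictionaries. A random orthonormal basis stands in for
# a structured one (e.g. a graph-derived or periodic basis); this is an assumption.
dictionaries = [np.linalg.qr(rng.standard_normal((n, n)))[0] for n in shape]

# Sparse codes: every rank-one component uses only 2 dictionary atoms per mode.
codes = []
for n in shape:
    Y = np.zeros((n, rank))
    for r in range(rank):
        atoms = rng.choice(n, size=2, replace=False)
        Y[atoms, r] = rng.standard_normal(2)
    codes.append(Y)

# Each factor is "dictionary times sparse code"; the tensor is their CP product.
factors = [D @ Y for D, Y in zip(dictionaries, codes)]
X = cp_reconstruct(factors)

# Coefficients stored by the sparse encoding versus dense factor matrices.
nnz_codes = sum(int(np.count_nonzero(Y)) for Y in codes)
dense_params = sum(n * rank for n in shape)
print(f"tensor shape: {X.shape}")
print(f"non-zero code coefficients: {nnz_codes}, dense factor entries: {dense_params}")
```

In this toy setting the dictionaries carry the structure, so the same tensor is described by far fewer non-zero coefficients than dense factor matrices would need. Recovering such codes from observed, possibly incomplete data is the part a real solver has to handle and is not shown here.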