Data-driven recommendation models are built on massive amounts of data. Heterogeneity is inherent to data at this scale and is widespread in real-world recommendation systems, where it reflects differences in properties among sub-populations. Ignoring heterogeneity in recommendation data can limit model performance, hurt robustness across sub-populations, and leave models misled by biases. However, data heterogeneity has not attracted substantial attention in the recommendation community. This motivates us to explore and exploit heterogeneity, both to address these problems and to assist data analysis. In this work, we focus on two representative categories of heterogeneity in recommendation data: heterogeneity of the prediction mechanism and heterogeneity of the covariate distribution. We propose an algorithm that explores both through a bilevel clustering method. We then exploit the uncovered heterogeneity for two purposes in recommendation scenarios: prediction with multiple sub-models and support for debiasing. Extensive experiments on real-world data validate the existence of heterogeneity in recommendation data and the effectiveness of exploring and exploiting it.