The design of modern recommender systems relies on understanding which parts of the feature space are relevant to a given recommendation task. However, real-world data sets in this domain are often large, sparse, and noisy, which makes it challenging to identify meaningful signals. Feature ranking is an efficient family of algorithms that helps address these challenges by identifying the most informative features and facilitating the automated search for more compact and better-performing models (AutoML). We introduce OutRank, a system for versatile feature ranking and data quality-related anomaly detection. OutRank was built with categorical data in mind and uses a variant of mutual information that is normalized with regard to the noise produced by features of the same cardinality. We further extend this measure by incorporating information on feature similarity and combined relevance. We demonstrate the feasibility of the approach by speeding up a state-of-the-art AutoML system on a synthetic data set with no loss in performance. Furthermore, on a real-life click-through-rate prediction data set, OutRank outperformed strong baselines such as random forest-based approaches. The proposed approach allows exploration of up to 300% larger feature spaces compared to AutoML-only approaches, enabling faster search for better models on off-the-shelf hardware.
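To make the idea of cardinality-aware normalization concrete, the following is a minimal sketch, not OutRank's actual implementation. The abstract does not give the exact formula, so the sketch assumes one plausible reading: the raw mutual information of a feature is corrected by subtracting the average mutual information obtained from shuffled copies of that feature, which have the same cardinality but no relation to the target. The function names `mutual_information` and `noise_adjusted_mi`, and the parameters `n_noise` and `seed`, are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), expressed with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def noise_adjusted_mi(xs, ys, n_noise=10, seed=0):
    """Raw MI minus the mean MI of shuffled copies of the feature.

    Assumption: shuffling keeps the feature's cardinality and value counts
    but destroys any relation to the target, so the shuffled copies act as
    'noise features of the same cardinality'.
    """
    rng = random.Random(seed)
    baseline = 0.0
    for _ in range(n_noise):
        noise = list(xs)
        rng.shuffle(noise)
        baseline += mutual_information(noise, ys)
    baseline /= n_noise
    return mutual_information(xs, ys) - baseline


# Illustrative synthetic data (hypothetical, not from the paper).
rng = random.Random(42)
labels = [rng.randint(0, 1) for _ in range(2000)]
# Binary feature that agrees with the label about 90% of the time.
informative = [y if rng.random() < 0.9 else 1 - y for y in labels]
# Pure noise with high cardinality: raw MI is inflated by sparsity alone.
high_card_noise = [rng.randrange(500) for _ in labels]

raw_noise_score = mutual_information(high_card_noise, labels)
adjusted_noise_score = noise_adjusted_mi(high_card_noise, labels)
adjusted_informative_score = noise_adjusted_mi(informative, labels)
```

In this sketch the high-cardinality noise feature receives a clearly positive raw score purely because of its many sparse categories, whereas its adjusted score falls close to zero and the informative feature keeps a substantial score. This is the kind of bias a cardinality-aware normalization is meant to remove when ranking categorical features.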