Deep learning-based (DL) models in recommender systems (RecSys) have gained significant recognition for their remarkable accuracy in predicting user preferences. However, their performance often lacks a comprehensive evaluation from a human-centric perspective, which encompasses various dimensions beyond simple interest matching. In this work, we develop a robust human-centric evaluation framework that incorporates seven diverse metrics to assess the quality of recommendations generated by five recent open-source DL models. Our evaluation datasets consist of both offline benchmark data and personalized online recommendation feedback collected from 445 real users. We find that (1) different DL models exhibit different strengths and weaknesses across the multi-dimensional metrics we test; (2) users generally want accuracy combined with at least one other human value in their recommendations; and (3) the degree to which different values are combined needs to be carefully tuned to the level users prefer.