Understanding or Manipulation: Rethinking Online Performance Gains of Modern Recommender Systems

Recommender systems are expected to act as assistants that help users find relevant information automatically, without explicit queries. As recommender systems evolve, increasingly sophisticated learning techniques have been applied and have achieved better performance on user engagement metrics such as clicks and browsing time. This increase in measured performance, however, has two possible explanations: a better understanding of user preferences, or a more proactive ability to exploit human bounded rationality and induce users to over-consume. A natural follow-up question is whether current recommendation algorithms are manipulating user preferences and, if so, whether the degree of manipulation can be measured. In this paper, we present a general framework for benchmarking the degree of manipulation of recommendation algorithms in both slate recommendation and sequential recommendation scenarios. The framework consists of four stages: initial preference calculation, training data collection, algorithm training and interaction, and metrics calculation, which involves two proposed metrics. We benchmark several representative recommendation algorithms on both synthetic and real-world datasets under the proposed framework. We observe that a high online click-through rate does not necessarily reflect a better understanding of users' initial preferences, but instead ends up prompting users to choose more documents that they did not initially favor. Moreover, we find that the training data has a notable impact on the degree of manipulation, and that algorithms with more powerful modeling abilities are more sensitive to this impact. The experiments also verify the usefulness of the proposed metrics for measuring the degree of manipulation. We advocate that future studies treat recommendation algorithm design as an optimization problem subject to constraints on user preference manipulation.
