This paper explores biases in ChatGPT-based recommender systems, focusing on provider fairness (item-side fairness). Through extensive experiments involving over a thousand API calls, we investigate the impact of prompt design strategies (including structure, system role, and intent) on evaluation metrics such as provider fairness, catalog coverage, temporal stability, and recency. The first experiment examines these strategies in classical top-K recommendation, while the second evaluates sequential in-context learning (ICL). In the first experiment, we assess seven distinct prompt scenarios for top-K recommendation accuracy and fairness. Accuracy-oriented prompts, such as Simple and Chain-of-Thought (CoT), outperform diversification prompts, which, despite enhancing temporal freshness, reduce accuracy by up to 50%. Embedding fairness into the system role, e.g., "act as a fair recommender," proves more effective than placing fairness directives within the prompt itself. Diversification prompts lead to recommending newer movies and yield a broader genre distribution than traditional collaborative filtering (CF) models. The second experiment explores sequential ICL, comparing zero-shot and few-shot ICL. Results indicate that including user demographic information in prompts affects the model's biases and stereotypes. However, ICL did not consistently improve item fairness and catalog coverage over zero-shot learning: zero-shot learning achieved higher NDCG and coverage, while ICL-2 showed a slight improvement in hit rate (HR) when age-group context was included. Our study provides insights into the biases of RecLLMs, particularly regarding provider fairness and catalog coverage. By examining prompt design, learning strategies, and system roles, we highlight both the potential and the challenges of integrating LLMs into recommender systems. Further details can be found at https://github.com/yasdel/Benchmark_RecLLM_Fairness.
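Two of the evaluation metrics named above, catalog coverage and NDCG, can be sketched as follows. This is an illustrative, minimal implementation using binary relevance, not the paper's actual evaluation code; the function names and signatures are assumptions for demonstration.

```python
import math

def catalog_coverage(recommendation_lists, catalog_size):
    """Fraction of the item catalog appearing in at least one top-K list."""
    recommended = set()
    for rec_list in recommendation_lists:
        recommended.update(rec_list)
    return len(recommended) / catalog_size

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG@K: DCG of the ranked list over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Toy example: two users' top-2 lists over a 4-item catalog.
print(catalog_coverage([["a", "b"], ["b", "c"]], catalog_size=4))  # 0.75
```

Higher-accuracy prompts tend to raise NDCG@K, while diversification prompts tend to raise catalog coverage, which is the trade-off the experiments above measure.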