Large Language Models (LLMs) are increasingly deployed for personalized product recommendations, with practitioners commonly assuming that longer user purchase histories lead to better predictions. We challenge this assumption through a systematic benchmark of four state-of-the-art LLMs (GPT-4o-mini, DeepSeek-V3, Qwen2.5-72B, and Gemini 2.5 Flash) across context lengths ranging from 5 to 50 items on the REGEN dataset. Surprisingly, our experiments with 50 users in a within-subject design reveal no significant quality improvement with increased context length: quality scores remain flat across all conditions (0.17--0.23). Our findings have significant practical implications: practitioners can reduce inference costs by approximately 88\% by using short contexts (5--10 items) instead of long histories (50 items), without sacrificing recommendation quality. We also analyze latency patterns across providers and find model-specific behaviors that inform deployment decisions. This work challenges the prevailing ``more context is better'' paradigm and provides actionable guidelines for cost-effective LLM-based recommendation systems.
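As a rough sanity check on the reported savings (our assumption, not stated above: per-request inference cost scales approximately linearly with the number of history items in the prompt), truncating a 50-item history to the 5--10 item range yields
\[
1 - \tfrac{10}{50} = 80\% \quad\text{to}\quad 1 - \tfrac{5}{50} = 90\%,
\]
with the quoted $\approx 88\%$ figure corresponding to roughly six retained items ($1 - 6/50 = 0.88$).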