Ensemble methods are frequently used in recommender systems to improve accuracy by combining multiple models. Recent work reports sizable performance gains, but most studies still optimize primarily for accuracy and robustness rather than for energy efficiency. This paper measures the accuracy-energy trade-offs of ensemble techniques relative to strong single models. We run 93 controlled experiments in two pipelines: (1) explicit rating prediction with Surprise (RMSE) and (2) implicit-feedback ranking with LensKit (NDCG@10). We evaluate four datasets ranging from 100,000 to 7.8 million interactions (MovieLens 100K, MovieLens 1M, ModCloth, Anime). We compare four ensemble strategies (Average, Weighted, Stacking or Rank Fusion, Top Performers) against baselines and optimized single models. Whole-system energy is measured with EMERS using a smart plug and converted to CO2 equivalents. Across settings, ensembles improve accuracy by 0.3% to 5.7% while increasing energy consumption by 19% to 2,549%. On MovieLens 1M, a Top Performers ensemble improves RMSE by 0.96% at an 18.8% energy overhead over SVD++. On MovieLens 100K, an averaging ensemble improves NDCG@10 by 5.7% with 103% additional energy. On Anime, a Surprise Top Performers ensemble improves RMSE by 1.2% but consumes 2,005% more energy (0.21 vs. 0.01 Wh), increasing emissions from 2.6 to 53.8 mg CO2 equivalents, and LensKit ensembles fail due to memory limits. Overall, selective ensembles are more energy-efficient than exhaustive averaging,
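The relative figures quoted above can be reproduced approximately with simple arithmetic. The following is a minimal sketch, not the authors' analysis code: the helper names `relative_change_pct` and `energy_per_accuracy_point` are hypothetical, and the "energy overhead per percent of accuracy gained" ratio is an illustrative assumption rather than a metric taken from the paper. The inputs are the rounded values reported in the abstract, so the Anime energy overhead comes out near 2,000% rather than the reported 2,005%, which was presumably computed from unrounded measurements.

```python
def relative_change_pct(baseline: float, candidate: float) -> float:
    """Percentage change of `candidate` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def energy_per_accuracy_point(energy_overhead_pct: float,
                              accuracy_gain_pct: float) -> float:
    """Extra energy (in %) spent per 1% of accuracy gained.

    Hypothetical summary ratio used here for illustration only.
    """
    if accuracy_gain_pct <= 0:
        raise ValueError("accuracy gain must be positive")
    return energy_overhead_pct / accuracy_gain_pct


# Rounded Anime values from the abstract (Surprise, Top Performers vs. single model).
anime_energy_overhead = relative_change_pct(0.01, 0.21)    # energy in Wh
anime_emission_overhead = relative_change_pct(2.6, 53.8)   # emissions in mg CO2e

# (accuracy gain %, energy overhead %) pairs as reported in the abstract.
reported = {
    "MovieLens 1M, Top Performers (RMSE)": (0.96, 18.8),
    "MovieLens 100K, Average (NDCG@10)": (5.7, 103.0),
    "Anime, Top Performers (RMSE)": (1.2, 2005.0),
}

costs = {
    name: energy_per_accuracy_point(energy, gain)
    for name, (gain, energy) in reported.items()
}

if __name__ == "__main__":
    print(f"Anime energy overhead:   {anime_energy_overhead:.0f}%")
    print(f"Anime emission overhead: {anime_emission_overhead:.0f}%")
    for name, cost in costs.items():
        print(f"{name}: {cost:.1f}% extra energy per 1% accuracy gain")
```

Putting gain and overhead on a single scale in this way makes the contrast between the MovieLens and Anime results easier to read, but the ratio should be treated as a reading aid only, since it mixes RMSE and NDCG@10 improvements.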