Offline evaluation of recommender systems is often affected by hidden, under-documented choices in data preparation. Seemingly minor decisions about filtering, repeat handling, cold-start treatment, and splitting-strategy design can substantially reorder model rankings and undermine reproducibility and cross-paper comparability. In this paper, we introduce SplitLight, an open-source exploratory toolkit that makes these decisions measurable, comparable, and reportable for researchers and practitioners who design preprocessing and splitting pipelines or review external artifacts. Given an interaction log and its derived split subsets, SplitLight computes core and temporal dataset statistics, characterizes repeat-consumption patterns and timestamp anomalies, and diagnoses split validity, including temporal leakage, cold-user/item exposure, and distribution shifts. SplitLight further supports side-by-side comparison of alternative splitting strategies through aggregated summaries and interactive visualizations. Delivered as both a Python toolkit and an interactive no-code interface, SplitLight produces audit summaries that justify evaluation protocols and support transparent, reliable, and comparable experimentation in recommender systems research and industry.
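To make the split-validity diagnostics concrete, the sketch below illustrates the kind of checks the abstract describes (temporal leakage and cold-user/item exposure) over a pandas interaction log. This is a minimal illustration of the underlying technique, not SplitLight's actual API; the function name and the `user_id`/`item_id`/`timestamp` column names are assumptions for this example.

```python
# Illustrative sketch of split-validity checks (not SplitLight's API).
# Assumes train/test DataFrames with "user_id", "item_id", "timestamp" columns.
import pandas as pd

def audit_split(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Report temporal leakage and cold-user/item exposure for a train/test split."""
    # Temporal leakage (global-timeline check): fraction of test interactions
    # that are not strictly later than every training interaction.
    train_max_ts = train["timestamp"].max()
    leakage_rate = (test["timestamp"] <= train_max_ts).mean()

    # Cold-start exposure: fraction of test interactions whose user or item
    # never appears in the training set.
    cold_user_rate = (~test["user_id"].isin(train["user_id"])).mean()
    cold_item_rate = (~test["item_id"].isin(train["item_id"])).mean()

    return {
        "temporal_leakage_rate": float(leakage_rate),
        "cold_user_rate": float(cold_user_rate),
        "cold_item_rate": float(cold_item_rate),
    }
```

Under these assumptions, `audit_split(train_df, test_df)` returns rates in [0, 1]; a nonzero `temporal_leakage_rate` flags test interactions that overlap the training window, and the cold rates quantify how much of the test set involves unseen users or items.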