Music streaming fraud, in which bad actors artificially inflate stream counts to manipulate chart rankings and royalty payments, poses a significant threat to streaming services and legitimate content creators. Traditional fraud detection approaches struggle with a critical challenge: many legitimate edge cases, such as super-fans and sleep-music sessions, exhibit activity patterns that closely mimic coordinated fraud. We present SAGE, a counterfactual-aware negative harvesting approach that combines SimHash-based stratified sampling with a modular gating ensemble to identify confident negatives in unlabeled data. The ensemble employs pluggable statistical gates (currently instantiated with Mahalanobis distance and k-NN density) and a configurable voting threshold, which lets practitioners adjust the precision-recall trade-off. Floor-constrained sampling guarantees that rare behavioral cohorts are represented among the harvested negatives, which addresses the representation bias problem in Positive-Unlabeled (PU) learning. Evaluation demonstrates strong precision and recall on held-out data. The core methodology applies without modification to both customer-level and artist-level fraud and performs strongly on each, indicating that the approach generalizes across fraud detection domains.
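To make the two components named in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation, and nearly every detail is an assumption made for illustration: the function names, the sign-random-projection variant of SimHash, proportional allocation with a per-bucket floor, the use of mean k-NN distance as an inverse-density proxy, the fixed thresholds, and the convention that a gate votes "negative" when a point lies far from the known positive (fraud) examples.

```python
import numpy as np

def simhash_buckets(X, n_bits=8, seed=0):
    """Bucket rows by the sign pattern of random projections (a SimHash variant)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ planes) >= 0
    weights = 1 << np.arange(n_bits)          # bit i contributes 2**i
    return bits.astype(np.int64) @ weights

def floor_constrained_sample(buckets, n_total, floor, seed=0):
    """Stratified sample: proportional quota per bucket, but never fewer than
    `floor` rows (or the whole bucket if it is smaller). Because of the floor,
    the result can contain more than n_total rows."""
    rng = np.random.default_rng(seed)
    chosen = []
    ids, counts = np.unique(buckets, return_counts=True)
    for b, c in zip(ids, counts):
        members = np.flatnonzero(buckets == b)
        quota = max(floor, int(round(n_total * c / len(buckets))))
        quota = min(quota, int(c))
        chosen.append(rng.choice(members, size=quota, replace=False))
    return np.sort(np.concatenate(chosen))

def make_mahalanobis_gate(positives, threshold):
    """Gate votes True when a row is far from the positive distribution."""
    mu = positives.mean(axis=0)
    cov = np.cov(positives, rowvar=False) + 1e-6 * np.eye(positives.shape[1])
    inv_cov = np.linalg.inv(cov)
    def gate(X):
        d = X - mu
        return np.sqrt(np.einsum("ij,jk,ik->i", d, inv_cov, d)) > threshold
    return gate

def make_knn_gate(positives, k, threshold):
    """Gate votes True when the mean distance to the k nearest positives is
    large, i.e. the row sits in a region where positives are sparse."""
    def gate(X):
        dists = np.linalg.norm(X[:, None, :] - positives[None, :, :], axis=2)
        return np.sort(dists, axis=1)[:, :k].mean(axis=1) > threshold
    return gate

def harvest_negatives(X, gates, min_votes):
    """Mark rows as confident negatives when at least min_votes gates agree."""
    votes = np.sum([g(X) for g in gates], axis=0)
    return votes >= min_votes
```

In this sketch, raising `min_votes` toward the number of gates makes harvesting stricter (fewer, more confident negatives), while lowering it admits more candidates. A typical flow would bucket the unlabeled rows with `simhash_buckets`, draw candidates with `floor_constrained_sample`, and then keep only the candidates that `harvest_negatives` accepts.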