Effective e-commerce risk management requires in-depth case investigations to identify emerging fraud patterns in highly adversarial environments. However, manual investigation typically requires analyzing the associations and couplings among multi-source heterogeneous data, a labor-intensive process that limits efficiency. While Large Language Models (LLMs) show promise in automating these analyses, their deployment is hindered by the complexity of risk scenarios and the sparsity of long-tail domain knowledge. To address these challenges, we propose Sherlock, a framework that integrates structured domain knowledge with LLM-based reasoning through three core modules. First, we construct a domain Knowledge Base (KB) by distilling structured expertise from heterogeneous knowledge sources. Second, we design a two-stage retrieval-augmented generation strategy tailored to case investigation, which combines input contextual augmentation with a Reflect & Refine module to fully leverage the KB and improve analysis quality. Finally, we develop an integrated platform for operations and annotation that drives a self-evolving data flywheel. By combining real-time hotfixes through KB updates with periodic logic alignment via post-training, the system evolves continuously to counteract adversarial drift. Online A/B tests at JD.com demonstrate that Sherlock achieves an 82% Expert Acceptance Rate (EAR) and a 386.7% increase in daily investigation throughput. An additional 90-day evaluation shows that the flywheel twice recovers from performance decay caused by shifting fraud tactics, raising the EAR ceiling by around 3.5% through autonomous model updates.
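The two-stage retrieval-augmented strategy described above can be illustrated with a minimal sketch. This is not Sherlock's implementation: the `KBEntry` schema, the word-overlap retriever, the prompt layout, and the `fake_llm` stub are all assumptions introduced for illustration, and a real system would presumably use a proper retriever and an actual LLM call. The sketch only shows the control flow: retrieve knowledge for the raw case, draft an analysis, then retrieve again against the draft and ask the model to revise it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class KBEntry:
    """One piece of structured domain knowledge (hypothetical schema)."""
    topic: str
    guidance: str


def _tokens(text: str) -> set:
    """Lowercased whitespace tokens; a toy stand-in for real text matching."""
    return set(text.lower().split())


def retrieve(kb: List[KBEntry], query: str, k: int = 2) -> List[KBEntry]:
    """Return up to k entries ranked by word overlap with the query."""
    query_tokens = _tokens(query)
    scored = []
    for entry in kb:
        overlap = len(query_tokens & _tokens(entry.topic + " " + entry.guidance))
        if overlap > 0:
            scored.append((overlap, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]


def _format(entries: List[KBEntry]) -> str:
    """Render retrieved entries as bullet lines for a prompt."""
    return "\n".join(f"- {e.topic}: {e.guidance}" for e in entries) or "- (none)"


def investigate(case: str, kb: List[KBEntry],
                llm: Callable[[str], str], k: int = 2) -> str:
    # Stage 1 (assumed form of input contextual augmentation):
    # enrich the raw case with KB entries before the first model call.
    stage1 = retrieve(kb, case, k)
    draft = llm(f"ANALYZE\nKnowledge:\n{_format(stage1)}\nCase:\n{case}")

    # Stage 2 (assumed form of Reflect & Refine):
    # retrieve again using the draft itself, then ask for a revision.
    stage2 = retrieve(kb, draft, k)
    return llm(
        f"REVISE\nKnowledge:\n{_format(stage2)}\nCase:\n{case}\nDraft:\n{draft}"
    )


# Hypothetical KB content and a stub model, for demonstration only.
demo_kb = [
    KBEntry("coupon abuse", "check whether many new accounts share one device"),
    KBEntry("refund fraud", "compare refund claims against logistics records"),
]


def fake_llm(prompt: str) -> str:
    """Deterministic stub standing in for a real LLM call."""
    if prompt.startswith("REVISE"):
        return "refined report"
    return "draft: possible refund fraud"
```

Calling `investigate("many new accounts redeemed one coupon", demo_kb, fake_llm)` issues two model calls. Because the second retrieval is keyed on the draft rather than the original case, it can surface KB entries that the first pass missed, which is one plausible reason to pair retrieval with a reflection step.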