Detecting fraud and corruption in public procurement remains a major challenge for governments worldwide. Most research to date builds on domain-knowledge-based corruption risk indicators computed from individual contract-level features, and some studies also analyze contracting network patterns. A critical barrier for supervised machine learning is the absence of confirmed non-corrupt (negative) examples, which makes conventional supervised classification inappropriate for this task. Using publicly available data on federally funded procurement in Mexico and company sanction records, this study implements positive-unlabeled (PU) learning algorithms that integrate domain-knowledge-based red flags with network-derived features to identify likely corrupt and fraudulent contracts. The best-performing PU model captures on average 32 percent more known positives and performs on average 2.3 times better than random guessing, substantially outperforming approaches based solely on traditional red flags. An analysis of SHapley Additive exPlanations (SHAP) values reveals that network-derived features, particularly those associated with contracts in the network core or with suppliers that have high eigenvector centrality, are the most important. Traditional red flags further improve model performance in line with expectations, albeit mainly for contracts awarded through competitive tenders. This methodology can support law enforcement in Mexico and can be adapted to other national contexts.
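The abstract does not state how "better than random guessing" is measured. Without confirmed negatives, precision cannot be computed, so one common reading in PU settings is a lift: the share of known positives captured among the top-ranked contracts, divided by the share a random ranking would be expected to capture. The sketch below illustrates that reading only; the function names, the toy scores, and the choice of cutoff are hypothetical and are not taken from the study.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[float], known_positive: Sequence[int], k: int) -> float:
    """Share of known positives found among the k highest-scored contracts."""
    if len(scores) != len(known_positive):
        raise ValueError("scores and known_positive must have the same length")
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of contracts")
    n_pos = sum(known_positive)
    if n_pos == 0:
        raise ValueError("need at least one known positive")
    # Rank contracts from most to least suspicious according to the model score.
    ranked = sorted(zip(scores, known_positive), key=lambda pair: pair[0], reverse=True)
    captured = sum(label for _, label in ranked[:k])
    return captured / n_pos


def lift_over_random(scores: Sequence[float], known_positive: Sequence[int], k: int) -> float:
    """Recall at k divided by k / n, the expected recall of a random ranking."""
    return recall_at_k(scores, known_positive, k) / (k / len(scores))


# Hypothetical toy data: 1 marks a contract linked to a sanctioned company,
# 0 marks an unlabeled contract (NOT a confirmed clean one).
toy_scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

print(f"recall@3 = {recall_at_k(toy_scores, toy_labels, 3):.2f}")
print(f"lift@3   = {lift_over_random(toy_scores, toy_labels, 3):.2f}")
```

A lift of 1.0 means the ranking is no better than chance. Because unlabeled contracts may include undetected corruption, this measure is best read as a lower bound on how well the ranking concentrates truly corrupt contracts.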