Mortgage lending in the United States exhibits persistent racial and gender disparities. We investigate whether standard data preprocessing steps, specifically attribute binning, amplify these disparities in downstream pattern mining. Using 103,481 cleaned mortgage applications from the HMDA 2023 dataset (Chicago metropolitan area), we build a three-stage pipeline: (1) a PySpark data cleaning and binning pipeline that implements both standard equal-frequency binning and the epsilon-biased fair binning algorithm from Asudeh et al. [1], (2) FP-Growth association rule mining that compares denial patterns under both binning regimes, and (3) K-Means clustering with a per-cluster disparate impact audit. Our standard binning shows 9.63% racial bias in income discretization, consistent with the 8-10% reported in prior work. Fair binning with seven race groups is infeasible at epsilon=0.03 and only succeeds at epsilon=0.08 with a Price of Fairness of 29.4%. FP-Growth reveals that high debt-to-income ratio is the dominant denial predictor (67.2% confidence, 2.81 lift), while racial bias does not appear as explicit high-support rules. However, K-Means clustering followed by a disparate impact audit flags 10 out of 45 cluster-group pairs, showing that Black applicants face significantly higher denial rates than White applicants even among financially similar groups.
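The per-cluster disparate impact audit in stage (3) can be illustrated with a minimal sketch. It is not the pipeline's actual implementation. It assumes that a cluster-group pair is flagged when the group's approval rate falls below 0.8 times the reference group's approval rate within the same cluster (the common four-fifths rule). The function name `disparate_impact_audit`, the 0.8 threshold, the choice of "White" as the reference group, and the toy records are all hypothetical.

```python
from collections import defaultdict


def disparate_impact_audit(records, reference_group="White", threshold=0.8):
    """Flag (cluster, group) pairs whose approval-rate ratio against the
    reference group in the same cluster falls below `threshold`.

    records: iterable of (cluster_id, group, denied) tuples, denied is a bool.
    The four-fifths threshold is an assumption, not taken from the paper.
    """
    # (cluster, group) -> [approved_count, total_count]
    counts = defaultdict(lambda: [0, 0])
    for cluster, group, denied in records:
        cell = counts[(cluster, group)]
        cell[0] += 0 if denied else 1
        cell[1] += 1

    flagged = []
    for (cluster, group), (approved, total) in sorted(counts.items()):
        if group == reference_group:
            continue
        ref = counts.get((cluster, reference_group))
        # Skip pairs with no reference applicants or a zero reference approval rate.
        if ref is None or ref[0] == 0:
            continue
        ratio = (approved / total) / (ref[0] / ref[1])
        if ratio < threshold:
            flagged.append((cluster, group, round(ratio, 3)))
    return flagged


def make_rows(cluster, group, n_total, n_denied):
    """Build toy applicant records for one cluster-group cell."""
    return [(cluster, group, i < n_denied) for i in range(n_total)]


# Toy data (hypothetical): two clusters, two groups.
toy = (
    make_rows(0, "White", 10, 2) + make_rows(0, "Black", 10, 5)
    + make_rows(1, "White", 10, 2) + make_rows(1, "Black", 10, 3)
)
flagged = disparate_impact_audit(toy)
print(flagged)
```

In this toy data only the cluster 0 pair is flagged: the approval-rate ratio is 0.5 / 0.8 in cluster 0 and 0.7 / 0.8 in cluster 1, and only the former falls below the assumed threshold.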