Machine Learning Enhanced Multi-Factor Quantitative Trading: A Cross-Sectional Portfolio Optimization Approach with Bias Correction

arXiv preprint. 18 pages, 12 tables, 6 figures. Code at https://github.com/initial-d/ml-quant-trading. Changes in v2: rewritten abstract; added deflated Sharpe ratio analysis (DSR = 0.978); expanded limitations section; corrected factor count from 500-1000 to 213.

Rolling-window factor pipelines for Chinese A-share markets contain a subtle but costly flaw: daily price-move limits (±10% on the main board, ±20% on STAR/ChiNext) render a fraction of closing prices non-executable, yet standard implementations ingest these values before any row filtering runs. The contaminated aggregates propagate silently through moving averages, correlations, and ranks, a failure mode we term "upstream contamination". On real A-share data it inflates the apparent information coefficient by 18% while reducing realised Sharpe by 0.44 points, because the model learns to predict returns it cannot trade. We resolve this with a mask-first design: a Boolean tradability mask is constructed at data-load time and threaded through every operator, so that no window ever reads a non-tradable price. Built on this foundation, the system adds (i) a GPU-vectorised 213-factor engine built on PyTorch unfold primitives (51x speedup over pandas); (ii) an Adjusted-MSE loss that penalises wrong-sign predictions 11x more heavily than magnitude errors; (iii) block-bootstrap geometric Brownian motion (GBM) augmentation; and (iv) Markowitz-Ledoit-Wolf portfolio optimisation with cvxpy warm-start caching. On a calibrated 3,000-stock synthetic panel the system achieves an annualised Sharpe of 2.05; on proprietary real A-share data (2022-2024) it achieves a Sharpe of 1.63. Ablation shows that the mask contract is the single largest contributor (+0.44), exceeding any model or loss choice. The full implementation is released under the MIT licence at https://github.com/initial-d/ml-quant-trading.
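The mask-first idea can be illustrated with a minimal sketch. This is not the released implementation: it uses NumPy's `sliding_window_view` as a stand-in for the PyTorch unfold primitives mentioned in the abstract, and the function names `tradability_mask` and `masked_rolling_mean`, the `tol` parameter, and the limit rule (a close counts as non-tradable when the absolute daily return reaches the limit) are simplifying assumptions made here for illustration. Real exchange rules round limit prices to the tick size and differ by board.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def tradability_mask(close, limit=0.10, tol=1e-6):
    """Hypothetical mask rule: True where the close did not hit the daily limit.

    close: array of shape (T, N), one column per stock.
    The first row has no previous close and is treated as tradable.
    """
    close = np.asarray(close, dtype=float)
    ret = np.zeros_like(close)
    ret[1:] = close[1:] / close[:-1] - 1.0
    return np.abs(ret) < limit - tol


def masked_rolling_mean(prices, tradable, window):
    """Rolling mean along the time axis that never reads a masked price.

    prices, tradable: arrays of shape (T, N).
    Returns shape (T - window + 1, N); windows with no tradable
    observation yield NaN.
    """
    prices = np.asarray(prices, dtype=float)
    tradable = np.asarray(tradable, dtype=bool)
    # Apply the mask BEFORE any window is formed.
    clean = np.where(tradable, prices, 0.0)
    win_sum = sliding_window_view(clean, window, axis=0).sum(axis=-1)
    win_cnt = sliding_window_view(tradable, window, axis=0).sum(axis=-1)
    return np.where(win_cnt > 0, win_sum / np.maximum(win_cnt, 1), np.nan)


# One stock that closes limit-up (+10%) on the third day.
close = np.array([[10.0], [10.0], [11.0], [10.0], [10.0]])
mask = tradability_mask(close)

# Naive pipeline: the limit-up close enters every 3-day window.
naive = sliding_window_view(close, 3, axis=0).mean(axis=-1)
# Mask-first pipeline: the limit-up close is excluded from each window.
masked = masked_rolling_mean(close, mask, 3)
```

In this toy panel every 3-day window contains the limit-up close, so the naive mean is pulled above 10 while the masked mean stays at 10. Filtering rows after the window has been computed would not undo that, which is the contamination the abstract describes.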
