Rolling-window factor pipelines for Chinese A-share markets contain a subtle but costly flaw: daily price-move limits (±10% on the main board, ±20% on STAR/ChiNext) render a fraction of closing prices non-executable, yet standard implementations ingest these values before any row filtering runs. The contaminated aggregates propagate silently through moving averages, correlations, and ranks, a failure mode we term "upstream contamination". On real A-share data it inflates the apparent information coefficient by 18% while reducing realised Sharpe by 0.44 points, because the model learns to predict returns it cannot trade. We resolve this with a mask-first design: a Boolean tradability mask is constructed at data-load time and threaded through every operator, so that no window ever reads a non-tradable price. Built on this foundation, the system adds (i) a GPU-vectorised 213-factor engine built on PyTorch unfold primitives (51x faster than pandas); (ii) an Adjusted-MSE loss that penalises wrong-sign predictions 11x more heavily than magnitude errors; (iii) block-bootstrap GBM augmentation; and (iv) Markowitz-Ledoit-Wolf portfolio optimisation with cvxpy warm-start caching. On a calibrated 3,000-stock synthetic panel the system achieves an annualised Sharpe of 2.05; on proprietary real A-share data (2022-2024) it achieves a Sharpe of 1.63. Ablation shows that the mask contract is the single largest contributor (+0.44 Sharpe), exceeding any model or loss choice. The full implementation is released under the MIT licence at https://github.com/initial-d/ml-quant-trading.
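The mask-first idea can be illustrated with a minimal sketch in plain Python. It is not the released implementation, which the abstract describes as GPU-vectorised with PyTorch unfold. The function names `tradable_mask` and `masked_rolling_mean`, the tolerance `tol`, and the rule "a close at or beyond the limit band is non-tradable" are assumptions made for illustration only; real limit prices are rounded to the tick and the actual mask construction may differ.

```python
import math

def tradable_mask(close, prev_close, limit=0.10, tol=1e-4):
    """Hypothetical rule: a close is tradable only if its move from the
    previous close lies strictly inside the daily limit band."""
    return [abs(c / p - 1.0) < limit - tol for c, p in zip(close, prev_close)]

def masked_rolling_mean(values, mask, window, min_valid=1):
    """Rolling mean that reads only entries whose mask is True.
    Returns NaN until a full window is available, or when fewer than
    min_valid tradable entries fall inside the window."""
    out = []
    for t in range(len(values)):
        if t + 1 < window:
            out.append(math.nan)
            continue
        lo = t - window + 1
        valid = [v for v, m in zip(values[lo:t + 1], mask[lo:t + 1]) if m]
        out.append(sum(valid) / len(valid) if len(valid) >= min_valid else math.nan)
    return out

# Toy series: the second day closes limit-up (+10%) and is masked out.
prev_close = [10.0, 10.0, 11.0, 11.0]
close = [10.0, 11.0, 11.0, 11.2]

mask = tradable_mask(close, prev_close)
means = masked_rolling_mean(close, mask, window=2)
```

Here the mask is built once, before any aggregation, and the rolling operator never reads the limit-up close. A conventional pipeline would instead average all four closes and filter rows afterwards, which is the upstream contamination described above.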