Implementation Risk in Portfolio Backtesting: A Previously Unquantified Source of Error

Portfolio backtesting is the primary tool for evaluating investment strategies before deployment, yet practitioners implicitly assume that different engines produce identical results for the same strategy. We formalise implementation risk, the systematic divergence in backtested portfolio metrics arising solely from differences in how engines implement the same logical strategy, and propose four metrics grounded in metrology to quantify it: engine sensitivity, implementation uncertainty interval, divergence amplification factor, and conclusion stability index. We execute 15 benchmark strategies through five independent open-source engines on 30 non-overlapping stratified asset buckets comprising 180 S&P 500 stocks under four transaction-cost regimes. At zero cost, all five engines agree exactly (maximum divergence 0.000%), isolating transaction-cost implementation as the sole source of disagreement. Under nonzero costs, divergence is structured and predictable (Spearman rho = 0.93 with cost intensity), remaining below 0.75 percentage points for most strategies but reaching 3.71% for high-turnover rotation strategies. Source-code forensics uncovered seven previously undocumented defects across three engines, which we abstract into a five-category failure-mode taxonomy. All engines agree on the sign of every performance metric (conclusion stability index = 1), so implementation risk does not alter investment decisions for the strategies studied but introduces measurable ambiguity in performance attribution. Code and benchmark data are publicly available.
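As a rough illustration of the kind of cross-engine comparison described above, the following Python sketch computes the largest pairwise gap in a metric across engines and the share of engines that agree on its sign. The abstract does not give the formal definitions of the four proposed metrics, so the function names, the engine labels, and the return figures here are all hypothetical and should not be read as the authors' definitions.

```python
from itertools import combinations


def max_divergence(metric_by_engine):
    """Largest absolute pairwise gap between engines for one metric.

    Assumption: this is an illustrative stand-in, not the paper's
    definition of engine sensitivity or any other proposed metric.
    """
    values = list(metric_by_engine.values())
    return max((abs(a - b) for a, b in combinations(values, 2)), default=0.0)


def sign_agreement(metric_by_engine):
    """Share of engines that report the most common sign for a metric.

    Assumption: a value of 1.0 means all engines agree on the sign,
    which mirrors (but is not claimed to equal) the conclusion
    stability index mentioned in the abstract.
    """
    signs = [(v > 0) - (v < 0) for v in metric_by_engine.values()]
    most_common = max(set(signs), key=signs.count)
    return signs.count(most_common) / len(signs)


# Hypothetical annualised returns for one strategy from three engines.
returns = {"engine_a": 0.082, "engine_b": 0.079, "engine_c": 0.081}

divergence = max_divergence(returns)
agreement = sign_agreement(returns)
print(f"max divergence: {divergence:.4f}, sign agreement: {agreement:.2f}")
```

In this made-up example the engines differ by at most about 0.3 percentage points and all report a positive return, so the divergence affects the reported magnitude but not the direction of the conclusion.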
