Multi-competitor races often feature complicated within-race strategies that are difficult to capture when training models on race-outcome-level data. Further, models that do not account for such strategic effects may suffer from confounded inferences and predictions. In this work we develop a general generative model for multi-competitor races that allows analysts to explicitly model certain strategic effects, such as changing lanes or drafting, and to separate these effects from competitor ability. The generative model allows one to simulate full races from any real or constructed starting position, which opens new avenues for attributing value to within-race actions and for performing counterfactual analyses. The methodology is sufficiently general to apply to any track-based multi-competitor race for which tracking data are available and competitor movement is well described by simultaneous forward and lateral movements. We apply this methodology to one-mile horse races using data provided by the New York Racing Association (NYRA) and the New York Thoroughbred Horsemen's Association (NYTHA) for the Big Data Derby 2022 Kaggle Competition. These data include granular frame-level tracking of all horses, sampled at approximately 4 Hz. We demonstrate how this model can yield new inferences, such as horse-specific speed profiles that vary over phases of the race, and we present posterior predictive counterfactual simulations that answer questions of interest, such as how starting lane affects race outcomes.