This study investigates crash severity modeling strategies for work-zone crashes involving large vehicles (i.e., trucks, buses, and vans) when the data are imbalanced between low-severity (LS) and high-severity (HS) crashes. We used crash data involving large vehicles in South Carolina work zones from 2014 to 2018, which contained four times as many LS crashes as HS crashes. The objective is to evaluate the crash severity prediction performance of statistical, machine learning, and deep learning models under different feature selection and data balancing techniques. The findings show a disparity between LS and HS predictions, with lower accuracy for HS crashes due to class imbalance and feature overlap. Discriminative Mutual Information (DMI) yields the most effective feature set for predicting HS crashes without requiring data balancing, particularly when paired with gradient boosting models such as CatBoost, XGBoost, and LightGBM, and with deep neural networks such as NeuralNetTorch. Among data balancing techniques, NearMiss-1 maximizes HS recall when combined with DMI-selected features and certain models such as LightGBM, making this combination well suited to HS crash prediction. Conversely, RandomUnderSampler, HS Class Weighting, and RandomOverSampler achieve more balanced performance, defined as an equitable trade-off between LS and HS metrics, especially when applied to NeuralNetTorch, NeuralNetFastAI, CatBoost, LightGBM, and Bayesian Mixed Logit (BML) using merged feature sets or no feature selection. These insights offer safety analysts guidance on selecting models, feature selection methods, and data balancing techniques aligned with specific safety goals, providing a foundation for improving work-zone crash severity prediction.
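To make the imbalance problem concrete, the following is a minimal, standard-library-only sketch and not the study's pipeline. It uses toy labels with the 4:1 LS-to-HS ratio described above, a hand-written random undersampler, and inverse-frequency class weights. The function names, the toy data, and the weighting formula (n_samples / (n_classes * class_count), a common heuristic) are all assumptions for illustration; the abstract does not specify how HS Class Weighting was configured.

```python
import random
from collections import Counter


def random_undersample(labels, seed=0):
    """Return sorted indices keeping an equal-size random subset of every class.

    The subset size is the size of the smallest class, so the minority
    class (HS here) is kept in full.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)


def inverse_frequency_weights(labels):
    """Class weights n_samples / (n_classes * class_count); an assumed heuristic."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def per_class_recall(y_true, y_pred):
    """Recall computed separately for each true class."""
    hits, totals = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return {c: hits[c] / totals[c] for c in totals}


# Toy labels (hypothetical data) with a 4:1 LS-to-HS ratio.
labels = ["LS"] * 80 + ["HS"] * 20

kept = random_undersample(labels)
balanced = Counter(labels[i] for i in kept)   # equal LS and HS counts

weights = inverse_frequency_weights(labels)   # HS weighted 4x LS

# A trivial predictor that always outputs the majority class.
all_ls = ["LS"] * len(labels)
recall = per_class_recall(labels, all_ls)     # perfect LS recall, zero HS recall
```

The always-LS predictor reaches 80% overall accuracy on these toy labels while detecting no HS crashes, which is why the LS and HS metrics are reported separately and why balancing or class weighting is considered at all.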