Crash Severity Risk Modeling Strategies under Data Imbalance

From arXiv. This second revised version has been resubmitted to Transportation Research Record: Journal of the Transportation Research Board after addressing the reviewers' comments and is currently awaiting the final decision.

This study investigates crash severity risk modeling strategies for work zones involving large vehicles (i.e., trucks, buses, and vans) under crash data imbalance between low-severity (LS) and high-severity (HS) crashes. We utilized crash data involving large vehicles in South Carolina work zones from 2014 to 2018, which included four times more LS crashes than HS crashes. The objective of this study is to evaluate the crash severity prediction performance of various statistical, machine learning, and deep learning models under different feature selection and data balancing techniques. Findings highlight a disparity in LS and HS predictions, with lower accuracy for HS crashes due to class imbalance and feature overlap. Discriminative Mutual Information (DMI) yields the most effective feature set for predicting HS crashes without requiring data balancing, particularly when paired with gradient boosting models and deep neural networks such as CatBoost, NeuralNetTorch, XGBoost, and LightGBM. Data balancing techniques such as NearMiss-1 maximize HS recall when combined with DMI-selected features and certain models such as LightGBM, making them well-suited for HS crash prediction. Conversely, RandomUnderSampler, HS Class Weighting, and RandomOverSampler achieve more balanced performance, which is defined as an equitable trade-off between LS and HS metrics, especially when applied to NeuralNetTorch, NeuralNetFastAI, CatBoost, LightGBM, and Bayesian Mixed Logit (BML) using merged feature sets or models without feature selection. The insights from this study offer safety analysts guidance on selecting models, feature selection, and data balancing techniques aligned with specific safety goals, providing a robust foundation for enhancing work-zone crash severity prediction.
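The imbalance issue described in the abstract can be illustrated with a small sketch. It uses synthetic labels at the 4:1 LS-to-HS ratio reported above, not the study's data, and the helper names (`class_weights`, `random_undersample`, `recall`) are hypothetical. The weighting formula is the common inverse-frequency heuristic, which is an assumption here; the study's HS Class Weighting may be configured differently. The undersampler is a plain-Python stand-in for the idea behind RandomUnderSampler, not that library's implementation.

```python
import random
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_undersample(labels, seed=0):
    """Return sorted indices keeping, per class, a random subset the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)


def recall(y_true, y_pred, cls):
    """Fraction of samples whose true label is `cls` that were predicted as `cls`."""
    total = sum(1 for t in y_true if t == cls)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    return hits / total if total else 0.0


# Synthetic labels at a 4:1 LS-to-HS ratio (illustrative only).
labels = ["LS"] * 800 + ["HS"] * 200

# A trivial baseline that always predicts the majority class.
always_ls = ["LS"] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_ls)) / len(labels)
hs_recall = recall(labels, always_ls, "HS")
ls_recall = recall(labels, always_ls, "LS")

# Two of the balancing ideas named in the abstract.
weights = class_weights(labels)
balanced_idx = random_undersample(labels)
balanced_counts = Counter(labels[i] for i in balanced_idx)

print(f"accuracy={accuracy:.2f} LS recall={ls_recall:.2f} HS recall={hs_recall:.2f}")
print("class weights:", weights)
print("after undersampling:", dict(balanced_counts))
```

The always-LS baseline scores high on overall accuracy while detecting no HS crashes at all, which is why per-class metrics such as HS recall matter when comparing balancing techniques. Weighting raises the cost of HS errors without discarding data, whereas undersampling equalizes class counts by dropping LS records.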

