Train delays result from complex interactions among operational, technical, and environmental factors. Although weather affects railway reliability, particularly in Nordic regions, existing datasets rarely integrate meteorological information with operational train data. This study presents the first publicly available dataset combining Finnish railway operations with synchronized meteorological observations from 2018 to 2024. The dataset integrates operational metrics from Finland's Digitraffic Railway Traffic Service with weather measurements from 209 environmental monitoring stations, aligned in space and time using the Haversine distance. It comprises 28 engineered features spanning operational variables and meteorological measurements, and covers approximately 38.5 million observations from Finland's 5,915-kilometer rail network. Preprocessing includes missing-data handling through spatial fallback algorithms, cyclical encoding of temporal features, and robust scaling of weather data to address sensor outliers. Analysis reveals distinct seasonal patterns, with delay rates exceeding 25% in winter months, and geographic clustering of high-delay corridors in central and northern Finland. The work also demonstrates how the dataset can be used to analyse the reliability of railway traffic in Finland. A baseline experiment using XGBoost regression achieved a Mean Absolute Error of 2.73 minutes when predicting station-specific delays, demonstrating the dataset's utility for machine learning. The dataset supports diverse applications, including train delay prediction, weather impact assessment, and infrastructure vulnerability mapping, giving researchers a flexible resource for railway operations research.
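The preprocessing steps named above (Haversine-based station matching with a spatial fallback, cyclical encoding, and robust scaling) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the station dictionary layout, the fallback rule (use the next-nearest station when the nearest one has no reading), and the median/IQR form of robust scaling are all assumptions made for illustration.

```python
import math
from statistics import median, quantiles

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_reading(lat, lon, stations, variable):
    """Return (station name, value) from the closest weather station that has
    a reading for `variable`. Assumed fallback rule: if the nearest station's
    value is missing (None), try the next-nearest, and so on."""
    ranked = sorted(stations, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))
    for station in ranked:
        value = station.get(variable)
        if value is not None:
            return station["name"], value
    return None, None


def cyclical_encode(value, period):
    """Map a periodic quantity (hour of day, month, ...) onto the unit circle,
    so that e.g. hour 23 and hour 0 end up close together."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def robust_scale(values):
    """Centre on the median and divide by the interquartile range, which is
    far less sensitive to sensor outliers than mean/std scaling."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [0.0 for _ in values]
    centre = median(values)
    return [(v - centre) / iqr for v in values]


# Hypothetical weather stations; coordinates are approximate city centres.
stations = [
    {"name": "A", "lat": 60.17, "lon": 24.94, "temperature": None},  # nearest, but missing
    {"name": "B", "lat": 60.29, "lon": 25.04, "temperature": -4.5},
    {"name": "C", "lat": 61.50, "lon": 23.76, "temperature": -9.0},
]
matched = nearest_reading(60.1719, 24.9414, stations, "temperature")
```

Here `matched` picks station `B`, because the closest station `A` has no temperature reading. Sorting every station per lookup is fine for a sketch; at the scale of millions of observations a spatial index would be a more reasonable choice.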