Machine learning models benefit from learning temporal trends in time-stamped administrative data. These trends can be represented by dividing a model's observation window into time segments, or bins. Representing each feature at its own time resolution can improve both training time and model performance. However, it also causes the search space for the time bin size hyperparameter to grow exponentially with the number of features. This paper proposes a computationally efficient technique, time series analysis to investigate binning (TAIB), that determines which subset of data features benefits most from time bin size hyperparameter tuning. The technique is demonstrated on hospital and housing/homelessness administrative data sets. The results show that TAIB leads to models that are not only more efficient to train but can also perform better than models that default to representing all features with the same time bin size.
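To make the binning idea and the search-space growth concrete, the following is a minimal Python sketch. It is not the TAIB technique itself, which the abstract does not specify. It only illustrates how one observation window can be binned at a different resolution per feature, and why tuning a bin size for every feature is exponential in the number of features. The feature names, event offsets, window length, and candidate bin sizes are all hypothetical.

```python
def bin_events(event_days, window_days, bin_days):
    """Count events per time bin for a single feature.

    event_days: day offsets of events, measured from the start of the
                observation window (hypothetical input format).
    Returns one count per bin; the final bin may cover fewer days.
    """
    n_bins = -(-window_days // bin_days)  # ceiling division
    counts = [0] * n_bins
    for day in event_days:
        if 0 <= day < window_days:  # ignore events outside the window
            counts[day // bin_days] += 1
    return counts


def search_space_size(n_features, n_candidate_bin_sizes):
    """Number of bin-size configurations when every feature is tuned independently."""
    return n_candidate_bin_sizes ** n_features


# Hypothetical 360-day observation window with two illustrative features.
window_days = 360
er_visits = [3, 10, 45, 200, 201, 359]
shelter_stays = [0, 1, 2, 30, 31, 180]

# Each feature gets its own time resolution.
er_binned = bin_events(er_visits, window_days, bin_days=90)
shelter_binned = bin_events(shelter_stays, window_days, bin_days=30)

# Assumed example sizes: 10 features, 4 candidate bin sizes.
shared_configs = search_space_size(1, 4)        # one bin size shared by all features
per_feature_configs = search_space_size(10, 4)  # independent bin size per feature
```

With these assumed numbers, a shared bin size requires evaluating 4 configurations, while independent per-feature tuning requires 4^10 = 1,048,576. This gap is the motivation for first identifying the subset of features worth tuning.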