Using administrative data to predict outcomes is an important application area of machine learning, particularly in healthcare. Most administrative data records are timestamped, and the pattern of records over time is a key input for machine learning models. This paper explores how best to divide the observation window of a machine learning model into time segments, or "bins". A computationally efficient process is presented that identifies which data features benefit most from smaller, higher-resolution time segments. Results on healthcare and housing/homelessness administrative data demonstrate that optimizing the time bin size of these high-priority features, while using a single time bin for the other features, yields machine learning models that are simpler and quicker to train. This approach also achieves performance similar to, and sometimes better than, that of more complex models that default to representing all data features with the same time resolution.
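The idea of giving selected features a finer time resolution while collapsing the rest into a single bin can be illustrated with a minimal sketch. It is not the paper's implementation and does not cover the process for choosing which features to prioritize. The function name `bin_event_counts`, the feature names `er_visit` and `shelter_stay`, the 360-day window, and the use of equal-width bins are all assumptions made for illustration.

```python
from collections import Counter


def bin_event_counts(events, window_days, bins_per_feature):
    """Count events per (feature, bin) over an observation window.

    events: iterable of (feature_name, day_offset) pairs, where day_offset
        is the number of days from the start of the observation window.
    window_days: length of the observation window in days.
    bins_per_feature: dict mapping a feature name to its number of
        equal-width bins; features not listed get a single bin.
    """
    counts = Counter()
    for feature, day in events:
        if not 0 <= day < window_days:
            continue  # ignore records outside the observation window
        n_bins = bins_per_feature.get(feature, 1)
        # Integer arithmetic maps the day offset to a bin index in [0, n_bins).
        bin_index = (day * n_bins) // window_days
        counts[(feature, bin_index)] += 1
    return dict(counts)


# Hypothetical records: (feature, days since start of window).
events = [
    ("er_visit", 3),
    ("er_visit", 200),
    ("er_visit", 350),
    ("shelter_stay", 10),
    ("shelter_stay", 300),
]

# Assumed prioritization: only "er_visit" gets four 90-day bins.
features = bin_event_counts(events, window_days=360, bins_per_feature={"er_visit": 4})
```

Here the prioritized feature contributes one count per occupied 90-day bin, while `shelter_stay` contributes a single total for the whole window, so the resulting feature vector stays small.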