Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

Streaming Continual Learning (CL) typically converts a continuous stream into a sequence of discrete tasks through temporal partitioning. We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark conclusions. To study this effect, we introduce a taskification-level framework based on plasticity and stability profiles, a profile distance between taskifications, and Boundary-Profile Sensitivity (BPS), which diagnoses how strongly small boundary perturbations alter the induced regime before any CL model is trained. We evaluate continual finetuning, Experience Replay, Elastic Weight Consolidation, and Learning without Forgetting on network traffic forecasting with CESNET-Timeseries24, keeping the stream, model, and training budget fixed while varying only the temporal taskification. Across 9-, 30-, and 44-day splits, we observe substantial changes in forecasting error, forgetting, and backward transfer, showing that taskification alone can materially affect CL evaluation. We further find that shorter taskifications induce noisier distribution-level patterns, larger structural distances, and higher BPS, indicating greater sensitivity to boundary perturbations. These results show that benchmark conclusions in streaming CL depend not only on the learner and the data stream, but also on how that stream is taskified, motivating temporal taskification as a first-class evaluation variable.
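
As a minimal illustration of what varying only the temporal taskification means, the sketch below partitions a stream of day indices into contiguous fixed-length tasks. The function name `taskify`, the `offset` parameter (a crude stand-in for a small boundary perturbation), and the 90-day stream length are assumptions made for illustration; they are not taken from the paper, and the sketch does not implement the plasticity and stability profiles or BPS.

```python
from typing import List, Tuple


def taskify(n_days: int, task_len: int, offset: int = 0) -> List[Tuple[int, int]]:
    """Split day indices [0, n_days) into contiguous half-open tasks.

    `offset` (hypothetical) shifts every interior boundary by that many
    days; the start and end of the stream stay fixed.
    """
    if task_len <= 0:
        raise ValueError("task_len must be positive")
    if not 0 <= offset < task_len:
        raise ValueError("offset must satisfy 0 <= offset < task_len")
    # Interior boundaries every `task_len` days, shifted by `offset`.
    cuts = list(range(task_len + offset, n_days, task_len))
    edges = [0] + cuts + [n_days]
    # Pair consecutive edges into (start, end) intervals, end exclusive.
    return list(zip(edges[:-1], edges[1:]))


if __name__ == "__main__":
    # Assumed 90-day stream, split with the three task lengths from the abstract.
    for task_len in (9, 30, 44):
        tasks = taskify(90, task_len)
        print(task_len, len(tasks), tasks[-1])
```

Under this sketch, the same stream yields a different number of tasks and different task boundaries for each task length, and a trailing shorter task whenever the stream length is not a multiple of the task length. How such a remainder is handled is itself a taskification choice that the abstract does not specify.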
