Multivariate time series (MTS) are frequently affected by co-occurring quality issues, such as missing values, outliers, and constraint violations, which significantly undermine downstream analytics. Existing cleaning approaches each target only a narrow subset of these issues, making them ill-suited for scenarios where multiple quality problems arise simultaneously. Furthermore, these methods commonly depend on ground truth data or domain-specific rules, both of which are rarely available in real-world applications. In this paper, we introduce \sys, a reinforcement-learning-based agent system designed to repair multiple co-occurring data quality issues in MTS. We cast the cleaning process as a joint optimization problem over the order in which quality issues are addressed and the cleaning model applied to each issue, allowing efficient navigation of the large space of possible cleaning pipelines. Our framework relies on a hierarchical agent architecture, in which a high-level agent determines the order in which data quality issues are processed, while a low-level agent selects the most suitable cleaning method for each issue. To guide the agents toward an optimal cleaning pipeline, we propose a dual-stage reward mechanism that couples upstream (cleaning) and downstream performance, enabling effective optimization without relying on ground truth. Our experimental results show that \sys consistently outperforms existing methods, achieving up to 96\% improvement in data cleaning quality and up to 27\% improvement in downstream performance.
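To make the joint search over issue order and method choice concrete, the following is a minimal illustrative sketch and not the implementation of \sys. It assumes three issue types, a hypothetical catalogue of cleaning methods per issue, and a purely synthetic reward that mixes an "upstream" and a "downstream" score with a weight `alpha`. All names (`METHODS`, `SYNTHETIC_UPSTREAM`, `toy_reward`, `EpsilonGreedyAgent`) and all numeric scores are assumptions introduced for illustration. The paper's reinforcement learning algorithm is replaced here by a simple tabular epsilon-greedy learner.

```python
import itertools
import random
from collections import defaultdict

ISSUES = ("missing", "outlier", "constraint")

# Hypothetical catalogue of cleaning methods per issue (assumption).
METHODS = {
    "missing": ("forward_fill", "linear_interp"),
    "outlier": ("zscore_clip", "median_filter"),
    "constraint": ("clamp_to_bounds",),
}

# Synthetic per-method scores, made up solely to drive the toy reward.
SYNTHETIC_UPSTREAM = {
    "forward_fill": 0.4, "linear_interp": 0.8,
    "zscore_clip": 0.5, "median_filter": 0.7,
    "clamp_to_bounds": 0.6,
}


def pipeline_space():
    """Enumerate every pipeline: an issue order plus one method per issue."""
    for order in itertools.permutations(ISSUES):
        for choice in itertools.product(*(METHODS[i] for i in order)):
            yield tuple(zip(order, choice))


def toy_reward(pipeline, alpha=0.5):
    """Synthetic two-part reward; needs no ground-truth clean data."""
    upstream = sum(SYNTHETIC_UPSTREAM[m] for _, m in pipeline) / len(pipeline)
    order = [issue for issue, _ in pipeline]
    # Assumed toy rule: the downstream task benefits when missing values
    # are repaired before outliers are handled.
    downstream = 1.0 if order.index("missing") < order.index("outlier") else 0.0
    return alpha * upstream + (1.0 - alpha) * downstream


class EpsilonGreedyAgent:
    """Tabular epsilon-greedy learner over (state, action) pairs."""

    def __init__(self, epsilon, rng):
        self.epsilon = epsilon
        self.rng = rng
        self.q = defaultdict(float)   # running mean reward per (state, action)
        self.n = defaultdict(int)     # visit counts

    def choose(self, state, actions):
        actions = list(actions)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(actions)
        return max(actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward):
        key = (state, action)
        self.n[key] += 1
        self.q[key] += (reward - self.q[key]) / self.n[key]


def run_episode(high, low, reward_fn):
    """High-level agent picks the next issue; low-level agent picks its method."""
    remaining = list(ISSUES)
    pipeline, trace = [], []
    while remaining:
        state = tuple(sorted(remaining))
        issue = high.choose(state, remaining)
        method = low.choose(issue, METHODS[issue])
        trace.append((state, issue, method))
        pipeline.append((issue, method))
        remaining.remove(issue)
    reward = reward_fn(tuple(pipeline))
    # Both levels are credited with the same end-of-episode reward.
    for state, issue, method in trace:
        high.update(state, issue, reward)
        low.update(issue, method, reward)
    return tuple(pipeline), reward


def train(episodes=500, seed=0):
    """Run several episodes and return the best (pipeline, reward) seen."""
    rng = random.Random(seed)
    high = EpsilonGreedyAgent(0.2, rng)
    low = EpsilonGreedyAgent(0.2, rng)
    best = (None, float("-inf"))
    for _ in range(episodes):
        pipeline, reward = run_episode(high, low, toy_reward)
        if reward > best[1]:
            best = (pipeline, reward)
    return best


if __name__ == "__main__":
    print(len(list(pipeline_space())), "candidate pipelines")
    print(train())
```

Even in this toy setting the space contains 3! orderings times 2 x 2 x 1 method combinations, that is, 24 pipelines, and it grows factorially with the number of issue types. This growth is why a learned search policy can be preferable to exhaustive enumeration.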