Feature transformation is crucial for classic machine learning: it aims to generate feature combinations that enhance the performance of downstream tasks from a data-centric perspective. Current methodologies, such as manual expert-driven processes, iterative-feedback techniques, and exploration-generative tactics, have shown promise in automating such data engineering workflows and minimizing human involvement. However, three challenges remain in those frameworks: (1) Evaluation depends predominantly on downstream task performance metrics, which are time-consuming to compute, especially for large datasets. (2) The diversity of feature combinations is hard to guarantee once random exploration ends. (3) Significant transformations are rare, so valuable feedback is sparse, which hinders the learning process or leads to less effective results. In response to these challenges, we introduce FastFT, a framework that leverages three strategies. We first decouple the evaluation of feature transformations from the downstream outcomes of the generated datasets via a performance predictor. To address reward sparsity, we develop a method to evaluate the novelty of generated transformation sequences. Incorporating this novelty into the reward function accelerates the model's exploration of effective transformations, thereby improving search efficiency. Additionally, we combine novelty and performance to create a prioritized memory buffer, ensuring that essential experiences are effectively revisited during exploration. Our extensive experimental evaluations validate the performance, efficiency, and traceability of the proposed framework, showcasing its superiority in handling complex feature transformation tasks.
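To make the interplay of novelty, predicted performance, and the prioritized memory buffer more concrete, the following is a minimal Python sketch. It is not FastFT's implementation: the abstract does not specify the novelty measure, the reward formula, or the buffer policy, so everything here is an assumption chosen for illustration. In particular, `jaccard_novelty`, `combined_reward`, `novelty_weight`, and `PrioritizedBuffer` are hypothetical names, novelty is approximated by token-set dissimilarity to previously seen sequences, and the reward is a simple weighted sum.

```python
import random


def jaccard_novelty(sequence, seen_sequences):
    """Assumed novelty measure: 1 minus the highest Jaccard similarity
    between this sequence's token set and any previously seen sequence."""
    tokens = set(sequence)
    best_similarity = 0.0
    for other in seen_sequences:
        other_tokens = set(other)
        union = tokens | other_tokens
        similarity = len(tokens & other_tokens) / len(union) if union else 1.0
        best_similarity = max(best_similarity, similarity)
    # With no history, nothing is similar, so novelty is 1.0.
    return 1.0 - best_similarity


def combined_reward(predicted_performance, novelty, novelty_weight=0.1):
    """Assumed reward shaping: predictor score plus a weighted novelty bonus."""
    return predicted_performance + novelty_weight * novelty


class PrioritizedBuffer:
    """Illustrative fixed-capacity buffer that keeps high-priority experiences
    and samples them in proportion to their priority."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # list of (priority, experience) pairs

    def add(self, experience, priority):
        self.items.append((priority, experience))
        if len(self.items) > self.capacity:
            # Evict the lowest-priority entry once the buffer overflows.
            lowest = min(range(len(self.items)), key=lambda i: self.items[i][0])
            self.items.pop(lowest)

    def sample(self, k, rng=random):
        if not self.items:
            return []
        # Clamp priorities so every stored item has a positive weight.
        weights = [max(priority, 1e-8) for priority, _ in self.items]
        experiences = [experience for _, experience in self.items]
        return rng.choices(experiences, weights=weights, k=k)


# Usage: score a new transformation sequence and store it for replay.
seen = [["f1", "+", "f2"]]
candidate = ["f1", "*", "f3"]
novelty = jaccard_novelty(candidate, seen)
reward = combined_reward(predicted_performance=0.8, novelty=novelty)
buffer = PrioritizedBuffer(capacity=2)
buffer.add(candidate, priority=reward)
```

In this sketch the same combined score serves as both the reward and the replay priority; a real system could weight the two terms differently for each purpose.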