To fully harness the potential of machine learning, the field must become more accessible and less daunting to people who lack a detailed understanding of its intricacies. This paper describes the design of a system that integrates AutoML, XAI, and synthetic data generation behind a user experience built for high usability, letting users apply machine learning while its complexities are abstracted away. The paper proposes two novel classifiers, Logistic Regression Forest and Support Vector Tree, for enhanced model performance, achieving 96\% accuracy on a diabetes dataset and 93\% on a survey dataset. It also introduces a model-dependent local interpreter called MEDLEY and evaluates its interpretations against those of LIME, Greedy, and Parzen. In addition, the paper introduces three approaches to synthetic data: LLM-based generation, library-based generation, and enhancing the original dataset with a GAN. The findings suggest that enhancing the original dataset with a GAN is the most reliable of these, as evidenced by KS tests, standard deviation, and feature importance. The authors also found that GANs work best for quantitative datasets.
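The abstract cites KS tests as one piece of evidence for judging synthetic data but does not say how they were applied. As a rough illustration only, the sketch below assumes a per-feature two-sample Kolmogorov-Smirnov comparison between a real column and a synthetic one. The function name `ks_statistic` and the sample data are hypothetical, not taken from the paper, and only the Python standard library is used.

```python
import random
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(real), sorted(synthetic)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The ECDF gap can only change at observed values, so check each one.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Illustrative data only (an assumption, not the paper's datasets):
# one "real" numeric feature and two hypothetical synthetic versions of it.
rng = random.Random(0)
real = [rng.gauss(120.0, 15.0) for _ in range(500)]
close_synth = [rng.gauss(120.0, 15.0) for _ in range(500)]
shifted_synth = [rng.gauss(140.0, 15.0) for _ in range(500)]

d_close = ks_statistic(real, close_synth)
d_shifted = ks_statistic(real, shifted_synth)
print(f"same distribution:    D = {d_close:.3f}")
print(f"shifted distribution: D = {d_shifted:.3f}")
```

A smaller statistic means the synthetic feature tracks the real distribution more closely. A full evaluation would also need a p-value or threshold, and presumably the standard deviation and feature-importance comparisons the abstract mentions.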