Can AI be Easy? Lessons Learned from the EZR.py Toolkit

Much recent press coverage claims that developers no longer need to read code. We disagree, at least within the domain of tabular software-engineering (SE) optimization tasks: rows of $x$ and $y$ values where the $y$ values are expensive to obtain. As evidence, we present EZR.py, a 400-line Python toolkit (no heavy dependencies) that implements Naive Bayes, $k$-means clustering, classification and regression trees, simulated annealing, local search, active learning, and complementary-Bayes text-mining relevance filtering for tabular SE data. EZR was built by repeatedly reading and refactoring AI tools in order to simplify and unify them. The result demonstrates that many seemingly different learning algorithms are nearly the same once stripped back to their core: classical algorithms collapse to a few lines each, and a state-of-the-art active learner fits in roughly 80 lines. Tested on the 120+ tabular SE optimization tasks in the MOOT repository, these tiny tools perform as well as or better than state-of-the-art explanation tools (SHAP, LIME), the SMAC3 optimizer, and SVM-based text-mining filters (FASTREAD), while running 500$\times$ faster than SMAC3, using orders of magnitude less labelled data, and building trees from fewer than ten variables even when thousands are available. We conclude that, within the scope of tabular SE optimization, reading and refactoring code is a useful way to generate insight, and that small unified toolkits can rival large libraries. EZR is available under an open-source license. Install via \textsf{pip install ezr}; example data is at \textsf{github.com/timm/moot}.

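The claim that seemingly different learners share a common core can be illustrated with a minimal sketch. It assumes, purely for illustration, that each numeric column is summarized incrementally by one small class, and that both a Naive Bayes score and a $k$-means-style centroid distance are computed from that same summary. The names `Num`, `summarize`, `bayes_score`, and `centroid_dist` are hypothetical and are not taken from EZR's actual API.

```python
import math

class Num:
    """Incremental summary of one numeric column (hypothetical helper, not EZR's API)."""
    def __init__(self):
        self.n, self.mu, self.m2 = 0, 0.0, 0.0
        self.lo, self.hi = math.inf, -math.inf

    def add(self, x):
        # Welford's online update of the mean and the sum of squared deviations.
        self.n += 1
        d = x - self.mu
        self.mu += d / self.n
        self.m2 += d * (x - self.mu)
        self.lo, self.hi = min(self.lo, x), max(self.hi, x)

    def sd(self):
        # Sample standard deviation; zero when fewer than two values were seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def like(self, x):
        # Gaussian density of x under this column: the piece Naive Bayes needs.
        sd = self.sd() + 1e-32
        return math.exp(-((x - self.mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

    def norm(self, x):
        # Rescale x to 0..1 using the observed range: the piece distance needs.
        return (x - self.lo) / (self.hi - self.lo + 1e-32)

def summarize(rows):
    """Build one Num per column from a list of numeric rows."""
    cols = [Num() for _ in rows[0]]
    for row in rows:
        for col, x in zip(cols, row):
            col.add(x)
    return cols

def bayes_score(cols, row, prior):
    """Log of prior times per-column likelihoods (the Naive Bayes assumption)."""
    return math.log(prior) + sum(math.log(c.like(x) + 1e-32) for c, x in zip(cols, row))

def centroid_dist(cols, all_cols, row):
    """Normalized Euclidean distance from row to a cluster's mean."""
    gaps = ((g.norm(x) - g.norm(c.mu)) ** 2 for c, g, x in zip(cols, all_cols, row))
    return math.sqrt(sum(gaps) / len(row))

# Toy data (invented for this sketch): two well-separated groups of rows.
a = [[1, 1], [2, 2], [3, 3]]
b = [[10, 10], [11, 11], [12, 12]]
row = [2.5, 2.5]

cols_a, cols_b, all_cols = summarize(a), summarize(b), summarize(a + b)
by_bayes = "a" if bayes_score(cols_a, row, 0.5) > bayes_score(cols_b, row, 0.5) else "b"
by_dist = "a" if centroid_dist(cols_a, all_cols, row) < centroid_dist(cols_b, all_cols, row) else "b"
print(by_bayes, by_dist)
```

In this sketch, the only difference between the two learners is which method of the shared summary they query (`like` versus `norm`), which is one plausible reading of how several classical algorithms could collapse to a few lines each.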