We introduce small-text, an easy-to-use active learning library that offers pool-based active learning for single- and multi-label text classification in Python. It features numerous pre-implemented state-of-the-art query strategies, including some that leverage the GPU. Standardized interfaces allow a variety of classifiers, query strategies, and stopping criteria to be combined, which makes it easy to mix and match components and enables rapid and convenient development of both active learning experiments and applications. To make various classifiers and query strategies accessible for active learning, small-text integrates several well-known machine learning libraries, namely scikit-learn, PyTorch, and Hugging Face transformers. The latter two integrations are optionally installable extensions, so GPUs can be used but are not required. Using this new library, we investigate the performance of the recently published SetFit training paradigm, which we compare to vanilla transformer fine-tuning, finding that it matches the latter in classification accuracy while outperforming it in area under the curve. The library is available under the MIT License at https://github.com/webis-de/small-text, in version 1.3.0 at the time of writing.
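To illustrate what combining a classifier, a query strategy, and a stopping criterion behind standardized interfaces can look like, the following is a minimal, standard-library-only sketch of a pool-based active learning loop. It is not the small-text API: every name in it (`CentroidClassifier`, `least_confidence`, `random_sampling`, `label_budget`, `active_learning_loop`) is hypothetical, and the toy 2-D points stand in for text features.

```python
import math
import random
from typing import Callable, Dict, List, Sequence

Point = Sequence[float]


class CentroidClassifier:
    """Toy classifier (hypothetical): softmax over negative centroid distances."""

    def fit(self, xs: List[Point], ys: List[int]) -> None:
        sums: Dict[int, List[float]] = {}
        counts: Dict[int, int] = {}
        for x, y in zip(xs, ys):
            acc = sums.setdefault(y, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
            counts[y] = counts.get(y, 0) + 1
        self.centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def predict_proba(self, x: Point) -> Dict[int, float]:
        scores = {y: math.exp(-math.dist(x, c)) for y, c in self.centroids.items()}
        total = sum(scores.values())
        return {y: s / total for y, s in scores.items()}


# Query strategies share one signature: (classifier, pool, unlabeled indices, k).
def least_confidence(clf, pool, unlabeled, k):
    # Pick the k instances whose most likely class has the lowest probability.
    return sorted(unlabeled, key=lambda i: max(clf.predict_proba(pool[i]).values()))[:k]


def random_sampling(clf, pool, unlabeled, k):
    # Baseline strategy: ignore the classifier and sample uniformly.
    return random.sample(list(unlabeled), k)


# Stopping criteria share one signature: (number of labeled instances) -> bool.
def label_budget(max_labels: int) -> Callable[[int], bool]:
    return lambda n_labeled: n_labeled >= max_labels


def active_learning_loop(pool, oracle, clf, query, should_stop, initial, batch_size=2):
    """Train, check the stopping criterion, query new labels, repeat."""
    labeled = {i: oracle(i) for i in initial}
    while True:
        clf.fit([pool[i] for i in labeled], [labeled[i] for i in labeled])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled or should_stop(len(labeled)):
            return labeled
        for i in query(clf, pool, unlabeled, min(batch_size, len(unlabeled))):
            labeled[i] = oracle(i)  # the oracle plays the human annotator


# Toy pool: two Gaussian clusters with known labels (assumed data).
rng = random.Random(0)
pool = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10)]
pool += [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(10)]
true_labels = [0] * 10 + [1] * 10
oracle = lambda i: true_labels[i]

clf = CentroidClassifier()
labeled = active_learning_loop(pool, oracle, clf, least_confidence, label_budget(8), [0, 10])
```

Because the strategies and criteria only agree on call signatures, swapping `least_confidence` for `random_sampling`, or `label_budget(8)` for another criterion, requires no change to the loop itself.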