Many NLP researchers rely on free computational services, such as Google Colab, to fine-tune their Transformer models. This limits hyperparameter optimization (HPO) for long-text classification, since self-attention has quadratic complexity and longer inputs demand more resources. For Indonesian, only a few works address long-text classification with Transformers; most use small datasets and report no HPO. In this study, using 18k news articles, we investigate which pretrained models are preferable based on the output length of their tokenizers. We then compare several hacks to shorten and enrich the sequences: removing stopwords, punctuation, low-frequency words, and recurring words. For a fair comparison, we propose and run an efficient, dynamic HPO procedure that can be performed gradually on limited resources and does not require a long-running optimization library. Using the best hack found, we then compare sequence lengths of 512, 256, and 128 tokens. We find that removing stopwords while keeping punctuation and low-frequency words is the best hack. Some of our setups outperform taking the first 512 tokens by using only the first 128 or 256 tokens, which represent the same information while requiring less computational resources. These findings can help developers efficiently pursue optimal model performance on limited resources.
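The best-performing hack above (drop stopwords, keep punctuation and low-frequency words, then take the first N tokens) can be sketched as a simple preprocessing step. This is an illustrative assumption, not the paper's actual pipeline: the stopword list here is a tiny made-up sample, and in practice a full Indonesian stopword list and a subword tokenizer (e.g. WordPiece) would run after this filtering.

```python
import re

# Tiny assumed sample of Indonesian stopwords for illustration only;
# a real list (e.g. from NLTK or Sastrawi) would be used in practice.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "untuk", "pada", "adalah"}

def shorten(text: str, max_tokens: int = 128) -> str:
    """Remove stopwords, keep punctuation, truncate to the first N tokens."""
    tokens = text.split()
    # Strip non-word characters before the stopword check so that
    # punctuation-attached forms still match; bare punctuation tokens
    # reduce to "" and are therefore kept, as in the best hack.
    kept = [t for t in tokens
            if re.sub(r"\W+", "", t).lower() not in STOPWORDS]
    return " ".join(kept[:max_tokens])

print(shorten("Presiden dan menteri bertemu di Jakarta untuk membahas anggaran ."))
# -> "Presiden menteri bertemu Jakarta membahas anggaran ."
```

Filtering before truncation is what lets a 128- or 256-token prefix carry roughly the content of an unfiltered 512-token prefix.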