LFTK: Handcrafted Features in Computational Linguistics

Past research has identified a rich set of handcrafted linguistic features that can potentially assist various tasks. However, their sheer number makes it difficult to select and use existing features effectively. Implementations are also inconsistent across research works, and there is no categorization scheme or generally accepted set of feature names, which creates confusion. In addition, most existing handcrafted feature extraction libraries are either not open-source or not actively maintained, so researchers often have to build an extraction system from the ground up. We collect and categorize more than 220 popular handcrafted features grounded in past literature. We then conduct a correlation analysis on several task-specific datasets and report the potential use cases of each feature. Lastly, we devise a multilingual handcrafted linguistic feature extraction system designed to be systematically expandable. We open-source the system to give the public access to a rich set of pre-implemented handcrafted features. The system is named LFTK and is the largest of its kind. Find it at github.com/brucewlee/lftk.
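To make the workflow in the abstract concrete, the following is a minimal sketch of the general idea: compute a few handcrafted features per text, then correlate each feature with a task label. It does not use LFTK or reproduce its API. The function names, the feature names (total_words, type_token_ratio, and so on), and the regex-based tokenization are all assumptions chosen for illustration, and the four features shown are only simple stand-ins for the 220+ features the paper catalogs.

```python
import math
import re


def extract_features(text):
    """Compute a few simple surface features (hypothetical names, not LFTK's)."""
    # Naive tokenization: lowercase alphabetic tokens; an assumption for this sketch.
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "total_words": n_words,
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
    }


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlate_features(texts, labels):
    """Correlate every extracted feature with a numeric task label."""
    rows = [extract_features(t) for t in texts]
    return {
        name: pearson([row[name] for row in rows], labels)
        for name in rows[0]
    }


if __name__ == "__main__":
    # Toy data: texts with made-up difficulty labels, purely for illustration.
    texts = [
        "The cat sat. The dog ran!",
        "Researchers categorize handcrafted linguistic features systematically.",
        "It is big. It is red. It is fun.",
    ]
    labels = [1.0, 3.0, 1.0]
    for name, r in correlate_features(texts, labels).items():
        print(f"{name}: {r:+.3f}")
```

In practice, a proper linguistic pipeline would replace the regex tokenizer, and a rank-based correlation may suit ordinal labels better than Pearson. Both are design choices this sketch leaves open.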

