Past research has identified a rich set of handcrafted linguistic features that can potentially assist various tasks. However, the sheer number of these features makes it difficult to select and use them effectively. Implementations are also inconsistent across research works, and there is no categorization scheme or generally accepted set of feature names, which causes confusion. In addition, most existing handcrafted feature extraction libraries are either not open-source or not actively maintained. As a result, researchers often have to build an extraction system from the ground up. We collect and categorize more than 220 popular handcrafted features grounded in past literature. Then, we conduct a correlation analysis on several task-specific datasets and report the potential use cases of each feature. Lastly, we devise a multilingual handcrafted linguistic feature extraction system designed to be systematically expandable. We open-source our system to give the public access to a rich set of pre-implemented handcrafted features. Our system is named LFTK and is the largest of its kind. Find it at github.com/brucewlee/lftk.