Past research has identified a rich set of handcrafted linguistic features that can potentially assist various tasks. However, the sheer number of these features makes it difficult to select and use them effectively. Implementations are also inconsistent across research works, and there is no shared categorization scheme or generally accepted set of feature names, which creates confusion. In addition, most existing handcrafted feature extraction libraries are either not open-source or not actively maintained, so researchers often have to build an extraction system from the ground up. We collect and categorize more than 220 popular handcrafted features grounded in past literature. We then conduct a correlation analysis on several task-specific datasets and report the potential use cases of each feature. Lastly, we devise a multilingual handcrafted linguistic feature extraction system designed to be systematically expandable. We open-source our system to give the public access to a rich set of pre-implemented handcrafted features. We name our system LFTK; it is the largest of its kind. It is available at github.com/brucewlee/lftk.
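To make the idea of correlating handcrafted features with a task label concrete, the following is a minimal sketch and not the LFTK implementation or its API. The two features (type-token ratio and average word length), the regex tokenizer, the toy texts, and the toy difficulty labels are all assumptions made up for illustration; the correlation measure used here is plain Pearson correlation, which may differ from the analysis reported in the paper.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Hypothetical feature: unique tokens divided by total tokens."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def average_word_length(text):
    """Hypothetical feature: mean number of characters per token."""
    tokens = tokenize(text)
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; returns 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Toy data invented for this sketch: texts paired with made-up difficulty labels.
texts = [
    "The cat sat on the mat and the cat slept.",
    "A small dog ran across the quiet garden path.",
    "Researchers compare handcrafted linguistic features across datasets.",
    "Multilingual extraction pipelines necessitate systematically expandable architectures.",
]
labels = [1, 2, 3, 4]

# Map each illustrative feature name to the function that computes it.
FEATURES = {
    "type_token_ratio": type_token_ratio,
    "average_word_length": average_word_length,
}

# Correlate every feature's values with the labels.
correlations = {
    name: pearson([extract(t) for t in texts], labels)
    for name, extract in FEATURES.items()
}

for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

Registering features in a dictionary keyed by name is one simple way to keep such a collection expandable: a new feature only requires one more function and one more entry.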