A Feature Engineering Approach for Literary and Colloquial Tamil Speech Classification using 1D-CNN

In ideal human-computer interaction (HCI), most users would prefer the colloquial form of a language, since it is the form they use in day-to-day conversation. However, there is also an undeniable need to preserve the formal literary form. By embracing the new and preserving the old, both service to the common man (practicality) and service to the language itself (conservation) can be rendered. Hence, computers should ideally be able to accept, process, and converse in both forms of the language, as required. To this end, it is first necessary to identify the form of the input speech, which in the current work means distinguishing literary from colloquial Tamil speech. Such a front-end system requires a simple, effective, and lightweight classifier trained on a small set of features that capture the underlying patterns of the speech signal. To accomplish this, a one-dimensional convolutional neural network (1D-CNN) that learns the envelope of features across time is proposed. The network is first trained on a select number of handcrafted features, and then on Mel-frequency cepstral coefficients (MFCCs) for comparison. The handcrafted features were selected to address various aspects of speech, such as spectral and temporal characteristics, prosody, and voice quality. The features are first analyzed by considering ten parallel utterances and observing the trend of each feature over time. The proposed 1D-CNN achieves an F1 score of 0.9803 when trained on the handcrafted features and 0.9895 when trained on the MFCCs. In light of this, feature ablation and feature combination are explored. Combining the best-ranked handcrafted features from the feature ablation study with the MFCCs gives the best result, an F1 score of 0.9946.
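To make the idea of a 1D-CNN that convolves along time over per-frame features concrete, the following is a minimal, dependency-free sketch of a forward pass together with the F1 metric quoted above. It is not the authors' architecture: the abstract gives no layer sizes, so the number of frames (50), features per frame (13), kernel width (5), number of filters (4), the single convolutional layer, the global max pooling, and the random untrained weights are all assumptions chosen only for illustration. The function names are hypothetical as well.

```python
import math
import random

def conv1d_relu(frames, kernels, biases):
    """Valid 1D convolution along the time axis, followed by ReLU.

    frames:  list of T frames, each a list of F feature values.
    kernels: list of C kernels, each a list of K rows of F weights.
    Returns a list of T-K+1 time steps, each with C activations.
    """
    width = len(kernels[0])
    out = []
    for t in range(len(frames) - width + 1):
        window = frames[t:t + width]
        step = []
        for kernel, bias in zip(kernels, biases):
            total = bias
            for w_row, x_row in zip(kernel, window):
                total += sum(w * x for w, x in zip(w_row, x_row))
            step.append(max(0.0, total))  # ReLU
        out.append(step)
    return out

def global_max_pool(activations):
    """Keep the strongest response of each filter over all time steps."""
    return [max(channel) for channel in zip(*activations)]

def predict_proba(frames, kernels, biases, dense_w, dense_b):
    """Probability of class 1 for one utterance (class labelling is assumed)."""
    pooled = global_max_pool(conv1d_relu(frames, kernels, biases))
    z = dense_b + sum(w * p for w, p in zip(dense_w, pooled))
    return 1.0 / (1.0 + math.exp(-z))

def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Assumed sizes, for illustration only.
T, F, K, C = 50, 13, 5, 4
random.seed(0)
frames = [[random.gauss(0.0, 1.0) for _ in range(F)] for _ in range(T)]
kernels = [[[random.uniform(-0.1, 0.1) for _ in range(F)] for _ in range(K)]
           for _ in range(C)]
biases = [0.0] * C
dense_w = [random.uniform(-0.1, 0.1) for _ in range(C)]

activations = conv1d_relu(frames, kernels, biases)
print(len(activations), len(activations[0]))  # T-K+1 time steps, C filters
print(predict_proba(frames, kernels, biases, dense_w, 0.0))
print(f1_score([1, 1, 0, 0], [1, 0, 0, 1]))
```

Each filter spans all features of a short run of consecutive frames, so its output traces how a pattern in the feature trajectories evolves over time. In practice the random `frames` would be replaced by handcrafted features or MFCCs extracted per frame, and the weights would be learned with a deep learning framework.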
