The speech signal is a consummate example of time-series data. The acoustics of the signal change over time, sometimes dramatically. Yet the most common type of comparison we perform in phonetics is between instantaneous acoustic measurements, such as formant values. In the present paper, I discuss the concept of absement as a way to quantify the difference between two time series. I then provide an experimental example of absement applied to phonetic analysis for human and/or computer speech recognition. The experiment is a template-based speech recognition task that uses dynamic time warping to compare the acoustics of recordings of isolated words. The task yielded a recognition accuracy of 57.9%. I discuss the results in terms of using absement as a tool, and in terms of the implications of acoustics-only models of spoken word recognition in which the word is the smallest discrete linguistic unit.
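To make the idea of template-based recognition with dynamic time warping concrete, the following is a minimal sketch and not the implementation used in the experiment. It assumes each recording has already been reduced to a one-dimensional sequence of acoustic values, and it treats the absolute difference accumulated along the optimal warping path as a discrete stand-in for absement (the time integral of displacement). The function names `dtw_cost` and `recognize`, the 1-D features, and the toy templates are all hypothetical choices made for illustration.

```python
import math


def dtw_cost(a, b):
    """Accumulated absolute difference along the optimal DTW alignment.

    Assumption: this sum is used here as a discrete approximation of the
    absement between the two series; the paper's exact measure may differ.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local "displacement" at this step
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def recognize(token, templates):
    """Return the label of the template with the smallest accumulated cost."""
    return min(templates, key=lambda label: dtw_cost(token, templates[label]))


# Hypothetical toy templates: one 1-D contour per word label.
templates = {"word_a": [1, 2, 2, 3], "word_b": [5, 5, 5]}
best = recognize([1, 2, 3], templates)
```

In this toy case the incoming token differs from the `word_a` template only in duration, so warping absorbs the difference and that template is selected. Real recordings would more plausibly be compared frame by frame over multidimensional spectral features, with a vector distance in place of `abs`.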