The Tsetlin Machine (TM) architecture has recently demonstrated effectiveness in Machine Learning (ML), particularly within Natural Language Processing (NLP). It has been used to construct word embeddings from conjunctive propositional clauses, significantly improving our ability to understand and interpret machine-derived decisions. The previous approach computed word embeddings over a sequence of input words, consolidating the information into a single cohesive representation; however, it encounters scalability challenges as the input size increases. In this study, we introduce a novel approach that incorporates two-phase training to discover contextual embeddings of input sequences. Specifically, this method first encapsulates the knowledge of each word in the dataset's vocabulary and then constructs embeddings for sequences of input words from the extracted knowledge. This technique not only enables a scalable model design but also preserves interpretability. Our experimental results show that the proposed method achieves competitive performance compared with previous approaches and promising results against human-generated benchmarks. Furthermore, we applied the proposed approach to sentiment analysis on the IMDB dataset, where the TM embedding and the TM classifier, together with other interpretable classifiers, provide a transparent end-to-end solution with competitive performance.