Characters do not convey meaning, but sequences of characters do. We propose an unsupervised distributional method to learn the abstract meaningful units in a sequence of characters. Rather than segmenting the sequence, our Dynamic Capacity Slot Attention model discovers continuous representations of the objects in the sequence, extending an architecture for object discovery in images. We train our model on different languages and evaluate the quality of the obtained representations with forward and reverse probing classifiers. These experiments show that our model succeeds in discovering units which are similar to those proposed previously in form, content and level of abstraction, and which show promise for capturing meaningful information at a higher level of abstraction.