Code-mixing and script-mixing are prevalent across online social networks and multilingual societies. However, a user's preference for code-mixing depends on the user's socioeconomic status and demographics and on the local context, factors that existing generative models mostly ignore when generating code-mixed texts. In this work, we make a pioneering attempt to develop a persona-aware generative model that generates texts resembling the real-life code-mixed texts of individuals. We propose PARADOX, a Persona-aware Generative Model for Code-mixed Generation: a novel Transformer-based encoder-decoder model that encodes an utterance conditioned on a user's persona and generates code-mixed texts without monolingual reference data. We also propose an alignment module that re-calibrates the generated sequence to resemble real-life code-mixed texts. PARADOX generates code-mixed texts that are semantically more meaningful and linguistically more valid. To evaluate the personification capabilities of PARADOX, we propose four new metrics: CM BLEU, CM Rouge-1, CM Rouge-L, and CM KS. On average, PARADOX achieves a 1.6-point higher CM BLEU, 47% better perplexity, and 32% better semantic coherence than its non-persona-based counterparts.