Product Question Answering (PQA) systems are key in e-commerce applications, providing responses to customers' questions as they shop for products. While existing work on PQA focuses mainly on English, in practice there is a need to support multiple customer languages while leveraging product information available in English. To study this practical industrial task, we present xPQA, a large-scale annotated cross-lingual PQA dataset in 12 languages across 9 branches, and report results on two tasks: (1) candidate ranking, to select the best English candidate containing the information to answer a non-English question; and (2) answer generation, to generate a natural-sounding non-English answer based on the selected English candidate. We evaluate various approaches involving machine translation at runtime or offline, leveraging multilingual pre-trained LMs, and including or excluding xPQA training data. We find that (1) in-domain data is essential, as cross-lingual rankers trained on other domains perform poorly on the PQA task; (2) candidate ranking often prefers runtime-translation approaches, while answer generation prefers multilingual approaches; and (3) translating offline to augment multilingual models helps candidate ranking mainly on languages with non-Latin scripts, and helps answer generation mainly on languages with Latin scripts. Still, there remains a significant performance gap between the English and the cross-lingual test sets.