In this tutorial, we focus on text-to-text generation, a class of natural language generation (NLG) tasks that take a piece of text as input and generate a revision that is improved according to specific criteria (e.g., readability or linguistic style) while largely retaining the original meaning and length of the text. This includes many useful applications, such as text simplification, paraphrase generation, and style transfer. In contrast to text summarization and open-ended text completion (e.g., story generation), the text-to-text generation tasks we discuss in this tutorial are more constrained in terms of semantic consistency and targeted language style. This level of control makes these tasks ideal testbeds for studying the ability of models to generate text that is both semantically adequate and stylistically appropriate. Moreover, these tasks are interesting from a technical standpoint, as they require complex combinations of lexical and syntactic transformations, stylistic control, and adherence to factual knowledge, all at once. With a special focus on text simplification and revision, this tutorial aims to provide an overview of state-of-the-art natural language generation research from four major aspects (Data, Models, Human-AI Collaboration, and Evaluation) and to discuss and showcase several significant recent advances: (1) the use of non-autoregressive approaches; (2) the shift from fine-tuning to prompting with large language models; (3) the development of new learnable metrics and fine-grained human evaluation frameworks; (4) a growing body of studies and datasets on non-English languages; (5) the rise of HCI+NLP+Accessibility interdisciplinary research to create real-world writing assistant systems.