FormaT5: Abstention and Examples for Conditional Table Formatting with Natural Language

Source: arXiv. Comment: Contains an inappropriately sourced conjecture about the parameter count of OpenAI's ChatGPT, taken from www.forbes.com/sites/forbestechcouncil/2023/02/17/is-bigger-better-why-the-chatgpt-vs-gpt-3-vs-gpt-4-battle-is-just-a-family-chat, a citation which was omitted. The authors do not have direct knowledge or verification of this information and relied solely on this article, which may lead to public confusion.

Formatting is an important property in tables for visualization, presentation, and analysis. Spreadsheet software allows users to automatically format their tables by writing data-dependent conditional formatting (CF) rules. Writing such rules is often challenging for users because it requires them to understand and implement the underlying logic. We present FormaT5, a transformer-based model that can generate a CF rule given the target table and a natural language description of the desired formatting logic. We find that user descriptions for these tasks are often under-specified or ambiguous, making it harder for code generation systems to accurately learn the desired rule in a single step. To tackle this problem of under-specification and minimise argument errors, FormaT5 learns to predict placeholders through an abstention objective. These placeholders can then be filled by a second model or, when examples of rows that should be formatted are available, by a programming-by-example system. To evaluate FormaT5 on diverse, real-world scenarios, we create an extensive benchmark of 1053 CF tasks, containing real-world descriptions collected from four different sources. We release our benchmarks to encourage research in this area. Abstention and filling allow FormaT5 to outperform 8 different neural approaches on our benchmarks, both with and without examples. Our results illustrate the value of building domain-specific learning systems.
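
The abstain-then-fill idea can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `<HOLE>` token, the two-operator rule language, and the `fill_hole` helper are hypothetical, and the filler is a simple enumeration over column values rather than the programming-by-example system the authors use. It also assumes the user lists every row that should be formatted.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple, Union

HOLE = "<HOLE>"  # hypothetical placeholder emitted when the model abstains

Rule = Tuple[str, Union[float, str]]  # (operator name, argument or HOLE)

# A tiny, made-up rule language: each operator compares a cell to one argument.
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "GreaterThan": lambda cell, arg: cell > arg,
    "LessThan": lambda cell, arg: cell < arg,
}


def fill_hole(
    rule: Rule, column: Sequence[float], formatted_rows: Sequence[int]
) -> Optional[Rule]:
    """Fill a placeholder argument so the rule formats exactly the example rows.

    Candidate arguments are the distinct values of the column, tried in
    ascending order. Returns None if no candidate reproduces the examples.
    """
    op_name, arg = rule
    if arg != HOLE:
        return rule  # nothing to fill: the argument was predicted directly
    predicate = OPERATORS[op_name]
    wanted = set(formatted_rows)
    for candidate in sorted(set(column)):
        selected = {i for i, cell in enumerate(column) if predicate(cell, candidate)}
        if selected == wanted:
            return (op_name, candidate)
    return None


# "Highlight the high scores" does not say what counts as high, so the
# argument is left as a placeholder and recovered from the example rows.
scores = [12, 45, 78, 30, 91]
print(fill_hole(("GreaterThan", HOLE), scores, formatted_rows=[2, 4]))
```

In this sketch the operator comes from the description while the under-specified threshold is deferred, which mirrors the division of labour the abstract describes: the description fixes the shape of the rule, and the examples pin down the argument.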
