In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively. Join the Unitxt community at https://github.com/IBM/unitxt!