This case study investigates job classification in a real-world setting, where the goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position. We explore multiple approaches to text classification, including supervised methods ranging from traditional models such as Support Vector Machines (SVMs) to state-of-the-art deep learning models such as DeBERTa. We compare these with Large Language Models (LLMs) used in both few-shot and zero-shot classification settings. To adapt the LLMs to this task, we employ prompt engineering, a technique that involves designing prompts to guide the models towards the desired output. Specifically, we evaluate two commercially available, state-of-the-art GPT-3.5-based language models, text-davinci-003 and gpt-3.5-turbo. We also conduct a detailed analysis of how different aspects of prompt engineering affect model performance. Our results show that, with a well-designed prompt, a zero-shot gpt-3.5-turbo classifier outperforms all other models, achieving a 6% increase in Precision@95% Recall compared to the best supervised approach. Furthermore, we observe that the wording of the prompt is a critical factor in eliciting the appropriate "reasoning" in the model, and that seemingly minor aspects of the prompt significantly affect the model's performance.
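For readers unfamiliar with the headline metric, Precision@95% Recall can be read as the best precision attainable at any decision threshold that still recovers at least 95% of the positive postings. The following is a minimal sketch under that reading, not the evaluation code used in this study; the function name `precision_at_recall`, its parameters, and the toy data are illustrative assumptions.

```python
from typing import Sequence


def precision_at_recall(
    labels: Sequence[int],
    scores: Sequence[float],
    min_recall: float = 0.95,
) -> float:
    """Best precision over all score thresholds whose recall is >= min_recall.

    labels: 1 for a positive (e.g. graduate-suitable) posting, 0 otherwise.
    scores: classifier confidence that each posting is positive.
    (Hypothetical helper; names and interface are assumptions.)
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank postings from most to least confident and sweep the threshold down.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    best = 0.0
    tp = fp = 0
    i, n = 0, len(ranked)
    while i < n:
        threshold = ranked[i][0]
        # Consume every posting tied at this score so the threshold is well defined.
        while i < n and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        recall = tp / total_pos
        if recall >= min_recall:
            best = max(best, tp / (tp + fp))
    return best


# Toy data (made up for illustration): three positives, three negatives.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
result = precision_at_recall(toy_labels, toy_scores, min_recall=0.95)
```

On the toy data, reaching 95% recall requires accepting all three positives, which also admits one negative, so the metric is 3/4. Fixing recall first and then reporting precision reflects a setting where missing a suitable posting is costlier than reviewing an unsuitable one.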