This research compares large language model (LLM) fine-tuning methods, including Quantized Low-Rank Adaptation (QLoRA), Retrieval-Augmented Fine-Tuning (RAFT), and Reinforcement Learning from Human Feedback (RLHF), and additionally compares LLM evaluation methods, including the end-to-end (E2E) benchmark method of "golden answers", traditional natural language processing (NLP) metrics, RAG Assessment (Ragas), OpenAI GPT-4 evaluation metrics, and human evaluation, using a travel-chatbot use case. The travel dataset was sourced from the Reddit API by requesting posts from travel-related subreddits to obtain travel-related conversation prompts and personalized travel experiences, and was augmented for each fine-tuning method. We fine-tuned two pretrained LLMs: LLaMa 2 7B and Mistral 7B. QLoRA and RAFT were applied to both pretrained models, and the resulting inferences were extensively evaluated against the aforementioned metrics. The best model according to human evaluation and some GPT-4 metrics was Mistral RAFT, which therefore underwent an RLHF training pipeline and was ultimately evaluated as the best model. Our main findings are that: 1) quantitative and Ragas metrics do not align with human evaluation; 2) OpenAI GPT-4 evaluation aligns most closely with human evaluation; 3) it is essential to keep humans in the loop for evaluation, because 4) traditional NLP metrics are insufficient; 5) Mistral generally outperformed LLaMa; 6) RAFT outperforms QLoRA but still requires postprocessing; and 7) RLHF improves model performance significantly. Next steps include improving data quality, increasing data quantity, exploring RAG methods, and focusing data collection on a specific city, which would improve data quality by narrowing the focus while creating a useful product.