Paper analysis: “Prometheus: Inducing Fine-grained Evaluation Capability in Language Models”


The Why?

Evaluating RAG-based chatbot performance can be challenging, particularly when diverse customer datasets are at play. The central question is: how do we measure the quality of a conversation when the parameters of ‘good’ are so variable? In the previous article, I proposed a simple method: generate synthetic questions from customer data and then measure answer quality with RAGAS. This is a two-fold process in which GPT-4 not only generates responses but also serves as a judge, evaluating their quality against the provided context. While this method is quick to implement and sets a good baseline, it has several limitations.
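For reference, a minimal sketch of that baseline could look like the following. It assumes RAGAS (around version 0.1), an OpenAI API key in the environment, and a handful of synthetic question/answer/context triples, so treat the column names and metrics as illustrative rather than the exact setup from the previous article.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Synthetic questions, the chatbot's answers, and the retrieved contexts
eval_data = Dataset.from_dict({
    "question": ["How do I reset my password?"],
    "answer": ["Go to Settings -> Security and click 'Reset password'."],
    "contexts": [["Passwords can be reset from the Security tab in Settings."]],
})

# RAGAS calls an LLM judge (OpenAI by default) to score each sample on the chosen metrics
result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)
```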

Limitations

The main pain points of this baseline are speed and cost: every evaluation run means another round of GPT-4 calls. To mitigate these issues, we adopted parallel execution and threading to speed up the process, and to manage costs we limited evaluations to periods of significant change.

Is there a better approach?

Several papers try to address the limitations discussed above.

I was particularly inspired by “Prometheus: Inducing Fine-grained Evaluation Capability in Language Models”. The cornerstone of this paper is its approach to generating an evaluation dataset from fine-grained score rubrics and fine-tuning a local LLM on it. This makes it possible to build a specialized, open-source judge LLM tuned for nuanced assessment.

An open-source LLM as a judge

The authors claim that an evaluator trained this way can match GPT-4 as a judge. This might sound ambitious, but it rests on the principle that specialized models can outperform more generalist ones on specific tasks.

Data Structure

The Evaluation Model in Action

We start with Model 1, which could be our chatbot. The evaluation process begins with specific instructions that set the context for the chatbot’s task. These instructions are accompanied by evaluation criteria and a scoring system from 1 to 5, with a description for each score (for example, reflecting the degree of emotional intelligence and respectfulness in the response).

Here’s a step-by-step breakdown:

  1. Instruction Input: An instruction is provided to the chatbot, which could be a request for information or a customer service inquiry.
  2. Criteria Definition: Accompanying the instruction is a set of criteria that outlines the expectations for the chatbot’s response.
  3. Original Response: The chatbot processes the instruction, user input, and scoring criteria, and responds accordingly.
  4. Evaluation by the Judge LLM: A third-party evaluator, the JudgeLLM, then assesses the chatbot’s response against the criteria using the score rubric. It provides written feedback on the response and assigns a score based on the predefined criteria.
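To make the data structure concrete, here is a minimal sketch of what a single evaluation record might look like. The field names are my own illustration, not the exact schema of the paper’s Feedback Collection dataset.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ScoreRubric:
    criteria: str                       # the aspect of the response being judged
    score_descriptions: Dict[int, str]  # what an answer at each score (1-5) looks like

@dataclass
class EvaluationRecord:
    instruction: str             # the user request sent to the chatbot
    response: str                # the chatbot's original answer
    reference_answer: str        # a score-5 target answer
    rubric: ScoreRubric          # fine-grained scoring criteria
    feedback: str = ""           # written by the judge LLM
    score: Optional[int] = None  # 1-5, assigned by the judge LLM

record = EvaluationRecord(
    instruction="A customer is upset about a delayed order. Write a reply.",
    response="Sorry about that, it should arrive soon.",
    reference_answer="I'm very sorry for the delay... (an empathetic, specific reply)",
    rubric=ScoreRubric(
        criteria="Does the reply show emotional intelligence and respect?",
        score_descriptions={  # only a few levels shown for brevity
            1: "Dismissive or disrespectful.",
            3: "Polite but generic, with little empathy.",
            5: "Empathetic, respectful, and addresses the concern concretely.",
        },
    ),
)
```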

Main Considerations During Dataset Construction

To construct a robust dataset, we focus on:

  1. Comprehensive Reference Material: Incorporating a wide array of examples to cover various scenarios.
  2. Uniform Reference Length: Ensuring that reference answers are of consistent length. This matters because LLM judges tend to give longer answers higher scores even when the content is not better.
  3. Balanced Score Distribution: Generating a dataset in which every score is well represented and carries a meaningful description: score 1 describes a poor answer, quality improves progressively with each level, and score 5 corresponds to the target answer. (A small sanity check for points 2 and 3 is sketched after this list.)
  4. Scope Limitation: Focusing on realistic situations where a user interacts with a chatbot, ensuring that the instructions and responses are relevant and practical.
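As a sanity check on points 2 and 3, a short script like the one below can flag length bias and skewed score distributions in a generated dataset. It reuses the fields of the illustrative EvaluationRecord sketched earlier.

```python
from collections import Counter
from statistics import mean, pstdev

def dataset_sanity_check(records):
    """Report reference-answer length spread and score balance for a generated dataset."""
    # Point 2: reference answers should have roughly uniform length,
    # since LLM judges tend to reward longer answers regardless of quality.
    lengths = [len(r.reference_answer.split()) for r in records]
    print(f"reference length (words): mean={mean(lengths):.1f}, std={pstdev(lengths):.1f}")

    # Point 3: each score from 1 to 5 should be represented roughly equally.
    scores = Counter(r.score for r in records if r.score is not None)
    for s in range(1, 6):
        print(f"score {s}: {scores.get(s, 0)} examples")
```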

The Four-Step Process to Constructing a Robust Evaluation Dataset

Creating an evaluation dataset from scratch can be a daunting task. However, the methodology introduced in the paper offers a structured and scalable approach. Let’s dive into the four-step process they’ve outlined.

Step 1: Seed Rubrics Creation

Step 2: Rubric Expansion

Step 3: Generating Instructions and Reference Answers

Step 4: Response and Feedback Generation
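Since the steps are only summarized here, the sketch below shows how the pipeline could be wired together. The prompts are paraphrased rather than copied from the paper, and call_gpt4 is my own small helper around the OpenAI chat API, so adjust both before using them.

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def call_gpt4(prompt: str) -> str:
    """Small helper around the OpenAI chat API (model name is an assumption)."""
    reply = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return reply.choices[0].message.content

def build_feedback_dataset(seed_rubrics):
    dataset = []
    # Step 2: expand the hand-written seed rubrics into a larger, more diverse set.
    rubrics = [
        call_gpt4(f"Write a new fine-grained score rubric similar to:\n{seed}")
        for seed in seed_rubrics
    ]
    for rubric in rubrics:
        # Step 3: generate a realistic instruction and a score-5 reference answer.
        instruction = call_gpt4(f"Write a user instruction that this rubric could evaluate:\n{rubric}")
        reference = call_gpt4(f"Write a score-5 answer for:\n{instruction}\nRubric:\n{rubric}")

        # Step 4: generate one response per score level plus the matching feedback.
        for score in range(1, 6):
            response = call_gpt4(f"Write a score-{score} answer for:\n{instruction}\nRubric:\n{rubric}")
            feedback = call_gpt4(f"Explain why this answer deserves score {score}:\n{response}")
            dataset.append({
                "instruction": instruction,
                "reference_answer": reference,
                "rubric": rubric,
                "response": response,
                "feedback": feedback,
                "score": score,
            })
    return dataset
```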

Fine-tuning an evaluator LM

The original code can be found here. The following output format (already processed into the ‘output’ field of the training data) was used to train the evaluator LM:

{orig_feedback} [RESULT] {orig_score}

Then, during evaluation, the predicted score is parsed from the text that follows the [RESULT] phrase.
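A minimal parsing helper could look like this (my own sketch, not the authors’ exact code):

```python
import re
from typing import Optional, Tuple

def parse_judge_output(generation: str) -> Tuple[str, Optional[int]]:
    """Split the evaluator LM output into feedback text and an integer score.

    The model is trained to emit "{feedback} [RESULT] {score}", so everything
    before the marker is feedback and the first digit after it is the score.
    """
    feedback, _, tail = generation.partition("[RESULT]")
    match = re.search(r"[1-5]", tail)
    score = int(match.group()) if match else None
    return feedback.strip(), score

# Example
feedback, score = parse_judge_output("The answer is empathetic and complete. [RESULT] 5")
```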

Finetuning

Evaluation Datasets

Results

Summary

Code

To reproduce the paper’s results, we can follow the original code. However, I prefer the implementations using Langchain, which can be found here and here. The authors ran fine-tuning on their own clusters; I experimented with two platforms that are both user-friendly and reasonably priced: OpenPipe and TogetherAI.
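If you want to run the released evaluator locally, a sketch along the following lines should work with Hugging Face transformers. The checkpoint name and the prompt template are assumptions on my part: verify the exact model id on the Hugging Face Hub and use the template from the authors’ repository for real evaluations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "kaist-ai/prometheus-13b-v1.0"  # assumed checkpoint name; verify on the HF Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Placeholder inputs; in practice these come from your chatbot logs and rubric store.
instruction = "A customer is upset about a delayed order. Write a reply."
response = "Sorry about that, it should arrive soon."
reference_answer = "I'm very sorry for the delay... (an empathetic, specific reply)"
rubric = "Does the reply show emotional intelligence and respect? (1 = dismissive ... 5 = empathetic and concrete)"

# Paraphrased evaluation prompt; the real template in the authors' repo is more detailed.
prompt = (
    "###Task: Evaluate the response against the rubric, write feedback, "
    "then output '[RESULT]' followed by a score from 1 to 5.\n"
    f"###Instruction: {instruction}\n"
    f"###Response to evaluate: {response}\n"
    f"###Reference answer: {reference_answer}\n"
    f"###Score rubric: {rubric}\n"
    "###Feedback:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
feedback, score = parse_judge_output(text)  # helper sketched in the fine-tuning section
```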

References