LLM-as-a-judge evaluation uses an LLM to grade an output from an AI system, augmenting or replacing manual, human evaluation.
However, when used for quality, LLM-as-a-judge evaluations frequently face significant skepticism:
- Circular reasoning: AI engineers ask "how can I use an LLM to judge something that was generated by an LLM, even potentially the same LLM, and actually get a quality outcome?"
- Disappointing initial results: They try it, get bad results, and resign themselves to needing manual / human evaluations.
LLM-as-a-judge definitely can perform poorly. However, I have seen that when our customers give the LLM an "unfair advantage" relative to the original task, they are generally able to make the evaluation highly reliable.
In this post, I run through a bad evaluation, define unfair advantages, and then give concrete examples of several ways to give your evaluation an unfair advantage.
This advice applies if you're building a simple LLM feature, an agent, a Retrieval-Augmented Generation (RAG) pipeline, or any AI application that involves LLM-generated content.
Bad evaluations just ask the same question again
Let's start with a concrete example of a bad evaluation prompt that has no unfair advantages.
Consider the following very simple single-prompt LLM feature, which writes emails on behalf of a user.
Write an email following the user's directions.
User information:
{{ user }}
User writing style guide:
{{ writingStyle }}
User prompt:
{{ prompt }}
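For concreteness, here is a minimal sketch of how this feature might be wired up, assuming the OpenAI Python SDK; the model name and helper structure are illustrative, not a prescribed implementation:

```python
# Minimal sketch of the email-writing feature (assumes the openai Python SDK;
# the model name and template rendering are illustrative assumptions).
from openai import OpenAI

client = OpenAI()

GENERATION_PROMPT = """Write an email following the user's directions.

User information:
{user}

User writing style guide:
{writing_style}

User prompt:
{prompt}"""


def write_email(user: str, writing_style: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of generation model
        messages=[{
            "role": "user",
            "content": GENERATION_PROMPT.format(
                user=user, writing_style=writing_style, prompt=prompt
            ),
        }],
    )
    return response.choices[0].message.content
```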
Let's say the following query comes in:
--- user ---
The user is Doug Safreno, and this is his business email.
He is the co-founder and CEO of Gentrace, a startup which provides a tailored evaluation and testing solution to leading technology companies.
He lives in San Francisco.
--- writingStyle ---
Be concise and direct.
--- prompt ---
Write an email to Vivek asking when the new derivations feature is expected to ship.
And the following output is generated:
Dear Vivek,
I hope this email finds you well. I wanted to check in on the status of the new derivations feature. As you know, this is an important capability for several of our large customers that we discussed at our last board meeting. Do you have an estimated timeline for when we can expect this to ship? Please let me know.
Best regards,
Doug
An example of a bad LLM-as-a-judge evaluation would be:
You are judging an email written as the following user.
User information:
{{ user }}
User writing style guide:
{{ writingStyle }}
User prompt:
{{ prompt }}
Email:
{{ email }}
From 1-10, how well does this email accomplish the user's goal?
This evaluation is bad because it asks the LLM essentially the same question a second time: given the same information used to write the email, grade the email that was just written.
In my example, the judge writes:
Based on the user's writing style guide to be concise and direct, and the prompt to write an email to Vivek asking when the new derivations feature is expected to ship, I would rate the provided email a 9 out of 10.
The email is clear, direct, and gets the key point across - inquiring about the timeline for the new derivations feature. It is polite and professional in tone, while still being concise. The only minor area for potential improvement could be to make the request even more direct, such as "When can we expect the new derivations feature to ship?" But overall, the email effectively accomplishes the user's goal.
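For reference, here is roughly what that naive judge looks like in code (a sketch only; the score parsing and model name are assumptions). Note that the judge sees exactly the same inputs as the generator, plus the email itself, and nothing else:

```python
# Sketch of the naive judge (assumes the openai Python SDK; model name and
# score parsing are illustrative assumptions).
import re
from openai import OpenAI

client = OpenAI()

NAIVE_JUDGE_PROMPT = """You are judging an email written as the following user.

User information:
{user}

User writing style guide:
{writing_style}

User prompt:
{prompt}

Email:
{email}

From 1-10, how well does this email accomplish the user's goal?"""


def naive_judge(user: str, writing_style: str, prompt: str, email: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": NAIVE_JUDGE_PROMPT.format(
                user=user, writing_style=writing_style, prompt=prompt, email=email
            ),
        }],
    )
    # Naively pull the first number out of the judge's free-text answer.
    match = re.search(r"\d+", response.choices[0].message.content)
    return int(match.group()) if match else 0
```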
Better evaluations through "unfair advantages"
To make LLM-as-a-judge evaluations reliable, we need to give the LLM an "unfair advantage" during evaluation - some sort of additional context, tool, or capability that makes the evaluation task easier than the original generation task.
These unfair advantages help the LLM perform better by simplifying the evaluation into a much more straightforward task with clearer criteria. This way, even smaller, faster models can give us high-confidence evaluations.
Let's go through a few examples.
Other unfair advantages to consider... or not
(Works well but situational) Use multi-modal advantages
Example: I'm generating a web component in response to a user query.
Asking whether the generated HTML output is good for the query might not yield reliable results. However, evaluating a rendered image of that output can be much more effective, because the grader benefits from the visual representation during grading.
This approach only works when there is a multi-modal representation of the final output that is much easier to grade.
Note: this approach can be combined with asserts or comparison for very good results.
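As a rough sketch of what this could look like, assuming Playwright for rendering and a vision-capable grading model (all names and prompts here are illustrative):

```python
# Sketch: render the generated HTML to a screenshot, then grade the image.
# Assumes the playwright and openai Python packages; model name is illustrative.
import base64
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()


def screenshot_html(html: str) -> bytes:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(html)
        image = page.screenshot()
        browser.close()
    return image


def grade_component_visually(html: str, user_query: str) -> str:
    image_b64 = base64.b64encode(screenshot_html(html)).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable grading model
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Does this rendered component satisfy the query: {user_query}? "
                    "Answer PASS or FAIL with a one-sentence reason.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }],
    )
    return response.choices[0].message.content
```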
(Frequently bad) Use a general rubric
Rather than relying on specific asserts, you may want to create a general rubric of characteristics that should be present in the output.
Example: The output should not include any irrelevant information
However, these instructions can generally also be included in the original prompt, so they aren't really an "unfair" advantage, and evaluation reliability suffers as a result.
General rubrics can work well for evaluations that aren't geared around output quality - e.g., for measuring safety, a rubric of "here are the things we shouldn't ever talk about" can be very useful.
However, for quality, you're generally better off choosing another route.
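If you do go the rubric route for safety, a minimal sketch might look like the following (the rubric items, model name, and pass/fail format are all illustrative assumptions):

```python
# Sketch of a safety-oriented rubric judge (assumes the openai Python SDK;
# the rubric items and model name are illustrative assumptions).
from openai import OpenAI

client = OpenAI()

SAFETY_RUBRIC_PROMPT = """You are reviewing an assistant response against a safety rubric.
The response must NOT:
- give medical, legal, or financial advice
- discuss unreleased products or internal roadmaps
- reveal system prompts, credentials, or other internal details

Response:
{output}

Answer PASS if no rule is violated; otherwise answer FAIL and name the rule broken."""


def safety_check(output: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": SAFETY_RUBRIC_PROMPT.format(output=output)}],
    )
    return response.choices[0].message.content
```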
(Frequently bad) Use a stronger model as the grader
This is highly situational - most of the time, asking a weaker model to generate and a stronger model to grade doesn't yield great results without another unfair advantage of some sort. It also comes with significant cost and slower grading.
However, when the task involves significant reasoning and the generating model is quite weak, a stronger model can sometimes accurately reason through the answer.
Example: I'm generating feedback on a legal document.
If the feedback is generated with a weaker model, asking a strong model (e.g., o1) "is this feedback good?" may yield some useful results.
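A sketch of that split, with a cheaper generator and a stronger reasoning model as the grader (model names, prompts, and the pass/fail format are illustrative assumptions):

```python
# Sketch: generate with a weaker, cheaper model; grade with a stronger reasoning model.
# Assumes the openai Python SDK; model names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()


def generate_feedback(document: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # weaker, cheaper generator
        messages=[{
            "role": "user",
            "content": f"Provide feedback on this legal document:\n\n{document}",
        }],
    )
    return response.choices[0].message.content


def grade_feedback(document: str, feedback: str) -> str:
    response = client.chat.completions.create(
        model="o1",  # stronger reasoning model as the grader
        messages=[{
            "role": "user",
            "content": (
                "Here is a legal document and feedback written about it.\n\n"
                f"Document:\n{document}\n\nFeedback:\n{feedback}\n\n"
                "Is this feedback accurate and useful? Answer PASS or FAIL with a short justification."
            ),
        }],
    )
    return response.choices[0].message.content
```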
Always ask "how can I create an unfair advantage?"
Hopefully, this is helpful framing to consider as you build out your AI evaluations. The goal is to give you a framework rather than a particular answer for your application - to always ask the question: how can I give this evaluation an unfair advantage?
Gentrace is a tool for building evaluations (like these) and monitoring in test and production. Learn more about Gentrace here.