Why Evaluation Matters
You've deployed your LLM application, traces are flowing in, and everything looks healthy — until a user screenshots a completely wrong answer and posts it on Twitter. Sound familiar?
The gap between "model returns a response" and "model returns a good response" is where evaluation comes in. Without systematic evaluation, you're flying blind.
The Three Pillars of LLM Evaluation
1. LLM-as-Judge
The fastest way to get started. Use a separate LLM to score your application's outputs against defined criteria.
```python
from brokle import Brokle

brokle = Brokle()

with brokle.trace("customer-support") as trace:
    response = generate_response(user_query)

    # Score with LLM-as-judge
    trace.evaluate(
        name="relevance",
        evaluator="llm",
        criteria="Is the response relevant to the user's question?",
        output=response,
    )
```

When to use it: Early-stage quality monitoring, broad coverage across many interactions, catching obvious failures.
Watch out for: Judge model biases, cost at scale, and the meta-problem of evaluating the evaluator.
2. Custom Evaluators
Rule-based checks that encode your domain knowledge. These are deterministic, fast, and free.
```python
# Check response length constraints
trace.evaluate(
    name="length-check",
    evaluator="custom",
    score=1.0 if 50 < len(response.split()) < 500 else 0.0,
)

# Verify no PII leakage
trace.evaluate(
    name="pii-check",
    evaluator="custom",
    score=0.0 if contains_pii(response) else 1.0,
)
```

When to use it: Hard constraints (length, format, safety), compliance requirements, known failure modes.
3. Human Feedback
The gold standard. Collect explicit ratings or implicit signals from real users.
```python
# Thumbs up/down from user
trace.evaluate(
    name="user-feedback",
    evaluator="human",
    score=1.0,  # user clicked thumbs up
    comment="Helpful and accurate",
)
```

When to use it: Calibrating automated evaluators, edge cases where automated scoring fails, building labeled datasets for fine-tuning.
Building Your Evaluation Pipeline
A good evaluation pipeline layers all three approaches:
- Custom evaluators run on every trace — they're cheap and catch known issues
- LLM-as-judge samples a percentage of traces for deeper quality analysis
- Human feedback provides ground truth for the hardest cases
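The layering above boils down to a small piece of routing logic. A minimal sketch, where `JUDGE_SAMPLE_RATE` and the layer names are assumptions standing in for your own pipeline configuration:

```python
import random

# Score 10% of traces with the LLM judge (hypothetical rate; tune to your budget)
JUDGE_SAMPLE_RATE = 0.10

def evaluation_layers(custom_scores: dict) -> list[str]:
    """Decide which evaluation layers a trace passes through."""
    layers = ["custom"]  # deterministic checks run on every trace
    if random.random() < JUDGE_SAMPLE_RATE:
        layers.append("llm-judge")  # sampled deeper quality analysis
    if any(score == 0.0 for score in custom_scores.values()):
        layers.append("human-review")  # failed cheap checks -> ground truth
    return layers
```

The key design choice is ordering by cost: free deterministic checks everywhere, paid judge calls on a sample, and scarce human attention only where the cheaper layers flag a problem.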
```
Every trace → Custom evaluators (deterministic checks)
     ↓
10% sample  → LLM-as-judge (quality scoring)
     ↓
Edge cases  → Human review (ground truth)
```

Setting Quality Thresholds
Don't just collect scores — act on them. Set up alerts in Brokle when quality drops below your thresholds:
- Relevance score < 0.7: Investigate prompt drift or retrieval issues
- Safety score < 1.0: Immediate review — potential harmful output
- User feedback < 0.8: Review recent prompt or model changes
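Those rules translate directly into a threshold table. A minimal sketch of the check itself, assuming scores keyed by evaluator name; the alerting transport (Brokle alerts, Slack, PagerDuty) is up to your stack:

```python
# Hypothetical threshold table mirroring the alert rules above
THRESHOLDS = {
    "relevance": 0.7,
    "safety": 1.0,
    "user-feedback": 0.8,
}

def check_thresholds(scores: dict[str, float]) -> list[str]:
    """Return an alert message for every score below its threshold."""
    return [
        f"{name} score {scores[name]:.2f} below threshold {minimum}"
        for name, minimum in THRESHOLDS.items()
        if name in scores and scores[name] < minimum
    ]
```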
What We Learned
After running evaluation pipelines across hundreds of thousands of traces, here's what surprised us:
- Most quality issues are retrieval problems, not generation problems. The model is usually fine — it's getting bad context.
- LLM-as-judge agrees with humans ~85% of the time. Good enough for monitoring, not good enough for final decisions.
- Custom evaluators catch 60% of issues despite being the simplest approach. Don't skip them.
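That ~85% figure is plain pairwise agreement between judge and human labels on the same traces, which is worth computing on your own data before trusting the judge. A sketch, assuming binary labels collected for traces scored by both:

```python
def agreement_rate(judge: list[float], human: list[float]) -> float:
    """Fraction of traces where the judge and the human gave the same label."""
    assert len(judge) == len(human) and judge, "need paired, non-empty labels"
    matches = sum(1 for j, h in zip(judge, human) if j == h)
    return matches / len(judge)
```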
Get Started
Brokle's evaluation system is available in both our Python SDK and JavaScript SDK. Check out the evaluation docs for the full API reference and more examples.