# Built-in Evaluators

Use pre-built evaluators for common quality assessments including relevance, accuracy, toxicity, and more.
Brokle provides pre-built evaluators for common AI quality assessments. These evaluators are ready to use and cover standard quality dimensions like relevance, accuracy, and safety.
## Available Evaluators
| Evaluator | Type | Description |
|---|---|---|
| `relevance` | LLM | How well the output addresses the input |
| `accuracy` | LLM | Factual correctness of claims |
| `helpfulness` | LLM | How useful the response is |
| `coherence` | LLM | Logical flow and structure |
| `groundedness` | LLM | Alignment with provided context |
| `toxicity` | Model | Harmful or offensive content detection |
| `pii_detection` | Regex/ML | Personal information detection |
| `json_validity` | Programmatic | Valid JSON structure check |
| `length` | Programmatic | Response length metrics |
## Quick Start
```python
from brokle import Brokle
from brokle.evaluation import evaluate

client = Brokle()

# Evaluate a single output
result = evaluate(
    evaluator="relevance",
    input="What is machine learning?",
    output="Machine learning is a subset of AI that enables systems to learn from data."
)

print(f"Score: {result.score}")
print(f"Reasoning: {result.reasoning}")
```

```typescript
import { Brokle } from 'brokle';
import { evaluate } from 'brokle/evaluation';

const client = new Brokle();

// Evaluate a single output
const result = await evaluate({
  evaluator: 'relevance',
  input: 'What is machine learning?',
  output: 'Machine learning is a subset of AI that enables systems to learn from data.'
});

console.log(`Score: ${result.score}`);
console.log(`Reasoning: ${result.reasoning}`);
```

## Quality Evaluators
### Relevance
Measures how directly the output addresses the input query.
```python
result = evaluate(
    evaluator="relevance",
    input="How do I reset my password?",
    output="To reset your password, click 'Forgot Password' on the login page, enter your email, and follow the instructions sent to your inbox."
)

# Result: score=0.95, reasoning="Response directly addresses the password reset question with clear steps"
```

```typescript
const result = await evaluate({
  evaluator: 'relevance',
  input: 'How do I reset my password?',
  output: "To reset your password, click 'Forgot Password' on the login page, enter your email, and follow the instructions sent to your inbox."
});

// Result: score=0.95, reasoning="Response directly addresses the password reset question with clear steps"
```

**Score interpretation:**
- 0.9-1.0: Perfectly relevant, directly addresses the query
- 0.7-0.9: Mostly relevant with minor tangents
- 0.5-0.7: Partially relevant, some off-topic content
- 0.0-0.5: Largely irrelevant to the query
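These ranges map naturally onto simple gating logic. A minimal sketch using the `evaluate` API shown above; the thresholds and verdict labels are illustrative choices, not library defaults:

```python
from brokle.evaluation import evaluate

result = evaluate(
    evaluator="relevance",
    input="How do I reset my password?",
    output="Our mobile app is available on iOS and Android."
)

# Illustrative thresholds based on the ranges above; tune them for your application
if result.score >= 0.9:
    verdict = "pass"             # perfectly relevant
elif result.score >= 0.7:
    verdict = "pass_with_note"   # mostly relevant, minor tangents
else:
    verdict = "flag_for_review"  # partially or largely irrelevant

print(verdict, result.reasoning)
```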
### Accuracy
Evaluates factual correctness of claims in the output.
```python
result = evaluate(
    evaluator="accuracy",
    input="When was Python released?",
    output="Python was first released in 1991 by Guido van Rossum.",
    context="Python is a programming language created by Guido van Rossum. Version 0.9.0 was released in February 1991."
)

# Result: score=0.98, reasoning="Date and creator are factually correct per the context"
```

```typescript
const result = await evaluate({
  evaluator: 'accuracy',
  input: 'When was Python released?',
  output: 'Python was first released in 1991 by Guido van Rossum.',
  context: 'Python is a programming language created by Guido van Rossum. Version 0.9.0 was released in February 1991.'
});

// Result: score=0.98, reasoning="Date and creator are factually correct per the context"
```

For best accuracy evaluation, provide `context` with source material to verify claims against.
### Helpfulness
Assesses how useful and actionable the response is.
```python
result = evaluate(
    evaluator="helpfulness",
    input="I'm getting a TypeError in my Python code",
    output="A TypeError occurs when you try to perform an operation on incompatible types. To fix this:\n1. Check the error message for the line number\n2. Verify the types of variables involved\n3. Use type() or isinstance() to debug\n4. Add type conversions where needed"
)

# Result: score=0.92, reasoning="Provides clear explanation and actionable debugging steps"
```

### Coherence
Measures logical flow, structure, and readability.
```python
result = evaluate(
    evaluator="coherence",
    input="Explain how APIs work",
    output="APIs (Application Programming Interfaces) are... Then the server processes... First, the client sends... Finally, the response..."
)

# Result: score=0.65, reasoning="Content is relevant but structure is disorganized with out-of-order explanations"
```

### Groundedness
For RAG applications: checks if the output is supported by provided context.
```python
result = evaluate(
    evaluator="groundedness",
    input="What are our refund policies?",
    output="Refunds are processed within 5-7 business days. You can request a refund through the app or by contacting support.",
    context="Our refund policy: Customers can request refunds within 30 days of purchase. Refunds are processed within 5-7 business days. Contact support@example.com or use the in-app refund request feature."
)

# Result: score=0.95, reasoning="All claims are directly supported by the provided context"
```

## Safety Evaluators
### Toxicity
Detects harmful, offensive, or inappropriate content.
```python
result = evaluate(
    evaluator="toxicity",
    output="Here's how to create a simple Python function that adds two numbers..."
)

# Result: score=0.01, reasoning="No toxic content detected"

result = evaluate(
    evaluator="toxicity",
    output="That's a stupid question..."
)

# Result: score=0.72, reasoning="Contains dismissive and potentially offensive language"
```

```typescript
const result = await evaluate({
  evaluator: 'toxicity',
  output: "Here's how to create a simple Python function that adds two numbers..."
});

// Result: score=0.01, reasoning="No toxic content detected"
```

For toxicity, lower scores are better. A score of 0.0 means no toxicity was detected.
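Because this scale is inverted relative to the quality evaluators, gate on an upper bound. A minimal sketch; the 0.5 cutoff is an illustrative choice, not a library default:

```python
from brokle.evaluation import evaluate

TOXICITY_THRESHOLD = 0.5  # illustrative cutoff; tune for your tolerance

result = evaluate(
    evaluator="toxicity",
    output="That's a stupid question..."
)

# Unlike relevance or helpfulness, a HIGH toxicity score is bad
if result.score > TOXICITY_THRESHOLD:
    print(f"Blocked: {result.reasoning}")
else:
    print("Response passed the toxicity check")
```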
### PII Detection
Identifies personally identifiable information (PII) in outputs.
```python
result = evaluate(
    evaluator="pii_detection",
    output="Please contact John Smith at john.smith@example.com or call 555-123-4567"
)

# Result: score=0.85, reasoning="Detected email address and phone number"
# metadata: {"pii_types": ["email", "phone"], "locations": [...]}
```

**Detected PII types:**
- Email addresses
- Phone numbers
- Social Security Numbers
- Credit card numbers
- Names (when context suggests PII)
- Addresses
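To act on these detections in code, inspect the result's metadata. A minimal sketch, assuming the `pii_types` metadata field shown in the example above:

```python
from brokle.evaluation import evaluate

result = evaluate(
    evaluator="pii_detection",
    output="Please contact John Smith at john.smith@example.com or call 555-123-4567"
)

# `pii_types` mirrors the metadata shape shown above; treat the key name as illustrative
detected = result.metadata.get("pii_types", [])

if detected:
    print(f"PII found ({', '.join(detected)}); redact before logging or display")
```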
## Programmatic Evaluators
### JSON Validity
Checks if the output is valid JSON.
```python
result = evaluate(
    evaluator="json_validity",
    output='{"name": "Alice", "age": 30}'
)

# Result: score=1.0, reasoning="Valid JSON object"

result = evaluate(
    evaluator="json_validity",
    output='{"name": "Alice", age: 30}'  # Missing quotes around the key
)

# Result: score=0.0, reasoning="Invalid JSON: Expecting property name enclosed in double quotes"
```

### Length Evaluator
Measures response length against targets.
```python
result = evaluate(
    evaluator="length",
    output="This is a short response.",
    config={
        "min_length": 50,
        "max_length": 500,
        "target_length": 200
    }
)

# Result: score=0.3, reasoning="Response length (26 chars) is below minimum (50)"
```

## Batch Evaluation
Evaluate multiple outputs efficiently:
```python
from brokle.evaluation import evaluate_batch

outputs = [
    {"input": "Question 1", "output": "Answer 1"},
    {"input": "Question 2", "output": "Answer 2"},
    {"input": "Question 3", "output": "Answer 3"},
]

results = evaluate_batch(
    evaluator="relevance",
    items=outputs
)

for result in results:
    print(f"Score: {result.score}")
```

```typescript
import { evaluateBatch } from 'brokle/evaluation';

const outputs = [
  { input: 'Question 1', output: 'Answer 1' },
  { input: 'Question 2', output: 'Answer 2' },
  { input: 'Question 3', output: 'Answer 3' },
];

const results = await evaluateBatch({
  evaluator: 'relevance',
  items: outputs
});

results.forEach(result => {
  console.log(`Score: ${result.score}`);
});
```

## Multiple Evaluators
Run multiple evaluators on the same output:
```python
from brokle.evaluation import evaluate_multi

scores = evaluate_multi(
    evaluators=["relevance", "helpfulness", "toxicity"],
    input="How do I fix this bug?",
    output="Here's how to debug your issue: First, check the error logs..."
)

for name, result in scores.items():
    print(f"{name}: {result.score}")

# relevance: 0.88
# helpfulness: 0.92
# toxicity: 0.02
```

## Attaching to Traces
Automatically score traces during or after creation:
```python
# During trace creation
with client.start_as_current_span(name="chat") as span:
    response = llm.generate(prompt)

    # Evaluate and attach scores
    scores = evaluate_multi(
        evaluators=["relevance", "helpfulness"],
        input=prompt,
        output=response
    )

    for name, result in scores.items():
        span.score(name=name, value=result.score, comment=result.reasoning)

# After trace creation
client.evaluate_trace(
    trace_id="trace_123",
    evaluators=["relevance", "accuracy", "toxicity"]
)
```

## Configuration Options
### Model Selection
Choose which model powers LLM-based evaluators:
```python
result = evaluate(
    evaluator="relevance",
    input="...",
    output="...",
    config={
        "model": "gpt-4o",  # Default
        "temperature": 0.0
    }
)
```

### Custom Criteria
Add specific evaluation criteria:
```python
result = evaluate(
    evaluator="relevance",
    input="...",
    output="...",
    config={
        "criteria": [
            "Must mention pricing",
            "Should include call-to-action",
            "Must be under 100 words"
        ]
    }
)
```

### Scoring Mode
Control score output format:
```python
# Continuous score (default)
result = evaluate(evaluator="relevance", ..., config={"scoring": "continuous"})
# score: 0.87

# Binary score
result = evaluate(evaluator="relevance", ..., config={"scoring": "binary"})
# score: 1 (pass) or 0 (fail)

# Categorical score
result = evaluate(evaluator="relevance", ..., config={"scoring": "categorical"})
# score: "high", "medium", or "low"
```

## Evaluator Response
All evaluators return a consistent response structure:
```python
@dataclass
class EvaluationResult:
    score: float        # Numeric score (0.0-1.0)
    reasoning: str      # Explanation of the score
    metadata: dict      # Additional details
    evaluator: str      # Evaluator name
    model: str | None   # Model used (for LLM evaluators)
    latency_ms: int     # Evaluation time in milliseconds
```

## Performance Tips
### 1. Batch Similar Evaluations
```python
# Good: Batch processing
results = evaluate_batch(evaluator="relevance", items=all_outputs)

# Bad: Individual calls in a loop
for output in all_outputs:
    result = evaluate(evaluator="relevance", ...)  # Slower
```

### 2. Use Appropriate Evaluators
| Need | Evaluator | Speed |
|---|---|---|
| Format check | `json_validity`, `length` | Fast (1ms) |
| Pattern match | `pii_detection` | Fast (5ms) |
| Quality assessment | `relevance`, `helpfulness` | Moderate (500ms) |
| Complex reasoning | `accuracy` with context | Slower (1-2s) |
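One way to apply this table is to run the cheap programmatic checks first and spend an LLM call only when they pass. A sketch; the helper name and thresholds are illustrative:

```python
from brokle.evaluation import evaluate

def check_structured_response(question: str, output: str) -> dict:
    """Illustrative layered check: fast gates first, LLM evaluator last."""
    # 1. Programmatic gate (~1ms): reject malformed JSON immediately
    json_check = evaluate(evaluator="json_validity", output=output)
    if json_check.score < 1.0:
        return {"passed": False, "reason": json_check.reasoning}

    # 2. Pattern gate (~5ms): block outputs that leak PII
    pii_check = evaluate(evaluator="pii_detection", output=output)
    if pii_check.score > 0.5:  # illustrative threshold
        return {"passed": False, "reason": pii_check.reasoning}

    # 3. LLM-based quality check (~500ms): only for outputs that survived the gates
    relevance = evaluate(evaluator="relevance", input=question, output=output)
    return {"passed": relevance.score >= 0.7, "reason": relevance.reasoning}
```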
### 3. Cache Results
```python
# Evaluator results are cached by default
result1 = evaluate(evaluator="relevance", input="Q", output="A")  # API call
result2 = evaluate(evaluator="relevance", input="Q", output="A")  # Cached
```

## Best Practices
### 1. Start with Core Evaluators
Begin with these for most applications:
- `relevance`: Primary quality signal
- `helpfulness`: User value
- `toxicity`: Safety check
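A minimal sketch combining these three with `evaluate_multi`; the pass criteria below are illustrative, not defaults:

```python
from brokle.evaluation import evaluate_multi

scores = evaluate_multi(
    evaluators=["relevance", "helpfulness", "toxicity"],
    input="How do I cancel my subscription?",
    output="Go to Settings > Billing and click 'Cancel subscription'."
)

# Illustrative pass criteria: strong quality signals and low toxicity
passed = (
    scores["relevance"].score >= 0.7
    and scores["helpfulness"].score >= 0.7
    and scores["toxicity"].score <= 0.2
)
```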
### 2. Add Context When Available
```python
# Better accuracy with context
result = evaluate(
    evaluator="accuracy",
    input="...",
    output="...",
    context=source_documents  # Improves evaluation quality
)
```

### 3. Combine Multiple Signals
```python
# Composite quality score
scores = evaluate_multi(
    evaluators=["relevance", "helpfulness", "coherence"],
    input="...",
    output="..."
)

composite_score = (
    scores["relevance"].score * 0.4 +
    scores["helpfulness"].score * 0.4 +
    scores["coherence"].score * 0.2
)
```