# Scores

Add quality scores to traces to measure AI output quality, track metrics over time, and identify improvement opportunities.

Scores are numeric, categorical, or boolean quality measurements attached to traces and generations. They enable systematic quality tracking, regression detection, and performance optimization.

## Score Types

Brokle supports three score value types:

| Type | Values | Use Case |
|---|---|---|
| Numeric | 0.0 to 1.0 | Continuous quality metrics |
| Categorical | String enum | Classification labels |
| Boolean | true/false | Binary checks |

## Quick Start

### Initialize the Client

```python
from brokle import Brokle

client = Brokle(api_key="bk_...")
```

```typescript
import { Brokle } from 'brokle';

const client = new Brokle({ apiKey: 'bk_...' });
```

### Add a Score to a Trace
```python
# Add a quality score
client.score(
    trace_id="trace_abc123",
    name="relevance",
    value=0.85,
    comment="Response directly addresses the user's question"
)
```

```typescript
// Add a quality score
await client.score({
  traceId: 'trace_abc123',
  name: 'relevance',
  value: 0.85,
  comment: "Response directly addresses the user's question"
});
```

### View Scores in Dashboard

Navigate to Traces → Select a trace → Scores tab to see all attached scores.
## Adding Scores

### To Traces

Score an entire trace (conversation, request, or workflow):
```python
# Numeric score
client.score(
    trace_id="trace_123",
    name="overall_quality",
    value=0.92
)

# Categorical score
client.score(
    trace_id="trace_123",
    name="sentiment",
    value="positive"
)

# Boolean score
client.score(
    trace_id="trace_123",
    name="contains_pii",
    value=False
)
```

```typescript
// Numeric score
await client.score({
  traceId: 'trace_123',
  name: 'overall_quality',
  value: 0.92
});

// Categorical score
await client.score({
  traceId: 'trace_123',
  name: 'sentiment',
  value: 'positive'
});

// Boolean score
await client.score({
  traceId: 'trace_123',
  name: 'contains_pii',
  value: false
});
```

### To Generations

Score a specific LLM generation within a trace:
```python
# Score a specific generation
client.score(
    trace_id="trace_123",
    span_id="gen_456",  # Generation span ID
    name="factual_accuracy",
    value=0.95,
    comment="All claims verified against source documents"
)
```

```typescript
// Score a specific generation
await client.score({
  traceId: 'trace_123',
  spanId: 'gen_456', // Generation span ID
  name: 'factual_accuracy',
  value: 0.95,
  comment: 'All claims verified against source documents'
});
```

### Inline Scoring

Add scores during trace creation:
```python
with client.start_as_current_span(name="chat") as span:
    response = llm.generate(prompt)

    # Score inline
    span.score(name="response_length", value=len(response) / 1000)
    span.score(name="tone", value="professional")
```

```typescript
const span = client.startSpan({ name: 'chat' });
const response = await llm.generate(prompt);

// Score inline
span.score({ name: 'response_length', value: response.length / 1000 });
span.score({ name: 'tone', value: 'professional' });

span.end();
```

## Score Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| trace_id | string | Yes | The trace to score |
| span_id | string | No | Specific span within the trace |
| name | string | Yes | Score identifier |
| value | number/string/bool | Yes | The score value |
| comment | string | No | Human-readable explanation |
| source | string | No | Score origin (e.g., "model", "human", "api") |
| config_id | string | No | Evaluator configuration reference |
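
Taken together, a fully specified call might look like the sketch below. The span ID, evaluator name, and config reference are illustrative placeholders, not identifiers that ship with Brokle:

```python
# Illustrative values only: the span ID, evaluator name, and config
# reference are placeholders.
client.score(
    trace_id="trace_123",
    span_id="gen_456",
    name="relevance",
    value=0.85,
    comment="Answer addresses both parts of the question",
    source="evaluator:relevance-v1",
    config_id="cfg_relevance_default",
)
```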
## Common Score Patterns

### Quality Metrics

```python
# Relevance: How well does the response address the query?
client.score(trace_id, "relevance", value=0.85)

# Accuracy: Are the facts correct?
client.score(trace_id, "accuracy", value=0.92)

# Helpfulness: Did it solve the user's problem?
client.score(trace_id, "helpfulness", value=0.88)

# Coherence: Is the response well-structured?
client.score(trace_id, "coherence", value=0.90)
```

### Safety Checks
```python
# Toxicity detection
client.score(trace_id, "toxicity", value=0.02)

# PII detection
client.score(trace_id, "contains_pii", value=False)

# Prompt injection detection
client.score(trace_id, "injection_attempt", value=False)

# Content policy
client.score(trace_id, "policy_compliant", value=True)
```

### Performance Metrics
```python
# Response length (normalized and capped at 1.0)
client.score(trace_id, "response_length", value=min(len(response) / 2000, 1.0))

# Token efficiency
client.score(trace_id, "token_efficiency", value=output_tokens / max_tokens)

# First token latency (normalized to target, clamped to the 0-1 range)
client.score(trace_id, "ttft_score", value=max(0.0, 1.0 - (ttft / target_ttft)))
```

### RAG-Specific Scores
```python
# Retrieval relevance
client.score(trace_id, "retrieval_relevance", value=0.85)

# Groundedness: Is the answer supported by retrieved docs?
client.score(trace_id, "groundedness", value=0.92)

# Citation accuracy
client.score(trace_id, "citation_accuracy", value=0.88)

# Context utilization
client.score(trace_id, "context_utilization", value=0.75)
```

## Batch Scoring
Score multiple traces efficiently:

```python
# Score multiple traces
scores = [
    {"trace_id": "trace_1", "name": "quality", "value": 0.85},
    {"trace_id": "trace_2", "name": "quality", "value": 0.92},
    {"trace_id": "trace_3", "name": "quality", "value": 0.78},
]
client.score_batch(scores)
```

```typescript
// Score multiple traces
const scores = [
  { traceId: 'trace_1', name: 'quality', value: 0.85 },
  { traceId: 'trace_2', name: 'quality', value: 0.92 },
  { traceId: 'trace_3', name: 'quality', value: 0.78 },
];
await client.scoreBatch(scores);
```

## Score Sources
Track where scores come from:

```python
# From an automated evaluator
client.score(
    trace_id="trace_123",
    name="toxicity",
    value=0.02,
    source="evaluator:toxicity-v2"
)

# From human review
client.score(
    trace_id="trace_123",
    name="quality",
    value=0.95,
    source="human:reviewer_42"
)

# From the application
client.score(
    trace_id="trace_123",
    name="response_time",
    value=0.85,
    source="application"
)

# From LLM-as-judge
client.score(
    trace_id="trace_123",
    name="helpfulness",
    value=0.88,
    source="llm:gpt-4-judge"
)
```

## Querying Scores
### Via API

```python
# Get all scores for a trace
scores = client.get_scores(trace_id="trace_123")

for score in scores:
    print(f"{score.name}: {score.value}")

# Filter by score name
relevance_scores = client.get_scores(
    trace_id="trace_123",
    name="relevance"
)
```

### Via Dashboard

- Navigate to Traces → Select a trace
- Click the Scores tab
- View all attached scores with comments and sources

### Aggregations

```python
from datetime import datetime, timedelta

# Get average scores across traces
avg_scores = client.get_score_aggregations(
    project_id="proj_123",
    score_name="relevance",
    start_time=datetime.now() - timedelta(days=7),
    group_by="day"
)

for day in avg_scores:
    print(f"{day.date}: {day.average:.2f}")
```

## Score Normalization
Ensure consistent score ranges:

```python
def normalize_score(raw_value: float, min_val: float, max_val: float) -> float:
    """Normalize to the 0-1 range."""
    return (raw_value - min_val) / (max_val - min_val)

# Example: Normalize latency (lower is better)
latency_ms = 150
target_ms = 200
max_ms = 1000

# Invert so lower latency = higher score
latency_score = 1.0 - normalize_score(
    min(latency_ms, max_ms),
    0,
    max_ms
)

client.score(trace_id, "latency_score", value=latency_score)
```

## Score Thresholds and Alerts
Define quality thresholds:

```python
QUALITY_THRESHOLDS = {
    "relevance": 0.7,
    "accuracy": 0.85,
    "toxicity": 0.1,  # Max acceptable
}

def evaluate_and_alert(trace_id: str, scores: dict):
    for name, value in scores.items():
        client.score(trace_id=trace_id, name=name, value=value)

        threshold = QUALITY_THRESHOLDS.get(name)
        if threshold is not None:
            # For toxicity, alert if above threshold
            if name == "toxicity" and value > threshold:
                alert_low_quality(trace_id, name, value)
            # For others, alert if below threshold
            elif name != "toxicity" and value < threshold:
                alert_low_quality(trace_id, name, value)
```

Dashboard alerts for score thresholds are configured in Settings → Alerts. You can set up email, Slack, or webhook notifications.
## Best Practices

### 1. Use Consistent Naming

```python
# Good: Clear, consistent naming
client.score(trace_id, "response_relevance", value=0.85)
client.score(trace_id, "response_accuracy", value=0.92)

# Bad: Inconsistent naming
client.score(trace_id, "rel", value=0.85)
client.score(trace_id, "isAccurate", value=0.92)
```

### 2. Include Comments for Low Scores
```python
if relevance_score < 0.7:
    client.score(
        trace_id=trace_id,
        name="relevance",
        value=relevance_score,
        comment=f"Response missed key topics: {missing_topics}"
    )
```

### 3. Track Score Source
Always indicate where scores come from for debugging:

```python
client.score(
    trace_id=trace_id,
    name="quality",
    value=0.85,
    source="evaluator:relevance-v2.1"
)
```

### 4. Use Appropriate Value Types
```python
# Continuous metric → Numeric
client.score(trace_id, "similarity", value=0.87)

# Classification → Categorical
client.score(trace_id, "intent", value="question")

# Binary check → Boolean
client.score(trace_id, "contains_code", value=True)
```

## Troubleshooting
### Score Not Appearing

- Verify the `trace_id` exists
- Check that the API key has write permissions
- Ensure `client.flush()` was called (see the sketch below)
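
In a short-lived script, buffered scores may not be delivered before the process exits. A minimal sketch, assuming the client buffers scores and `flush()` forces delivery as the checklist above implies:

```python
from brokle import Brokle

client = Brokle(api_key="bk_...")

client.score(
    trace_id="trace_abc123",
    name="relevance",
    value=0.85,
)

# Flush buffered scores before the process exits; a short-lived script
# can otherwise terminate before the score is actually delivered.
client.flush()
```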
### Invalid Score Value

Numeric scores must be between 0.0 and 1.0:

```python
# Bad: Out of range
client.score(trace_id, "quality", value=85)  # Error!

# Good: Normalized
client.score(trace_id, "quality", value=0.85)
```

### Duplicate Scores
Multiple scores with the same name on one trace are allowed. Use `config_id` or `source` to distinguish them:

```python
# Multiple quality assessments from different sources
client.score(trace_id, "quality", value=0.85, source="evaluator:v1")
client.score(trace_id, "quality", value=0.82, source="evaluator:v2")
```
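
When reading these back, duplicates can be grouped by their origin. A minimal sketch using `get_scores` from the Querying section, assuming the returned score objects expose a `source` attribute mirroring the `source` parameter:

```python
from collections import defaultdict

# Fetch every "quality" score attached to the trace
scores = client.get_scores(trace_id="trace_123", name="quality")

# Group duplicate scores by where they came from
by_source = defaultdict(list)
for score in scores:
    by_source[score.source].append(score.value)

for source, values in by_source.items():
    print(f"{source}: {values}")
```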
## Next Steps

- Feedback - Collect user feedback
- Built-in Evaluators - Use pre-built scoring functions
- Custom Evaluators - Create your own evaluators