Evaluation
Run experiments, submit scores, and use built-in scorers to evaluate AI applications. Brokle supports two evaluation approaches:
- Span-based experiments — Score production LLM calls retrospectively using queried spans
- Dataset-based experiments — Run a task function against a dataset and score the outputs
Use client.experiments for running experiments, client.scores for submitting individual scores, and built-in scorers from brokle/scorers.
Quick Start: Span-Based Evaluation
Score production spans without re-executing your LLM calls:
```python
from brokle import Brokle
from brokle.scorers import Contains, LengthCheck

client = Brokle(api_key="bk_...")

# 1. Query production spans
spans = list(client.query.query_iter(
    filter="gen_ai.provider.name=openai",
))

# 2. Run scorers against them
results = client.experiments.run(
    name="production-quality-check",
    spans=spans,
    extract_input=lambda s: {"prompt": s.input},
    extract_output=lambda s: s.output,
    scorers=[Contains(substring="helpful"), LengthCheck(min_length=50)],
)

# 3. View results
print(results.summary)
print(results.url)  # Dashboard link
```

```typescript
import { Brokle } from 'brokle';
import { Contains, LengthCheck } from 'brokle/scorers';

const client = new Brokle({ apiKey: 'bk_...' });

// 1. Query production spans
const spans = [];
for await (const span of client.query.queryIter({
  filter: 'gen_ai.provider.name=openai',
})) {
  spans.push(span);
}

// 2. Run scorers against them
const results = await client.experiments.run({
  name: "production-quality-check",
  spans,
  extractInput: (s) => ({ prompt: s.input }),
  extractOutput: (s) => s.output,
  scorers: [Contains({ substring: "helpful" }), LengthCheck({ minLength: 50 })],
});

// 3. View results
console.log(results.summary);
console.log(results.url); // Dashboard link
```

Query production spans, then score them. This is span-based evaluation — no re-instrumentation needed. See Span Query for filter syntax and query options.
Quick Start: Dataset-Based Evaluation
Run a task function against a dataset and score the outputs:
```python
from brokle import Brokle
from brokle.scorers import ExactMatch

client = Brokle(api_key="bk_...")

dataset = client.datasets.get("01HXYZ...")

results = client.experiments.run(
    name="gpt4-accuracy-test",
    dataset=dataset,
    task=lambda input: call_llm(input["question"]),
    scorers=[ExactMatch()],
)

for name, stats in results.summary.items():
    print(f"{name}: mean={stats['mean']:.2f}, pass_rate={stats['pass_rate']:.2%}")
```

```typescript
import { Brokle } from 'brokle';
import { ExactMatch } from 'brokle/scorers';

const client = new Brokle({ apiKey: 'bk_...' });

const dataset = await client.datasets.get("01HXYZ...");

const results = await client.experiments.run({
  name: "gpt4-accuracy-test",
  dataset,
  task: async (input) => callLLM(input.question),
  scorers: [ExactMatch()],
});

for (const [name, stats] of Object.entries(results.summary)) {
  console.log(`${name}: mean=${stats.mean.toFixed(2)}, passRate=${(stats.passRate * 100).toFixed(0)}%`);
}
```

Running Experiments
Span-Based Experiments
Evaluate production spans without re-executing tasks. Provide spans, extractInput, and extractOutput.
```python
results = client.experiments.run(
    name="retrospective-analysis",
    spans=queried_spans,
    extract_input=lambda s: {"prompt": s.input},
    extract_output=lambda s: s.output,
    extract_expected=lambda s: s.attributes.get("expected"),  # optional
    scorers=[ExactMatch(), Contains(substring="answer")],
    max_concurrency=10,
    metadata={"team": "ml"},
    on_progress=lambda completed, total: print(f"{completed}/{total}"),
)
```

```typescript
const results = await client.experiments.run({
  name: "retrospective-analysis",
  spans: queriedSpans,
  extractInput: (s) => ({ prompt: s.input }),
  extractOutput: (s) => s.output,
  extractExpected: (s) => s.attributes.expected, // optional
  scorers: [ExactMatch(), Contains({ substring: "answer" })],
  maxConcurrency: 10,
  metadata: { team: "ml" },
  onProgress: (completed, total) => console.log(`${completed}/${total}`),
});
```

Dataset-Based Experiments
Run a task function on each dataset item and score the outputs. Provide dataset and task.
```python
results = client.experiments.run(
    name="my-evaluation",
    dataset=dataset,  # Dataset object or dataset ID string
    task=lambda input: call_llm(input["prompt"]),
    scorers=[ExactMatch(), Contains()],
    max_concurrency=10,
    trial_count=3,  # Run each item 3 times for variance
    metadata={"model": "gpt-4"},
    on_progress=lambda completed, total: print(f"{completed}/{total}"),
)
```

```typescript
const results = await client.experiments.run({
  name: "my-evaluation",
  dataset, // Dataset object or dataset ID string
  task: async (input) => callLLM(input.prompt),
  scorers: [ExactMatch(), Contains()],
  maxConcurrency: 10,
  trialCount: 3, // Run each item 3 times for variance
  metadata: { model: "gpt-4" },
  onProgress: (completed, total) => console.log(`${completed}/${total}`),
});
```

Run Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | string | — (required) | Experiment name |
| dataset | Dataset \| string | — | Dataset or ID (dataset-based mode) |
| task | (input) => output | — | Task function (dataset-based mode) |
| spans | QueriedSpan[] | — | Queried spans (span-based mode) |
| extractInput | (span) => object | — | Extract input from span (span-based mode) |
| extractOutput | (span) => any | — | Extract output from span (span-based mode) |
| extractExpected | (span) => any | None | Extract expected value from span (optional) |
| scorers | Scorer[] | — (required) | Scorers to apply |
| maxConcurrency | number | 10 | Max concurrent evaluations |
| trialCount | number | 1 | Trials per item (dataset-based only) |
| metadata | object | None | Additional experiment metadata |
| onProgress | (completed, total) => void | None | Progress callback |
dataset/task and spans/extractInput/extractOutput are mutually exclusive. Choose one mode per experiment.
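The mutual exclusivity above amounts to a simple dispatch on which arguments are present. A minimal sketch of that check (a hypothetical helper for illustration, not SDK code):

```python
def select_mode(dataset=None, task=None, spans=None):
    """Sketch of the dataset/task vs. spans dispatch described above."""
    if spans is not None and (dataset is not None or task is not None):
        raise ValueError("Provide either dataset/task or spans, not both")
    if spans is not None:
        return "spans"
    if dataset is not None and task is not None:
        return "dataset"
    raise ValueError("Provide a dataset with a task, or a list of spans")
```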
Experiment Results
The run method returns an EvaluationResults object:
| Field | Type | Description |
|---|---|---|
| experimentId | string | Created experiment ID |
| experimentName | string | Experiment name |
| datasetId | string? | Dataset ID (dataset-based only) |
| source | "dataset" \| "spans" | Which mode was used |
| url | string? | Dashboard link to view results |
| summary | Record<string, SummaryStats> | Per-scorer statistics |
| items | EvaluationItem[] | Individual item results |
SummaryStats for each scorer:
| Field | Type | Description |
|---|---|---|
| mean | number | Average score |
| stdDev | number | Standard deviation |
| min | number | Minimum score |
| max | number | Maximum score |
| count | number | Total items scored |
| passRate | number | Fraction that scored successfully (0-1) |
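For intuition, the per-scorer statistics map onto Python's statistics module roughly like this — a sketch only, assuming `scores` holds one raw value per evaluated item, with `None` standing in for a failed evaluation:

```python
from statistics import mean, pstdev

def summarize(scores):
    # Aggregate raw scorer values into SummaryStats-like fields,
    # counting None as a failed evaluation for passRate.
    ok = [s for s in scores if s is not None]
    return {
        "mean": mean(ok),
        "stdDev": pstdev(ok),
        "min": min(ok),
        "max": max(ok),
        "count": len(scores),
        "passRate": len(ok) / len(scores),
    }
```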
Managing Experiments
Get & List
```python
# Get by ID
experiment = client.experiments.get("01HXYZ...")
print(experiment.name, experiment.status)

# List all experiments
experiments = client.experiments.list(limit=10, page=1)
```

```typescript
// Get by ID
const experiment = await client.experiments.get("01HXYZ...");
console.log(experiment.name, experiment.status);

// List all experiments
const experiments = await client.experiments.list({ limit: 10, page: 1 });
```

Compare Experiments
Compare score metrics across multiple experiments. Optionally specify a baseline for calculating differences.
```python
# Compare two experiments
result = client.experiments.compare(["exp_id_1", "exp_id_2"])

# Compare with baseline
result = client.experiments.compare(
    ["exp_id_1", "exp_id_2", "exp_id_3"],
    baseline_id="exp_id_1",
)
print(result.scores)  # Per-scorer aggregations per experiment
print(result.diffs)   # Differences from baseline
```

```typescript
// Compare two experiments
const result = await client.experiments.compare(["exp_id_1", "exp_id_2"]);

// Compare with baseline
const withBaseline = await client.experiments.compare(
  ["exp_id_1", "exp_id_2", "exp_id_3"],
  { baselineId: "exp_id_1" },
);
console.log(withBaseline.scores); // Per-scorer aggregations per experiment
console.log(withBaseline.diffs);  // Differences from baseline
```

Re-run Experiments
Create a new experiment based on an existing one.
```python
# Re-run with a new name
new_exp = client.experiments.rerun("01HXYZ...", name="my-experiment-v2")

# Re-run with auto-generated name
rerun = client.experiments.rerun("01HXYZ...")
```

```typescript
// Re-run with a new name
const newExp = await client.experiments.rerun("01HXYZ...", {
  name: "my-experiment-v2",
});

// Re-run with auto-generated name
const rerun = await client.experiments.rerun("01HXYZ...");
```

Submitting Scores
Use client.scores to submit individual scores to traces or spans outside of experiments.
Direct Score
```python
client.scores.submit(
    trace_id="abc123",
    name="accuracy",
    value=0.95,
    type="NUMERIC",
    source="code",
    reason="High quality response",
    span_id="span456",  # optional: score a specific span
)
```

```typescript
import { ScoreType, ScoreSource } from 'brokle'; // import path assumed

await client.scores.submit({
  traceId: "abc123",
  name: "accuracy",
  value: 0.95,
  type: ScoreType.NUMERIC,
  source: ScoreSource.CODE,
  reason: "High quality response",
  spanId: "span456", // optional: score a specific span
});
```

Score with a Scorer
```python
from brokle.scorers import ExactMatch

exact = ExactMatch(name="answer_match")

client.scores.submit(
    trace_id="abc123",
    scorer=exact,
    output="Paris",
    expected="Paris",
)
```

```typescript
import { ExactMatch } from 'brokle/scorers';

const exact = ExactMatch({ name: "answer_match" });

await client.scores.submit({
  traceId: "abc123",
  scorer: exact,
  output: "Paris",
  expected: "Paris",
});
```

Batch Scores
Submit multiple scores at once.
```python
result = client.scores.batch([
    {"trace_id": "abc123", "name": "accuracy", "value": 0.9},
    {"trace_id": "abc123", "name": "fluency", "value": 0.85},
    {"trace_id": "def456", "name": "relevance", "value": 0.95},
])
print(f"Created {result['created']} scores")
```

```typescript
const result = await client.scores.batch([
  { traceId: "abc123", name: "accuracy", value: 0.9 },
  { traceId: "abc123", name: "fluency", value: 0.85 },
  { traceId: "def456", name: "relevance", value: 0.95 },
]);
console.log(`Created ${result.created} scores`);
```

Built-in Scorers
Brokle ships with heuristic scorers, LLM-as-Judge scorers, and pre-built evaluators.
Heuristic Scorers
| Scorer | What it Does | Return Type | Options |
|---|---|---|---|
| ExactMatch | output === expected (string comparison) | BOOLEAN | name?, caseSensitive? (default: true) |
| Contains | Output includes substring | BOOLEAN | name?, caseSensitive?, substring? |
| RegexMatch | Output matches regex pattern | BOOLEAN | pattern (required), name? |
| JSONValid | Output is valid JSON | BOOLEAN | name? |
| LengthCheck | Output within min/max length | BOOLEAN | minLength?, maxLength?, name? |
```python
from brokle.scorers import ExactMatch, Contains, RegexMatch, JSONValid, LengthCheck

# Exact match (case-insensitive)
exact = ExactMatch(name="answer_match", case_sensitive=False)

# Substring check
contains = Contains(substring="hello")

# Regex pattern
email_check = RegexMatch(pattern=r"[a-z]+@[a-z]+\.[a-z]+", name="has_email")

# JSON validation
json_check = JSONValid()

# Length bounds
length = LengthCheck(min_length=10, max_length=1000)
```

```typescript
import { ExactMatch, Contains, RegexMatch, JSONValid, LengthCheck } from 'brokle/scorers';

// Exact match (case-insensitive)
const exact = ExactMatch({ name: "answer_match", caseSensitive: false });

// Substring check
const contains = Contains({ substring: "hello" });

// Regex pattern
const emailCheck = RegexMatch({ pattern: /[a-z]+@[a-z]+\.[a-z]+/i, name: "has_email" });

// JSON validation
const jsonCheck = JSONValid();

// Length bounds
const length = LengthCheck({ minLength: 10, maxLength: 1000 });
```

LLM-as-Judge Scorer
Use an LLM to evaluate outputs with custom prompts. Uses your project's AI credentials configured in the Brokle dashboard.
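The {{input}} and {{output}} placeholders in the prompt are filled from each item before the judge model is called. A minimal sketch of that substitution (illustrative only, not the SDK's actual renderer):

```python
def render_prompt(template, input_value, output_value):
    # Fill the {{input}}/{{output}} placeholders used by LLMScorer prompts
    return (template
            .replace("{{input}}", str(input_value))
            .replace("{{output}}", str(output_value)))
```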
```python
from brokle.scorers import LLMScorer

relevance = LLMScorer(
    client=client,
    name="relevance",
    prompt="Rate the relevance of this response 0-10:\n\nInput: {{input}}\nOutput: {{output}}",
    model="gpt-4o",
)

# Use in experiments
results = client.experiments.run(
    name="relevance-check",
    spans=spans,
    extract_input=lambda s: {"prompt": s.input},
    extract_output=lambda s: s.output,
    scorers=[relevance],
)
```

```typescript
import { LLMScorer } from 'brokle/scorers';

const relevance = LLMScorer({
  client: { apiKey: 'bk_...', baseUrl: 'https://api.brokle.com' },
  name: 'relevance',
  prompt: 'Rate the relevance of this response 0-10:\n\nInput: {{input}}\nOutput: {{output}}',
  model: 'gpt-4o',
});

// Use in experiments
const results = await client.experiments.run({
  name: "relevance-check",
  spans,
  extractInput: (s) => ({ prompt: s.input }),
  extractOutput: (s) => s.output,
  scorers: [relevance],
});
```

Pre-built Evaluators
Ready-to-use LLM-as-Judge evaluators with standardized prompts for common evaluation criteria.
| Category | Evaluators | Description |
|---|---|---|
| Factuality | Factuality, Hallucination | Factual accuracy and hallucination detection |
| Relevance | Relevance, AnswerRelevance | Response and Q&A relevance |
| Quality | Coherence, Fluency, Completeness | Writing quality metrics |
| Safety | Safety, Toxicity | Content safety and toxicity |
| RAG | ContextPrecision, ContextRecall, Faithfulness | RAG pipeline quality |
```python
from brokle.scorers import Factuality, Relevance, Coherence, Safety

# Create evaluators (all use your project's AI credentials)
factuality = Factuality(client=client, model="gpt-4o")
relevance = Relevance(client=client, model="gpt-4o")
coherence = Coherence(client=client, model="gpt-4o")
safety = Safety(client=client, model="gpt-4o")

# Use in experiments
results = client.experiments.run(
    name="comprehensive-eval",
    spans=spans,
    extract_input=lambda s: {"prompt": s.input},
    extract_output=lambda s: s.output,
    scorers=[factuality, relevance, coherence, safety],
)
```

```typescript
import { Factuality, Relevance, Coherence, Safety } from 'brokle/scorers';

const config = { apiKey: 'bk_...', baseUrl: 'https://api.brokle.com' };

// Create evaluators
const factuality = Factuality({ client: config, model: 'gpt-4o' });
const relevance = Relevance({ client: config, model: 'gpt-4o' });
const coherence = Coherence({ client: config, model: 'gpt-4o' });
const safety = Safety({ client: config, model: 'gpt-4o' });

// Use in experiments
const results = await client.experiments.run({
  name: "comprehensive-eval",
  spans,
  extractInput: (s) => ({ prompt: s.input }),
  extractOutput: (s) => s.output,
  scorers: [factuality, relevance, coherence, safety],
});
```

You can also create evaluators dynamically by name:
```python
from brokle.scorers import create_evaluator, list_evaluators

# List available evaluators
print(list_evaluators())
# ['factuality', 'hallucination', 'relevance', 'answer_relevance', ...]

# Create by name
evaluator = create_evaluator("factuality", client=client, model="gpt-4o")
```

```typescript
import { createEvaluator, listEvaluators } from 'brokle/scorers';

// List available evaluators
console.log(listEvaluators());
// ['factuality', 'hallucination', 'relevance', 'answer_relevance', ...]

// Create by name
const evaluator = createEvaluator("factuality", {
  client: { apiKey: 'bk_...', baseUrl: 'https://api.brokle.com' },
  model: 'gpt-4o',
});
```

Custom Scorers
Create custom scorers using the scorer decorator/factory:
```python
from brokle.scorers import scorer, multi_scorer

# Single-score custom scorer
@scorer
def word_count(output, expected=None, **kwargs):
    count = len(str(output).split())
    return count / 100  # Normalize to 0-1

# Multi-score custom scorer
@multi_scorer
def quality_metrics(output, expected=None, **kwargs):
    text = str(output)
    return [
        {"name": "word_count", "value": len(text.split()) / 100, "type": "NUMERIC"},
        {"name": "has_greeting", "value": 1 if "hello" in text.lower() else 0, "type": "BOOLEAN"},
    ]
```

```typescript
import { ScoreType } from 'brokle'; // import path assumed
import { scorer, multiScorer } from 'brokle/scorers';

// Single-score custom scorer
const wordCount = scorer("word_count", ({ output }) => {
  const count = String(output ?? "").split(/\s+/).length;
  return count / 100; // Normalize to 0-1
});

// Multi-score custom scorer
const qualityMetrics = multiScorer("quality", ({ output }) => {
  const text = String(output ?? "");
  return [
    { name: "word_count", value: text.split(/\s+/).length / 100, type: ScoreType.NUMERIC },
    { name: "has_greeting", value: text.toLowerCase().includes("hello") ? 1 : 0, type: ScoreType.BOOLEAN },
  ];
});
```

Score Types & Sources
Score Types
| Type | Value | Description |
|---|---|---|
| NUMERIC | number (0-1 typical) | Continuous score |
| BOOLEAN | 0 or 1 | Pass/fail |
| CATEGORICAL | number + stringValue | Named category |
Score Sources
| Source | Description |
|---|---|
| code | Computed by code (heuristic scorers) |
| llm | Evaluated by LLM (LLM-as-Judge) |
| human | Provided by human reviewers |
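Putting the two tables together, a score payload pairs a type with a compatible value. A hedged validation sketch (field names assumed to mirror the tables above, not SDK code):

```python
VALID_SOURCES = {"code", "llm", "human"}

def validate_score(score):
    # Check a score dict against the type and source tables above
    t = score.get("type", "NUMERIC")
    if t == "BOOLEAN" and score["value"] not in (0, 1):
        raise ValueError("BOOLEAN scores must be 0 or 1")
    if t == "CATEGORICAL" and "stringValue" not in score:
        raise ValueError("CATEGORICAL scores need a stringValue")
    if score.get("source", "code") not in VALID_SOURCES:
        raise ValueError("Unknown score source")
    return score
```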
ExperimentsManager Reference
| Method | Parameters | Returns | Description |
|---|---|---|---|
| run | See Run Options | EvaluationResults | Run an experiment |
| get | experimentId | Experiment | Get experiment by ID |
| list | limit?, page? | Experiment[] | List experiments |
| compare | experimentIds[], baselineId? | ComparisonResult | Compare experiments |
| rerun | experimentId, name?, description?, metadata? | Experiment | Re-run an experiment |
ScoresManager Reference
| Method | Parameters | Returns | Description |
|---|---|---|---|
| submit | traceId, name?, value?, scorer?, output?, expected?, type?, source?, spanId?, reason?, metadata? | ScoreResponse | Submit a score |
| batch | scores[] — {traceId, name, value, type?, source?, spanId?, reason?, metadata?} | { created: number } | Submit multiple scores |
Related
- Span Query — Query production spans for span-based evaluation
- Datasets — Create and manage evaluation datasets
- Annotation Queues — Route items to human reviewers
- Evaluation Concepts — Conceptual overview of evaluation workflows