Datasets and Goldens

Store your evaluation inputs (goldens) in TestRelic so they are versioned, shared across your team, and pinnable to a specific eval run. You manage datasets with the testrelic.datasets helpers; TestRelic handles storage in the cloud.

Note: Pulling a dataset returns a DeepEval EvaluationDataset, so DeepEval must be installed: pip install "testrelic-deepeval[deepeval]".

Push a dataset

Each push creates a new version under the same alias:

push_dataset.py
import testrelic

testrelic.datasets.push(
    alias="customer-support-goldens",
    label="latest",
    description="Top 50 support questions with verified answers",
    goldens=[
        {
            "input": "How do I reset my password?",
            "expected_output": "Use the 'Forgot password' link on the login page.",
        },
        {
            "input": "What is your refund policy?",
            "expected_output": "Full refund within 30 days of purchase.",
        },
    ],
)

A label such as latest, production, or experiment-A moves to the new version atomically, so consumers reading that label always get a consistent snapshot.
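
The label behaviour described above can be pictured with a small in-memory model. This is only an illustrative sketch, not how TestRelic is implemented: the `ToyDatasetStore` class and its `push` and `pull` methods are hypothetical stand-ins, and the real service keeps versions in the cloud.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDatasetStore:
    """Hypothetical in-memory stand-in for one dataset alias (not the TestRelic API)."""

    versions: list = field(default_factory=list)  # immutable snapshots, oldest first
    labels: dict = field(default_factory=dict)    # label -> index into versions

    def push(self, goldens, label):
        # Store the new snapshot first, as an immutable tuple of copies...
        self.versions.append(tuple(dict(g) for g in goldens))
        # ...then repoint the label in a single assignment, so a reader
        # resolves either the old version or the new one, never a mix.
        self.labels[label] = len(self.versions) - 1
        return self.labels[label]

    def pull(self, label):
        # Resolve the label to a version index, then return that snapshot.
        return self.versions[self.labels[label]]


store = ToyDatasetStore()
first = {"input": "How do I reset my password?",
         "expected_output": "Use the 'Forgot password' link on the login page."}
second = {"input": "What is your refund policy?",
          "expected_output": "Full refund within 30 days of purchase."}

store.push([first], label="v2026-05")       # pinned label stays on this version
store.push([first, second], label="latest")  # a later push only moves "latest"

print(len(store.pull("v2026-05")), len(store.pull("latest")))
```

In this toy model, a consumer that reads `v2026-05` keeps seeing the one-golden snapshot even after newer versions are pushed under `latest`.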

Pull a dataset

Pulling returns a DeepEval EvaluationDataset you can iterate as usual:

run_eval.py
import testrelic
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

dataset = testrelic.datasets.pull("customer-support-goldens", label="latest")

for golden in dataset.evals_iterator(
    metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()],
):
    # my_llm_pipeline is your own application function; it is not part of testrelic.
    golden.actual_output = my_llm_pipeline(golden.input)

Pin a run to a specific version

To keep a CI run from moving with latest, pull a fixed label instead:

run_eval.py
dataset = testrelic.datasets.pull("customer-support-goldens", label="v2026-05")
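
One way to switch between a pinned label in CI and latest during local development is to read the label from the environment. The sketch below assumes a hypothetical `GOLDENS_LABEL` environment variable and a hypothetical `resolve_label` helper; neither is part of TestRelic.

```python
import os


def resolve_label(env=None, default="latest"):
    """Return the dataset label to pull, preferring a pinned value from the environment."""
    env = os.environ if env is None else env
    # GOLDENS_LABEL is a hypothetical variable name chosen for this example.
    label = env.get("GOLDENS_LABEL", "").strip()
    # Fall back to the default when the variable is unset or blank.
    return label or default


# In CI you might export GOLDENS_LABEL=v2026-05; locally it can stay unset.
print(resolve_label({"GOLDENS_LABEL": "v2026-05"}))
print(resolve_label({}))
```

The returned string can then be passed as the `label` argument of `testrelic.datasets.pull`, so the same script serves both pinned and floating runs.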

List datasets

list_datasets.py
import testrelic

for ds in testrelic.datasets.list_datasets():
    print(ds["alias"], ds["latestVersion"], ds["goldensCount"])
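
If the listing is long, aligned columns can be easier to scan. The following is a minimal sketch that assumes each entry is a dict with the `alias`, `latestVersion`, and `goldensCount` keys used above; the `format_listing` helper and the sample values are made up for illustration.

```python
def format_listing(datasets):
    """Render dataset entries as left-aligned text columns."""
    header = ("alias", "latest version", "goldens")
    rows = [
        (str(ds["alias"]), str(ds["latestVersion"]), str(ds["goldensCount"]))
        for ds in datasets
    ]
    table = [header, *rows]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    )


# Illustrative entry only; real values come from testrelic.datasets.list_datasets().
sample = [{"alias": "customer-support-goldens", "latestVersion": 3, "goldensCount": 50}]
print(format_listing(sample))
```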

Next steps