Datasets and Goldens

Store your evaluation inputs (goldens) in TestRelic so they are versioned, shared across your team, and pinnable to a specific eval run. You manage datasets with the testrelic.datasets helpers; TestRelic handles storage in the cloud.

Note: Pulling a dataset returns a DeepEval EvaluationDataset, so DeepEval must be installed:

pip install "testrelic-deepeval[deepeval]"

Push a dataset

Each push creates a new version under the same alias:

push_dataset.py
import testrelic

testrelic.datasets.push(
    alias="customer-support-goldens",
    label="latest",
    description="Top 50 support questions with verified answers",
    goldens=[
        {
            "input": "How do I reset my password?",
            "expected_output": "Use the 'Forgot password' link on the login page.",
        },
        {
            "input": "What is your refund policy?",
            "expected_output": "Full refund within 30 days of purchase.",
        },
    ],
)

The label you pass (for example latest, production, or experiment-A) moves to the new version atomically, so consumers reading that label always get a consistent snapshot: either the previous version or the new one, never a mix of the two.
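
The relationship between versions and labels can be pictured with a small in-memory model. This is only an illustrative sketch, not TestRelic's implementation: the DatasetHistory class and its fields are hypothetical, and it assumes versions are numbered from 1 and never change once written.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetHistory:
    """Toy model of one alias: append-only versions plus movable labels."""

    versions: list = field(default_factory=list)  # versions[0] is version 1
    labels: dict = field(default_factory=dict)    # label -> version number

    def push(self, goldens, label="latest"):
        # Copy the goldens so later changes to the caller's list
        # cannot alter a stored version.
        self.versions.append([dict(g) for g in goldens])
        version = len(self.versions)
        # One assignment re-points the label: a reader sees either the
        # old version or the new one.
        self.labels[label] = version
        return version

    def pull(self, label="latest"):
        return self.versions[self.labels[label] - 1]


history = DatasetHistory()
history.push([{"input": "q1", "expected_output": "a1"}], label="production")
history.push([{"input": "q1", "expected_output": "a1 (revised)"}], label="latest")
```

In this model, pulling production still returns the first version after the second push, because only the latest label was moved.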

Pull a dataset

Pulling returns a DeepEval EvaluationDataset you can iterate as usual:

run_eval.py
import testrelic
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

dataset = testrelic.datasets.pull("customer-support-goldens", label="latest")

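# Iterate over the goldens, passing the metrics to evaluate with.
for golden in dataset.evals_iterator(
    metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()],
):
    # my_llm_pipeline is a placeholder for your own application code:
    # it takes the golden's input and returns the model's answer.
    golden.actual_output = my_llm_pipeline(golden.input)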

Pin a run to a specific version

To keep a CI run from moving with latest, pull a fixed label instead:

run_eval.py
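import testrelic

# "v2026-05" is a fixed label, so this pull does not follow later pushes to latest.
dataset = testrelic.datasets.pull("customer-support-goldens", label="v2026-05")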
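
One way to switch between a pinned label in CI and latest during local work is to read the label from an environment variable. The sketch below is a suggestion rather than a TestRelic feature: the variable name EVAL_DATASET_LABEL and the resolve_label helper are hypothetical.

```python
import os

DEFAULT_LABEL = "latest"
LABEL_ENV_VAR = "EVAL_DATASET_LABEL"  # hypothetical variable name


def resolve_label(environ=None):
    """Return the label to pull, preferring an explicit pin from the environment."""
    env = os.environ if environ is None else environ
    label = env.get(LABEL_ENV_VAR, "").strip()
    # Fall back to the default when the variable is unset or blank.
    return label or DEFAULT_LABEL


# Usage (needs testrelic installed, so it is left as a comment here):
# dataset = testrelic.datasets.pull(
#     "customer-support-goldens", label=resolve_label()
# )
```

Setting EVAL_DATASET_LABEL=v2026-05 in the CI job would then pin the run, while local runs without the variable keep following latest.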

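List datasets

Each record returned by list_datasets() carries the dataset's alias, latest version, and golden count: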

list_datasets.py
import testrelic

for ds in testrelic.datasets.list_datasets():
    print(ds["alias"], ds["latestVersion"], ds["goldensCount"])
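
For a larger number of datasets, an aligned table is easier to scan than raw prints. The following is a minimal sketch that assumes each record is a dict with the alias, latestVersion, and goldensCount keys used above. The format_datasets helper and the sample records are illustrative only.

```python
def format_datasets(records):
    """Render dataset records as an aligned plain-text table."""
    header = ("ALIAS", "VERSION", "GOLDENS")
    rows = [
        (str(r["alias"]), str(r["latestVersion"]), str(r["goldensCount"]))
        for r in records
    ]
    table = [header, *rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(3)]
    lines = [
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    ]
    return "\n".join(lines)


# Made-up sample records; real ones would come from
# testrelic.datasets.list_datasets().
sample = [
    {"alias": "customer-support-goldens", "latestVersion": 3, "goldensCount": 50},
    {"alias": "billing-faq", "latestVersion": 1, "goldensCount": 12},
]
print(format_datasets(sample))
```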
