LLM Evaluations

TestRelic ingests LLM-evaluation runs alongside your browser and API tests. Eval runs come from the testrelic-deepeval Python SDK and render through the same shared run UI as regular tests, so there is no separate, bespoke run interface to learn.

What an eval run is

An eval run is a single execution of an LLM-evaluation suite, made up of one or more evaluation cases. Each case records whether it passed or failed against its metrics, the same way a test case records pass/fail/flaky status. Because eval runs share the run model with tests, they flow into the platform's existing views automatically.
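
To make that shape concrete, here is a minimal Python sketch of an eval run as a collection of cases. The class and field names (`EvalCase`, `EvalRun`, `metric_results`) are hypothetical and chosen only for illustration; they are not TestRelic's or DeepEval's actual schema. The sketch also assumes that a case passes only when every one of its metrics passes.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One evaluation case. Field names are illustrative, not a real schema."""
    name: str
    # Maps a metric name to whether the case passed that metric.
    metric_results: dict[str, bool] = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        # Assumption: a case passes only if it has metrics and all of them pass.
        return bool(self.metric_results) and all(self.metric_results.values())


@dataclass
class EvalRun:
    """A single execution of an eval suite, made up of one or more cases."""
    suite: str
    cases: list[EvalCase]

    def failed_cases(self) -> list[str]:
        # Names of the cases that did not pass, in suite order.
        return [c.name for c in self.cases if not c.passed]


run = EvalRun(
    suite="support-bot",
    cases=[
        EvalCase("greeting", {"relevancy": True, "toxicity": True}),
        EvalCase("refund-policy", {"relevancy": True, "faithfulness": False}),
    ],
)
print(run.failed_cases())  # ['refund-policy']
```

Keeping pass/fail on the case rather than on the run mirrors how a test case carries its own status, which is what lets eval runs reuse the same run model.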

Shared Run Detail and Session Workspace

Eval runs open in the standard Run Detail page and Session Workspace. From a run you can drill into per-case detail and review each case's Test History over time — exactly as you would for a browser or API test.

Evaluations workspace

DeepEval runs also get a dedicated Evaluations workspace focused on eval-specific analysis. At the top, an Evaluations KPI summary strip rolls up the headline numbers for the current view — pass rate, case counts, and stability at a glance — before you drill into individual runs and cases.
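
As a rough illustration of what such a rollup involves, the sketch below computes case counts and pass rate from a list of case outcomes. The function name `summarize` and the input shape are assumptions made for this example, not the platform's implementation.

```python
def summarize(outcomes: list[bool]) -> dict[str, float]:
    """Roll up case outcomes (True = passed) into headline numbers."""
    total = len(outcomes)
    passed = sum(outcomes)
    return {
        "total_cases": total,
        "passed": passed,
        "failed": total - passed,
        # Guard against an empty view so we never divide by zero.
        "pass_rate": passed / total if total else 0.0,
    }


kpis = summarize([True, True, False, True])
print(kpis)  # {'total_cases': 4, 'passed': 3, 'failed': 1, 'pass_rate': 0.75}
```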

Repository Evaluations tab

A repository that contains eval data gets an Evaluations tab alongside its Test Cases and Test Runs tabs. The tab lists that repository's eval runs with per-repo eval stats and tags, so you can scan stability and pass rate and filter by tag without leaving the repository. Click through to per-case detail and per-case Test History. See Repositories for the full repository-detail layout.

Unified Test Runs feed

The org-wide Test Runs dashboard can include eval rows through an opt-in "include evals" toggle. When the toggle is enabled, eval runs appear in the unified feed under their eval names, mixed in with your browser and API runs.

Eval Stability

Eval Stability is a metric surfaced for eval runs and repositories. It tracks how consistently eval cases pass across runs over time — analogous to flakiness and pass-rate stability for tests. Use it to spot evaluations whose outcomes drift or fluctuate between runs rather than holding steady.

Note: Eval Stability is a qualitative indicator of consistency over a window of recent runs, not a single-run score. Like other health metrics, it updates as new eval runs are ingested.
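
This page does not define how Eval Stability is calculated, so the following is only one plausible way to quantify consistency over a window of runs, not TestRelic's formula. For a single case, it takes the recent outcomes in chronological order and measures how often the outcome changes between consecutive runs. The name `flip_rate` and the default window size are hypothetical.

```python
def flip_rate(history: list[bool], window: int = 10) -> float:
    """Fraction of consecutive run pairs whose outcome changed.

    history: pass/fail outcomes for one case, oldest first.
    Returns 0.0 for a perfectly steady case and 1.0 when every run flips.
    """
    recent = history[-window:]
    if len(recent) < 2:
        return 0.0  # Not enough runs to observe any change.
    flips = sum(a != b for a, b in zip(recent, recent[1:]))
    return flips / (len(recent) - 1)


steady = [True, True, True, True, True]
drifting = [True, False, True, True, False]
print(flip_rate(steady))    # 0.0
print(flip_rate(drifting))  # 0.75
```

A measure like this captures consistency only: a case that fails on every run also scores 0.0, so it would need to be read alongside pass rate.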

Sending evals from the SDK

To send evals to TestRelic, use the testrelic-deepeval Python SDK. Configure the SDK with your repository's API key, then run your evaluations; the resulting runs appear in the views described above.

See DeepEval / LLM Evaluations for installation and configuration.
