LLM Evaluations
TestRelic ingests LLM-evaluation runs alongside your browser and API tests. Eval runs come from the testrelic-deepeval Python SDK and render through the same run UI as regular tests, so there is no separate evaluation interface to learn.
What an eval run is
An eval run is a single execution of an LLM-evaluation suite, made up of one or more evaluation cases. Each case records whether it passed or failed against its metrics, the same way a test case records pass/fail/flaky status. Because eval runs share the run model with tests, they flow into the platform's existing views automatically.
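The run-and-case structure described above can be pictured with a small sketch. This is illustrative only and is not TestRelic's actual schema: the class names (EvalRun, EvalCase, MetricResult), the field names, and the rule that a case passes only when every metric meets its threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class CaseStatus(Enum):
    PASSED = "passed"
    FAILED = "failed"


@dataclass
class MetricResult:
    """One metric's outcome for a case (hypothetical shape)."""
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # Assumption: a metric passes when its score reaches the threshold.
        return self.score >= self.threshold


@dataclass
class EvalCase:
    """A single evaluation case and the metrics it was judged against."""
    name: str
    metrics: list[MetricResult] = field(default_factory=list)

    @property
    def status(self) -> CaseStatus:
        # Assumption: a case passes only if every one of its metrics passes.
        ok = all(m.passed for m in self.metrics)
        return CaseStatus.PASSED if ok else CaseStatus.FAILED


@dataclass
class EvalRun:
    """One execution of an evaluation suite, made up of one or more cases."""
    suite: str
    cases: list[EvalCase]

    def summary(self) -> dict[str, int]:
        # Count cases by status, mirroring a pass/fail rollup for a run.
        counts = {status.value: 0 for status in CaseStatus}
        for case in self.cases:
            counts[case.status.value] += 1
        return counts


run = EvalRun(
    suite="support-bot-evals",  # hypothetical suite name
    cases=[
        EvalCase("refund question", [MetricResult("relevancy", 0.91, 0.7)]),
        EvalCase("greeting", [MetricResult("relevancy", 0.55, 0.7)]),
    ],
)
print(run.summary())
```

Modeling a case's status as derived from its metrics, rather than stored separately, keeps the run-level rollup consistent with the per-case detail.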
Shared Run Detail and Session Workspace
Eval runs open in the standard Run Detail page and Session Workspace. From a run you can drill into per-case detail and review each case's Test History over time — exactly as you would for a browser or API test.
Repository Evaluations tab
A repository that contains eval data gets an Evaluations tab alongside its Test Cases and Test Runs tabs. The tab lists that repository's eval runs; from any run, click through to per-case detail and per-case Test History. See Repositories for the full repository-detail layout.
Unified Test Runs feed
The org-wide Test Runs dashboard leaves eval runs out unless you turn on the opt-in include evals toggle. When the toggle is enabled, eval runs appear in the unified feed with eval-specific naming, mixed in with your browser and API runs.
Eval Stability
Eval Stability is a metric surfaced for eval runs and repositories. It tracks how consistently eval cases pass across runs over time — analogous to flakiness and pass-rate stability for tests. Use it to spot evaluations whose outcomes drift or fluctuate between runs rather than holding steady.
Eval Stability is a qualitative indicator of consistency over a window of recent runs, not a single-run score. Like other health metrics, it updates as new eval runs are ingested.
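To make the idea of consistency over a window concrete, here is a minimal sketch of one way such an indicator could be derived for a single eval case. It is not TestRelic's actual calculation, which this page does not specify: the flip-rate measure, the default window of 10 runs, the 0.25 cutoff, and the label names are all assumptions chosen for illustration.

```python
from collections.abc import Sequence


def flip_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of consecutive run pairs whose pass/fail outcome differs."""
    if len(outcomes) < 2:
        return 0.0
    flips = sum(1 for prev, cur in zip(outcomes, outcomes[1:]) if prev != cur)
    return flips / (len(outcomes) - 1)


def stability_label(outcomes: Sequence[bool], window: int = 10) -> str:
    """Map a case's recent outcomes (oldest first) to a qualitative label.

    The window size, cutoff, and labels are hypothetical.
    """
    recent = outcomes[-window:]
    if len(recent) < 2:
        # A single run cannot show consistency or drift.
        return "insufficient data"
    rate = flip_rate(recent)
    if rate == 0.0:
        return "stable"
    if rate <= 0.25:
        return "mostly stable"
    return "unstable"


history = [True, True, True, True, True, True, True, True, False]
print(stability_label(history))
```

Note that under this sketch a case that fails on every run is also labeled stable, because the measure looks at whether outcomes change between runs, not at the pass rate itself.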
Sending evals from the SDK
You send evals to TestRelic with the testrelic-deepeval Python SDK. Configure the SDK with your repository's API key and run your evaluations; the runs then appear in the views described above.
See DeepEval / LLM Evaluations for installation and configuration.