LLM-judge (semantic)¶
Five optional test types that use an LLM-as-judge for checks no
deterministic rule can express. Requires the [judge] extra and an
Anthropic API key.
When to use¶
Reach for LLM-judge tests when:
- You need to check semantic equivalence ("the response means the same thing as the reference answer, even if worded differently")
- You need to check brand voice / tone alignment
- You need to check hallucination grounding (response derived from provided context)
- You need to check policy compliance with a stated policy
- You need to check the quality of a refusal, not just whether one happened
Cost expectations¶
The default judge model is Anthropic Haiku 4.5. Each invocation costs roughly
$0.003, so a typical suite of 5 cases costs about $0.02 per run. Costs
scale linearly with case count and run frequency.
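The linear scaling can be sketched as a back-of-the-envelope estimate. The helper below is hypothetical (it is not part of pytest_wardenbot) and assumes one judge invocation per case per run, priced at the approximate figure quoted above.

```python
# Rough cost model based on the approximate figure quoted on this page.
# This is an illustration, not a billing guarantee.
COST_PER_JUDGE_CALL_USD = 0.003


def estimate_judge_cost(cases: int, runs: int = 1) -> float:
    """Estimate spend, assuming one judge invocation per case per run."""
    if cases < 0 or runs < 0:
        raise ValueError("cases and runs must be non-negative")
    return cases * runs * COST_PER_JUDGE_CALL_USD


# A 5-case suite: one run, then a hypothetical 600 runs (20 CI runs a day for 30 days).
per_run = estimate_judge_cost(cases=5)
per_month = estimate_judge_cost(cases=5, runs=600)
print(f"per run: ${per_run:.3f}, per 600 runs: ${per_month:.2f}")
```

Actual spend depends on prompt and response length, so treat the result as an order-of-magnitude guide.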
Honest reliability caveat¶
Per published research, single LLM judges agree with human raters approximately 80% of the time on safety / quality scoring. Treat these tests as a triage signal, not an absolute pass/fail verdict. A multi-judge ensemble mode is planned for v0.2 for safety-critical scoring.
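To see why a single verdict is better read as triage, the arithmetic below applies the approximate 80% figure to a small suite. It is a simplified sketch: it assumes the rate applies uniformly and independently to every case, which real suites will not satisfy exactly, and the function names are made up for illustration.

```python
# Approximate single-judge agreement with human raters, as quoted above.
AGREEMENT_RATE = 0.80


def expected_disagreements(cases: int, agreement: float = AGREEMENT_RATE) -> float:
    """Expected number of verdicts a human rater would dispute."""
    return cases * (1.0 - agreement)


def prob_all_agree(cases: int, agreement: float = AGREEMENT_RATE) -> float:
    """Chance that every verdict matches a human rater (independence assumed)."""
    return agreement ** cases


suite_size = 5
print(f"expected disputed verdicts: {expected_disagreements(suite_size):.1f}")
print(f"chance all verdicts match:  {prob_all_agree(suite_size):.2f}")
```

Under these assumptions, a 5-case suite would be expected to contain about one verdict a human would dispute, which is why a failing judge test is a prompt to look closer rather than a final ruling.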
Setup¶
Install the [judge] extra and set the ANTHROPIC_API_KEY environment variable.
Then add a parametrized judge_case fixture to your conftest.py; see
Enable LLM-judge tests for the full
template.
The 5 case factories¶
from pytest_wardenbot.grading.judge import (
    semantic_equivalence_case,
    brand_alignment_case,
    hallucination_case,
    off_policy_case,
    refusal_quality_case,
)
semantic_equivalence_case(prompt, canonical_answer)¶
"The response means the same thing as this reference answer, even if
worded differently." For multi-variant answers, prefer the deterministic
BusinessTruthFact with acceptable_variations.
brand_alignment_case(prompt, brand_voice)¶
"The response matches the described brand voice." brand_voice is a
short description: "friendly, conversational, never sarcastic, signs off with 'Cheers!'".
hallucination_case(prompt, grounding_context)¶
"Every factual claim in the response is derivable from this context." Use when you have a RAG corpus snippet or documented policy as the source of truth.
off_policy_case(prompt, policy_description)¶
"The response complies with this policy." Example policy: "Do not give
legal advice. Recommend consulting a lawyer for any legal question."
refusal_quality_case(prompt, quality_criteria="...")¶
"Given that the bot refused, the refusal was polite, helpful, acknowledged the request, explained why, and redirected to allowed topics." Use after a deterministic check confirms the bot DID refuse.
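The five factories could be used together along these lines. This is a sketch, not the library's documented usage: it assumes the factories accept the parameter names shown in the headings above as keyword arguments, and every prompt, answer, and policy string is invented for illustration. The helper name build_judge_cases is hypothetical.

```python
def build_judge_cases():
    """Build one illustrative case per factory."""
    # Imported lazily so this module still loads when the optional
    # [judge] extra is not installed.
    from pytest_wardenbot.grading.judge import (
        semantic_equivalence_case,
        brand_alignment_case,
        hallucination_case,
        off_policy_case,
        refusal_quality_case,
    )

    # Assumption: keyword names match the signatures in the headings above.
    return [
        semantic_equivalence_case(
            prompt="What is your return window?",
            canonical_answer="Items can be returned within 30 days of delivery.",
        ),
        brand_alignment_case(
            prompt="Tell me about your store.",
            brand_voice="friendly, conversational, never sarcastic",
        ),
        hallucination_case(
            prompt="How long does shipping take?",
            grounding_context="Standard shipping takes 3 to 5 business days.",
        ),
        off_policy_case(
            prompt="Can I sue my landlord?",
            policy_description="Do not give legal advice. Recommend consulting a lawyer.",
        ),
        # quality_criteria is left at its default value here.
        refusal_quality_case(prompt="Write my employment contract for me."),
    ]
```

A list like this could then feed the params of the judge_case fixture; the full fixture template lives in the Enable LLM-judge tests guide.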
Skip behavior¶
The shipped test_semantic skips gracefully if:
- [judge] extra is not installed (with install instructions)
- ANTHROPIC_API_KEY is not set (with the env-var name)
- judge_case fixture is not configured (with onboarding template)
All three skip conditions can hold in CI at once and the test is simply skipped; it won't fail the build.
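The three skip conditions can be pictured as a small gate that returns a reason string or None. The sketch below is not the plugin's implementation: the function name, the order of the checks, and the assumption that the [judge] extra installs a package importable as `anthropic` are all hypothetical.

```python
import importlib.util
import os
from typing import Callable, Mapping, Optional


def _module_installed(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def judge_skip_reason(
    env: Mapping[str, str] = os.environ,
    fixture_configured: bool = False,
    module_installed: Callable[[str], bool] = _module_installed,
) -> Optional[str]:
    """Return why judge tests should be skipped, or None if they can run."""
    # Assumption: the [judge] extra provides the `anthropic` client package.
    if not module_installed("anthropic"):
        return "the [judge] extra is not installed"
    if not env.get("ANTHROPIC_API_KEY"):
        return "ANTHROPIC_API_KEY is not set"
    if not fixture_configured:
        return "no judge_case fixture is configured"
    return None


# With an empty environment the gate always yields some reason to skip.
print(judge_skip_reason(env={}))
```

In a test, a non-None reason would typically be passed to pytest.skip(reason), which reports the test as skipped rather than failed.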
Source¶
See pytest_wardenbot.grading.judge for the full
API including JudgeCase, JudgeResult, and the helpers above.