pytest-wardenbot¶
Pytest plugin for testing chatbots and LLM apps. Run a curated test suite against your chatbot to catch behavior regressions before your customers do.
Pre-release
v0.1.2 is in active development. APIs may change before the first stable release. See the changelog for what's landed.
Why this exists¶
You added a chatbot to your product. Now you want to know:
- Does it follow known prompt-override patterns?
- Does it ever reveal its system prompt?
- Does it stay polite when users push back?
- Does it remember your real prices, hours, and policies?
- Does it stay on-topic instead of writing essays about quantum physics?
pytest-wardenbot runs those checks as plain pytest tests against your
live chatbot. Failures include a structured Markdown block you paste into
Cursor or Claude Code to fix the issue.
Two-minute tour¶
pip install pytest-wardenbot
pytest --wardenbot-quickstart # generates conftest.py + test_my_bot.py
export CHATBOT_URL=https://your-chatbot.example.com/chat
pytest
That's it. The shipped suite ran. Read the failures, paste the remediation Markdown into your IDE, iterate.
Full quickstart → See what's tested →
What "passing" means (and doesn't)¶
A green run means your chatbot didn't fail any of the bundled 29 attacks in the most overt way. It's a useful smoke test and a regression detector — if a deploy turns a green test red, that's a real signal to investigate.
A green run does not mean your chatbot is secure. Frontier-grade attacks are multi-turn, novel, and adapted to your specific bot — no fixed corpus catches all of them. Treat the shipped suite as a starter set: pair it with periodic red-team exercises (or our Continuous Monitoring service) for the always-on adversarial coverage CI alone can't provide.
How it compares¶
-
vs Promptfoo (acquired by OpenAI in Feb 2026): Promptfoo is a developer testing CLI. We're a pytest plugin — same test runner your existing suite uses, same CI integration you already have. Same idea, different shape.
-
vs DeepEval: DeepEval focuses on evaluation metrics (faithfulness, relevancy). We focus on behavior testing (jailbreak resistance, system-prompt protection, brand drift). We use DeepEval under the hood for our optional LLM-judge tests.
-
vs Garak / PyRIT: Garak and PyRIT are research-grade libraries. We package a curated subset as everyday tests with clear failure messages.
Powered by¶
WardenBot AI — continuous external monitoring for AI chatbots. The pytest plugin is the free, open-source slice of our test corpus. Want continuous monitoring across all your bots with daily probes, a dashboard, and alerts? Tell us about your setup.