I’ve been spending a lot of time with garak lately, both for client work and for my own OSAI prep. Figured it was worth a writeup.
what it is
garak is an LLM vulnerability scanner from NVIDIA (originally written by Leon Derczynski before NVIDIA picked it up). The README pitches it as something like nmap or Metasploit but for LLMs, which is a fair comparison if you squint. You point it at a model, it runs a battery of probes against it, detectors flag anything that looks like a hit, and you get a JSONL report at the end with per-probe pass/fail rates.
The probe categories cover most of what you’d want to test on a model or LLM-backed app: prompt injection (including the original Agency Enterprise PromptInject framework that won best paper at the NeurIPS ML Safety Workshop 2022), jailbreaks (DAN variants), encoding attacks, training data leakage (leakreplay), package hallucination, malware generation, XSS payload generation, toxicity, and a bunch more. Worth listing them all once with --list_probes so you know what’s in the box.
why I use it
Most of my day job is FedRAMP cloud sec work, but a growing chunk is AI/LLM testing. The LLM space moves fast and the threat surface keeps changing. Having a tool that gives me a repeatable, scriptable baseline against any target is huge. A few of the wins:
- It hits a lot of well-known weakness classes in one run, which is a reasonable starting point before I go manual
- Output is JSONL so it’s easy to grep, parse, or feed into a report
- Supports OpenAI, Hugging Face, AWS Bedrock, Replicate, LiteLLM, GGUF (llama.cpp), and basically anything reachable over REST, so it slots into whatever the client is running
- Probes get added and tuned over time, so an old score becomes a worse score on the same model later, which I think is a feature not a bug
It’s also one of the tools I’m leaning on for OSAI prep, since the probe taxonomy maps reasonably well to the LLM01 through LLM10 list from OWASP.
install
I run CachyOS with Fish shell. I keep garak in its own venv to avoid dependency wrestling, since the ML libraries it pulls in are heavy (the FAQ calls out around 9 GB on a typical install).
python -m venv ~/.venvs/garak
source ~/.venvs/garak/bin/activate.fish
pip install -U garak
Or grab the latest from main if you want fresh probes:
pip install -U git+https://github.com/NVIDIA/garak.git@main
Sanity check:
python -m garak --list_probes
You should see the full probe list with stars next to the recommended modules and zzz icons next to the inactive ones. Inactive doesn’t mean broken, it usually just means the full probe set is huge and a “Mini” variant runs by default.
If you’d rather use conda the README has a recipe for that too, but venv is fine and skips the conda tax.
first run against local Ollama
I run Ollama on my Beelink homelab, so it’s the easiest target to point at while I’m developing payloads or just kicking the tires. garak supports Ollama via the ollama target type.
python -m garak --target_type ollama --target_name llama3 --probes encoding
That’ll run the encoding-based prompt injection probes against whatever model you’ve pulled. Ollama exposes its API on localhost:11434 by default, and if you’re hitting it from a different host you can set OLLAMA_HOST in your env.
The encoding probes are a good first shot because they’re fast and the failure mode is easy to read in the report. They feed the model prompt injection payloads encoded in things like base64, ROT13, MIME quoted-printable, hex, etc., and check if the model decodes and acts on them. The original encoding work showed newer models were actually more susceptible than older ones, which is a fun finding to wave at developers who insist their newer model “is just safer.”
pointing at AWS Bedrock
This is the one that actually matters for client work. garak supports Bedrock natively. You need creds in the env (or your usual ~/.aws/credentials profile) and then:
python -m garak --target_type bedrock --target_name anthropic.claude-3-5-sonnet-20240620-v1:0 --probes promptinject
Swap the model ID for whatever’s available in the region you’ve got access to. Just make sure the account you’re using has bedrock:InvokeModel on the model in question, and watch your token spend. A full default run can be thousands of requests per probe family. On a paid model that adds up fast.
reading the report
Every run drops two files in ~/.local/share/garak/garak_runs/ (or wherever you’ve configured reporting.report_dir):
- A
*.report.jsonlwith one entry per probe attempt, including the prompt, the model output, and the detector evaluation - A
*.hitlog.jsonlwith only the attempts that registered as a vulnerability hit
The hitlog is what I usually open first. Each line is a JSON object so a quick jq pipeline gets you the prompts and outputs that actually broke the model:
jq '{probe, prompt, output: .outputs[0]}' garak.<uuid>.hitlog.jsonl
There’s also analyse/analyse_log.py shipped in the repo if you want a more structured rollup.
A note from the FAQ that’s easy to miss: probe scores aren’t normalized and you can’t directly compare scores across probes. Higher pass rate is better within a probe, that’s it. So don’t roll a single number up into a slide, the detail matters.
things I keep in mind
- Don’t trust the score across versions. The probes get harder. A model that scored 95% on
dansix months ago may score lower today against the same model just because the probe set improved - The probes test for inference-time weaknesses, not crypto or memory bugs. The FAQ even calls out CWE-1426 (prompt injection) as the kind of thing they’re after. If you’re trying to find a CVE in pytorch, this is the wrong tool
- Some probes are noisy or model-specific. I tend to start with
encoding,promptinject,dan,leakreplay, andxss, then expand from there
what’s next
I’m working on a custom probe for a specific assertion an internal LLM-backed app makes about its own guardrails. Writing one isn’t bad, it’s just a class extending garak.probes.base plus a list of prompts and a detector. Once I’ve got something I’m happy with I’ll put it up here.
Repo: github.com/NVIDIA/garak