garak – Dotwut.io

I’ve been spending a lot of time with garak lately, both for client work and for my own OSAI prep. Figured it was worth a writeup.

what it is

garak is an LLM vulnerability scanner from NVIDIA (originally written by Leon Derczynski before NVIDIA picked it up). The README pitches it as something like nmap or Metasploit but for LLMs, which is a fair comparison if you squint. You point it at a model, it runs a battery of probes against it, detectors flag anything that looks like a hit, and you get a JSONL report at the end with per-probe pass/fail rates.

The probe categories cover most of what you’d want to test on a model or LLM-backed app: prompt injection (including the original Agency Enterprise PromptInject framework that won best paper at the NeurIPS ML Safety Workshop 2022), jailbreaks (DAN variants), encoding attacks, training data leakage (leakreplay), package hallucination, malware generation, XSS payload generation, toxicity, and a bunch more. Worth listing them all once with --list_probes so you know what’s in the box.

why I use it

Most of my day job is FedRAMP cloud sec work, but a growing chunk is AI/LLM testing. The LLM space moves fast and the threat surface keeps changing. Having a tool that gives me a repeatable, scriptable baseline against any target is huge. A few of the wins:

It hits a lot of well-known weakness classes in one run, which is a reasonable starting point before I go manual
Output is JSONL so it’s easy to grep, parse, or feed into a report
Supports OpenAI, Hugging Face, AWS Bedrock, Replicate, LiteLLM, GGUF (llama.cpp), and basically anything reachable over REST, so it slots into whatever the client is running
Probes get added and tuned over time, so an old score becomes a worse score on the same model later, which I think is a feature not a bug

It’s also one of the tools I’m leaning on for OSAI prep, since the probe taxonomy maps reasonably well to the LLM01 through LLM10 list from OWASP.

install

I run CachyOS with Fish shell. I keep garak in its own venv to avoid dependency wrestling, since the ML libraries it pulls in are heavy (the FAQ calls out around 9 GB on a typical install).

python -m venv ~/.venvs/garak
source ~/.venvs/garak/bin/activate.fish
pip install -U garak

Or grab the latest from main if you want fresh probes:

pip install -U git+https://github.com/NVIDIA/garak.git@main

Sanity check:

python -m garak --list_probes

You should see the full probe list with stars next to the recommended modules and zzz icons next to the inactive ones. Inactive doesn’t mean broken, it usually just means the full probe set is huge and a “Mini” variant runs by default.

If you’d rather use conda the README has a recipe for that too, but venv is fine and skips the conda tax.

first run against local Ollama

I run Ollama on my Beelink homelab, so it’s the easiest target to point at while I’m developing payloads or just kicking the tires. garak supports Ollama via the ollama target type.

python -m garak --target_type ollama --target_name llama3 --probes encoding

That’ll run the encoding-based prompt injection probes against whatever model you’ve pulled. Ollama exposes its API on localhost:11434 by default, and if you’re hitting it from a different host you can set OLLAMA_HOST in your env.

The encoding probes are a good first shot because they’re fast and the failure mode is easy to read in the report. They feed the model prompt injection payloads encoded in things like base64, ROT13, MIME quoted-printable, hex, etc., and check if the model decodes and acts on them. The original encoding work showed newer models were actually more susceptible than older ones, which is a fun finding to wave at developers who insist their newer model “is just safer.”

pointing at AWS Bedrock

This is the one that actually matters for client work. garak supports Bedrock natively. You need creds in the env (or your usual ~/.aws/credentials profile) and then:

python -m garak --target_type bedrock --target_name anthropic.claude-3-5-sonnet-20240620-v1:0 --probes promptinject

Swap the model ID for whatever’s available in the region you’ve got access to. Just make sure the account you’re using has bedrock:InvokeModel on the model in question, and watch your token spend. A full default run can be thousands of requests per probe family. On a paid model that adds up fast.

reading the report

Every run drops two files in ~/.local/share/garak/garak_runs/ (or wherever you’ve configured reporting.report_dir):

A *.report.jsonl with one entry per probe attempt, including the prompt, the model output, and the detector evaluation
A *.hitlog.jsonl with only the attempts that registered as a vulnerability hit

The hitlog is what I usually open first. Each line is a JSON object so a quick jq pipeline gets you the prompts and outputs that actually broke the model:

jq '{probe, prompt, output: .outputs[0]}' garak.<uuid>.hitlog.jsonl

There’s also analyse/analyse_log.py shipped in the repo if you want a more structured rollup.

A note from the FAQ that’s easy to miss: probe scores aren’t normalized and you can’t directly compare scores across probes. Higher pass rate is better within a probe, that’s it. So don’t roll a single number up into a slide, the detail matters.

things I keep in mind

Don’t trust the score across versions. The probes get harder. A model that scored 95% on dan six months ago may score lower today against the same model just because the probe set improved
The probes test for inference-time weaknesses, not crypto or memory bugs. The FAQ even calls out CWE-1426 (prompt injection) as the kind of thing they’re after. If you’re trying to find a CVE in pytorch, this is the wrong tool
Some probes are noisy or model-specific. I tend to start with encoding, promptinject, dan, leakreplay, and xss, then expand from there

what’s next

I’m working on a custom probe for a specific assertion an internal LLM-backed app makes about its own guardrails. Writing one isn’t bad, it’s just a class extending garak.probes.base plus a list of prompts and a detector. Once I’ve got something I’m happy with I’ll put it up here.

Repo: github.com/NVIDIA/garak