After your agent finishes, Pome evaluates its results entirely on your machine. The cloud never sees your prompts, your agent’s output, or your LLM key. The evaluator is the same open-source code whether you are running self-hosted or using the hosted control plane; only where the artifacts end up differs.
Two evaluator modes
[D] — Deterministic criteria
The CLI queries the twin’s final SQLite state with SQL predicates derived from the criterion text. No LLM call is made. The result is immediate and free.
Example criterion and the query it generates:
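For instance, a criterion such as [D] Issue #42 is closed might compile to a predicate like the following. This is a sketch: the issues table and its columns are assumptions about the twin’s schema, and the real query derivation is internal to the CLI.

```sql
-- Hypothetical predicate; actual table and column names depend on the twin's schema.
SELECT COUNT(*) = 1
FROM issues
WHERE number = 42
  AND state = 'closed';
```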
[P] — Probabilistic criteria
The CLI calls your configured OpenAI-compatible LLM endpoint with the criterion text, the twin’s final state, and the agent’s stdout. The judge returns a structured verdict:
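The exact verdict shape is not documented here; the sketch below uses the pass, reason, and confidence fields that score.json records per [P] criterion, with assumed key names:

```json
{
  "pass": true,
  "reason": "The final state shows issue #42 closed with a comment explaining the fix.",
  "confidence": 0.92
}
```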
BYOK — bring your own LLM
Pome never bundles an LLM, never runs a proxy, and never stores your key. You configure the judge endpoint yourself. Supported options:

| Provider | POME_LLM_BASE_URL |
|---|---|
| OpenRouter | https://openrouter.ai/api/v1 |
| OpenAI | https://api.openai.com/v1 |
| Anthropic (via OpenAI-compat gateway) | your gateway URL |
| Ollama (local, air-gapped) | http://localhost:11434/v1 |
| Any OpenAI-compatible endpoint | your endpoint URL |
If the POME_LLM_* variables are not set, the CLI falls back to OPENAI_API_KEY (using api.openai.com) and then to ANTHROPIC_API_KEY (using Anthropic’s native endpoint). If neither is set and a scenario has [P] criteria, those criteria are skipped and logged.
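A minimal configuration sketch. POME_LLM_BASE_URL comes from the table above; the key and model variable names are assumptions about the rest of the POME_LLM_* family:

```bash
# Point the judge at any OpenAI-compatible endpoint (OpenRouter shown here).
export POME_LLM_BASE_URL="https://openrouter.ai/api/v1"

# Assumed names -- only the base URL variable is documented above:
export POME_LLM_API_KEY="<your-key>"
export POME_LLM_MODEL="<your-judge-model>"
```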
Satisfaction score
Each run produces a satisfaction score from 0 to 100: the percentage of criteria that passed. A scenario with four criteria where three pass scores 75. You can gate CI on a minimum score with --pass-threshold:
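Only the flag name is documented above; the pome run invocation below is an assumed sketch of how the gate might look in a CI step:

```bash
# Exit non-zero if any scenario scores below 80.
pome run --pass-threshold 80
```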
Exit code 0 means all scenarios met the threshold; exit code 1 means at least one did not.
Run artifacts
Every run writes a directory of artifacts you can inspect, diff, and export. In self-host mode these land at runs/<scenario-slug>/<iso8601-timestamp>/. In hosted mode the same files are uploaded to S3 and accessible via signed URLs.
| File | Contents |
|---|---|
| score.json | Final satisfaction score, plus per-criterion results with pass/fail, reasons, and confidence for [P] criteria. |
| tool_calls.jsonl | Every HTTP call the agent made to the twin: method, path, request body, response body, latency, fidelity tier, and whether the call mutated state. |
| state_initial.json | Full twin state as loaded from the scenario seed, before the agent ran. |
| state_final.json | Full twin state after the agent finished. Same shape as state_initial.json, so any final state can be used as a seed for replay. |
| meta.json | Scenario name and hash, the model used by the agent, the judge model, timing, and the overall satisfaction score. |
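As an illustration, a score.json for a two-criterion scenario where one criterion passes might look like this; the key names are assumptions, but the fields mirror the table above:

```json
{
  "score": 50,
  "criteria": [
    {
      "text": "Issue #42 is closed",
      "mode": "D",
      "pass": true
    },
    {
      "text": "The closing comment politely explains the fix",
      "mode": "P",
      "pass": false,
      "reason": "The comment explains the fix but reads as curt.",
      "confidence": 0.81
    }
  ]
}
```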
Next steps
- Connect your agent: pass POME_GITHUB_REST_URL to your agent and make your first evaluated run