After your agent finishes, Pome evaluates its results entirely on your machine. The cloud never sees your prompts, your agent’s output, or your LLM key. The evaluator is the same open-source code whether you are running self-hosted or using the hosted control plane; only where the artifacts end up differs.
Two evaluator modes
[D] — Deterministic criteria
The CLI queries the twin’s final SQLite state with SQL predicates derived from the criterion text. No LLM call is made. The result is immediate and free.
Example criterion and the query it generates:
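For instance, a criterion such as [D] Issue #42 is closed might compile to a predicate like the following. This is a sketch: the issues table and its columns are assumptions about the twin’s schema, and the real query derivation is internal to the CLI.

```sql
-- Hypothetical predicate; actual table and column names depend on the twin's schema.
SELECT COUNT(*) = 1
FROM issues
WHERE number = 42
  AND state = 'closed';
```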
[P] — Probabilistic criteria
The CLI calls your configured OpenAI-compatible LLM endpoint with the criterion text, the twin’s final state, and the agent’s stdout. The judge returns a structured verdict:
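The exact verdict shape is not documented here; the sketch below uses the pass, reason, and confidence fields that score.json records per [P] criterion, with assumed key names:

```json
{
  "pass": true,
  "reason": "The final state shows issue #42 closed with a comment explaining the fix.",
  "confidence": 0.92
}
```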
BYOK — bring your own LLM
Pome never bundles an LLM, never runs a proxy, and never stores your key. You configure the judge endpoint yourself. Supported options:

| Provider | POME_LLM_BASE_URL |
|---|---|
| OpenRouter | https://openrouter.ai/api/v1 |
| OpenAI | https://api.openai.com/v1 |
| Anthropic (via OpenAI-compat gateway) | your gateway URL |
| Ollama (local, air-gapped) | http://localhost:11434/v1 |
| Any OpenAI-compatible endpoint | your endpoint URL |
If the POME_LLM_* variables are not set, the CLI falls back to OPENAI_API_KEY (using api.openai.com) and then to ANTHROPIC_API_KEY (using Anthropic’s native endpoint). If neither is set and a scenario has [P] criteria, those criteria are skipped and logged.
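A minimal configuration sketch. POME_LLM_BASE_URL comes from the table above; the key and model variable names are assumptions about the rest of the POME_LLM_* family:

```bash
# Point the judge at any OpenAI-compatible endpoint (OpenRouter shown here).
export POME_LLM_BASE_URL="https://openrouter.ai/api/v1"

# Assumed names -- only the base URL variable is documented above:
export POME_LLM_API_KEY="<your-key>"
export POME_LLM_MODEL="<your-judge-model>"
```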
Satisfaction score
Each run produces a satisfaction score from 0 to 100: the percentage of criteria that passed. A scenario with four criteria where three pass scores 75. You can gate CI on a minimum score with --pass-threshold:
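Only the flag name is documented above; the pome run invocation below is an assumed sketch of how the gate might look in a CI step:

```bash
# Exit non-zero if any scenario scores below 80.
pome run --pass-threshold 80
```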
Exit code 0 means all scenarios met the threshold; exit code 1 means at least one did not.
Run artifacts
Every run writes a directory of artifacts you can inspect, diff, and export. In self-host mode these land at runs/<scenario-slug>/<iso8601-timestamp>/. In hosted mode the same files are uploaded to S3 and accessible via signed URLs.
| File | Contents |
|---|---|
| score.json | Final satisfaction score, plus per-criterion results with pass/fail, reasons, and confidence for [P] criteria. |
| tool_calls.jsonl | Every HTTP call the agent made to the twin: method, path, request body, response body, latency, fidelity tier, and whether the call mutated state. |
| state_initial.json | Full twin state as loaded from the scenario seed, before the agent ran. |
| state_final.json | Full twin state after the agent finished. Same shape as state_initial.json, so any final state can be used as a seed for replay. |
| meta.json | Scenario name and hash, the model used by the agent, the judge model, timing, and the overall satisfaction score. |
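As an illustration, a score.json for a two-criterion scenario where one criterion passes might look like this; the key names are assumptions, but the fields mirror the table above:

```json
{
  "score": 50,
  "criteria": [
    {
      "text": "Issue #42 is closed",
      "mode": "D",
      "pass": true
    },
    {
      "text": "The closing comment politely explains the fix",
      "mode": "P",
      "pass": false,
      "reason": "The comment explains the fix but reads as curt.",
      "confidence": 0.81
    }
  ]
}
```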
Next steps
- Connect your agent: pass POME_GITHUB_REST_URL to your agent and make your first evaluated run