
After your agent finishes, Pome evaluates its results entirely on your machine. The cloud never sees your prompts, your agent’s output, or your LLM key. The evaluator is the same open-source code whether you are running self-hosted or using the hosted control plane — only where the artifacts end up differs.

Two evaluator modes

[D] — Deterministic criteria

The CLI queries the twin’s final SQLite state with SQL predicates derived from the criterion text. No LLM call is made. The result is immediate and free. Example criterion and the query it generates:
[D] Issue #1 has the `bug` label applied
→ SELECT 1 FROM issue_labels WHERE issue_number = 1 AND label_name = 'bug'
A criterion passes if the query returns at least one row. A criterion fails if it returns nothing. There is no ambiguity and no cost.
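The pass/fail rule above can be sketched in a few lines of Python. This is a minimal illustration against an in-memory SQLite database seeded to mimic the twin's final state; the table names follow the example criterion, but the twin's real schema may differ.

```python
import sqlite3

# Seed an in-memory database that stands in for the twin's final state.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issue_labels (issue_number INTEGER, label_name TEXT)")
conn.execute("INSERT INTO issue_labels VALUES (1, 'bug')")

# The same predicate the CLI derives from the criterion text.
row = conn.execute(
    "SELECT 1 FROM issue_labels WHERE issue_number = ? AND label_name = ?",
    (1, "bug"),
).fetchone()

# At least one returned row means the criterion passes.
passed = row is not None
print("passed" if passed else "failed")
```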

[P] — Probabilistic criteria

The CLI calls your configured OpenAI-compatible LLM endpoint with the criterion text, the twin’s final state, and the agent’s stdout. The judge returns a structured verdict:
{
  "passed": true,
  "confidence": 0.94,
  "reasoning": "Mentions both terms in final paragraph."
}
Pome caches judge calls locally by a hash of the criterion, final state, and agent output. If you re-run a scenario with an identical outcome, the cached result is used and no API call is made.
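A cache keyed this way can be sketched as follows. The exact hashing scheme the CLI uses is not specified here; this is just one way to derive a stable key from the criterion, final state, and agent output so that identical outcomes collide.

```python
import hashlib
import json

def judge_cache_key(criterion_text: str, final_state: dict, agent_output: str) -> str:
    # Canonicalize the inputs (sorted keys) so equal content always
    # serializes identically, then hash the result.
    payload = json.dumps(
        {"criterion": criterion_text, "state": final_state, "output": agent_output},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = judge_cache_key("mentions bug and alice", {"issues": 1}, "Labeled the bug.")
k2 = judge_cache_key("mentions bug and alice", {"issues": 1}, "Labeled the bug.")
# Identical outcome -> identical key -> cached verdict, no API call.
print(k1 == k2)
```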

BYOK — bring your own LLM

Pome never bundles an LLM, never runs a proxy, and never stores your key. You configure the judge endpoint yourself. Supported options:
| Provider | `POME_LLM_BASE_URL` |
| --- | --- |
| OpenRouter | https://openrouter.ai/api/v1 |
| OpenAI | https://api.openai.com/v1 |
| Anthropic (via OpenAI-compat gateway) | your gateway URL |
| Ollama (local, air-gapped) | http://localhost:11434/v1 |
| Any OpenAI-compatible endpoint | your endpoint URL |
Configure the judge with environment variables:
export POME_LLM_BASE_URL=https://openrouter.ai/api/v1
export POME_LLM_API_KEY=sk-or-...
export POME_LLM_MODEL=anthropic/claude-haiku-4-5
Auto-detection fallback. If the POME_LLM_* variables are not set, the CLI checks for OPENAI_API_KEY (uses api.openai.com) and then ANTHROPIC_API_KEY (uses Anthropic’s native endpoint). If neither is set and a scenario has [P] criteria, those criteria are skipped and logged.
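The documented precedence can be sketched as a small resolver. The function name and return shape are illustrative, not the CLI's internals; only the order of checks comes from the text above.

```python
def resolve_judge_config(env: dict):
    """Return judge config per the documented precedence, or None."""
    # 1. Explicit POME_LLM_* configuration wins.
    if env.get("POME_LLM_API_KEY"):
        return {"base_url": env.get("POME_LLM_BASE_URL"), "key": env["POME_LLM_API_KEY"]}
    # 2. Fall back to OPENAI_API_KEY against api.openai.com.
    if env.get("OPENAI_API_KEY"):
        return {"base_url": "https://api.openai.com/v1", "key": env["OPENAI_API_KEY"]}
    # 3. Then ANTHROPIC_API_KEY against Anthropic's native endpoint.
    if env.get("ANTHROPIC_API_KEY"):
        return {"base_url": "https://api.anthropic.com", "key": env["ANTHROPIC_API_KEY"]}
    # 4. No judge configured: [P] criteria are skipped and logged.
    return None

cfg = resolve_judge_config({"OPENAI_API_KEY": "sk-example"})
print(cfg["base_url"])
```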

Satisfaction score

Each run produces a satisfaction score from 0 to 100: the percentage of criteria that passed. A scenario with four criteria where three pass scores 75. You can gate CI on a minimum score with --pass-threshold:
pome run scenarios/ --pass-threshold 100
Exit code 0 means all scenarios met the threshold. Exit code 1 means at least one did not.
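The arithmetic is straightforward. A sketch, using a simple list of pass/fail results (field names are illustrative):

```python
def satisfaction_score(results: list) -> int:
    # Percentage of criteria that passed, 0-100.
    passed = sum(1 for r in results if r["passed"])
    return round(100 * passed / len(results))

# Four criteria, three pass -> 75, which fails a --pass-threshold of 100.
score = satisfaction_score(
    [{"passed": True}, {"passed": True}, {"passed": True}, {"passed": False}]
)
print(score)
```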

Run artifacts

Every run writes a directory of artifacts you can inspect, diff, and export. In self-host mode these land at runs/<scenario-slug>/<iso8601-timestamp>/. In hosted mode the same files are uploaded to S3 and accessible via signed URLs.
| File | Contents |
| --- | --- |
| score.json | Final satisfaction score, per-criterion results with pass/fail, reasons, and confidence for [P] criteria. |
| tool_calls.jsonl | Every HTTP call the agent made to the twin — method, path, request body, response body, latency, fidelity tier, and whether the call mutated state. |
| state_initial.json | Full twin state as loaded from the scenario seed, before the agent ran. |
| state_final.json | Full twin state after the agent finished. Same shape as state_initial.json, so any final state can be used as a seed for replay. |
| meta.json | Scenario name and hash, model used by the agent, judge model, timing, and the overall satisfaction score. |
An example score.json:
{
  "satisfaction_score": 75,
  "judge_model": "claude-haiku-4-5",
  "judge_tokens_in": 1842,
  "judge_tokens_out": 96,
  "criteria_results": [
    {
      "criterion": { "type": "D", "text": "Issue #1 has the `bug` label applied" },
      "passed": true,
      "skipped": false,
      "reason": "Matched 1 row in issue_labels where issue_number=1 and label_name='bug'."
    },
    {
      "criterion": { "type": "D", "text": "Issue #1 is assigned to `alice`" },
      "passed": false,
      "skipped": false,
      "reason": "No rows in issue_assignees where issue_number=1 and user_login='alice'."
    },
    {
      "criterion": { "type": "D", "text": "No new labels were created" },
      "passed": true,
      "skipped": false,
      "reason": "Label table unchanged from seed state."
    },
    {
      "criterion": { "type": "P", "text": "The agent's final summary mentions \"bug\" and \"alice\"" },
      "passed": true,
      "skipped": false,
      "reason": "Mentions both terms in final paragraph.",
      "confidence": 0.94,
      "judge_model": "claude-haiku-4-5"
    }
  ]
}
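Because score.json is plain JSON, pulling out the failing criteria takes only a few lines. A sketch using the shape shown above, with the payload inlined rather than read from a runs/ directory:

```python
import json

# A trimmed score.json payload in the documented shape.
raw = """
{
  "satisfaction_score": 75,
  "criteria_results": [
    {"criterion": {"type": "D", "text": "Issue #1 has the `bug` label applied"},
     "passed": true, "skipped": false},
    {"criterion": {"type": "D", "text": "Issue #1 is assigned to `alice`"},
     "passed": false, "skipped": false,
     "reason": "No rows in issue_assignees where issue_number=1 and user_login='alice'."}
  ]
}
"""
score = json.loads(raw)

# Collect the text of every criterion that ran and failed.
failures = [
    r["criterion"]["text"]
    for r in score["criteria_results"]
    if not r["passed"] and not r["skipped"]
]
for text in failures:
    print("FAIL:", text)
```

The same loop works against any run directory by replacing the inline string with the contents of runs/&lt;scenario-slug&gt;/&lt;iso8601-timestamp&gt;/score.json.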

Next steps

  • Connect your agent — pass POME_GITHUB_REST_URL to your agent and make your first evaluated run