How Pome works

Pome tests agents by running them against deterministic digital twins, recording every action they take, and scoring the outcome. The loop is identical whether you drive it from the CLI, a coding-agent skill, or the dashboard.

The loop

Task (seed + prompt + criteria)
        ↓
   pome run
        ↓
  Digital twin boots with seed state
        ↓
  Agent runs (tool calls hit the digital twin)
        ↓
  Pome records events + exports digital twin state
        ↓
  Judge scores [code] and [model] criteria
        ↓
  Artifacts + dashboard URL

Digital Twins

A digital twin is an in-process (or hosted) simulation of a real SaaS API. It reproduces the exact response shapes and error semantics of the live service without ever calling it.

| Digital Twin | Config | What agents exercise |
| --- | --- | --- |
| GitHub | `twins: ["github"]` | Issues, PRs, labels, comments, commit status |
| Stripe | `twins: ["stripe"]` | PaymentIntents, refunds, x402 paywalls |
| Slack | `twins: ["slack"]` | Channels, messages, threads, reactions |
| Gmail | `twins: ["gmail"]` | Messages, threads, labels, drafts |
| Linear | `twins: ["linear"]` | GraphQL issues, comments, labels, projects |

Each run starts from the task’s seed state — a JSON block that describes the digital twin’s initial world (repos, charges, channels, etc.). The digital twin resets between runs so the same task produces the same baseline every time. See the Digital Twins reference for per-service coverage, env vars, and bundled tasks.
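The reset behavior can be pictured with a toy model. This is a minimal sketch and not Pome's implementation: the `ToyTwin` class, its methods, and the shape of the seed JSON are all hypothetical and exist only to show why loading the same seed gives the same baseline.

```python
import copy
import json


class ToyTwin:
    """Hypothetical stand-in for a digital twin (not Pome's real code)."""

    def __init__(self, seed_state: dict):
        # Keep a private copy of the seed so later mutations cannot leak into it.
        self._seed = copy.deepcopy(seed_state)
        self.state = copy.deepcopy(seed_state)

    def reset(self) -> None:
        """Restore the world to the seed state, as happens between runs."""
        self.state = copy.deepcopy(self._seed)

    def add_label(self, issue_number: int, label: str) -> None:
        """Simulate one agent tool call that mutates twin state."""
        issue = self.state["issues"][str(issue_number)]
        if label not in issue["labels"]:
            issue["labels"].append(label)

    def export_state(self) -> str:
        """Serialize state with sorted keys so equal states compare equal."""
        return json.dumps(self.state, sort_keys=True)


# Assumed seed shape, for illustration only.
SEED = {"issues": {"42": {"title": "Crash on login", "labels": []}}}

twin = ToyTwin(SEED)
baseline = twin.export_state()   # state before the agent acts
twin.add_label(42, "bug")
mutated = twin.export_state()    # state after the agent acts
twin.reset()                     # next run starts from the seed again
```

After `reset()`, `twin.export_state()` matches `baseline` again, which is the property that makes scores comparable across runs.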

Tasks

A task is a markdown file with three parts:

## Seed State — JSON the digital twin loads before the agent starts.
## Prompt — passed to the agent as POME_TASK.
Acceptance criteria — [code] deterministic checks on digital twin state or events; [model] probabilistic checks judged by an LLM.

## Config
twins: ["github"]
timeout: 60
passThreshold: 100
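To make the task layout concrete, here is a rough sketch of how such a file could be split into sections and how criteria could be grouped by their `[code]` or `[model]` tag. It does not reflect Pome's actual parser. The sample task text, the `## Acceptance Criteria` heading name, and the helper functions `split_sections` and `parse_criteria` are assumptions made for illustration.

```python
import re

# Hypothetical task file; the exact heading names Pome expects may differ.
TASK_MD = """\
## Seed State
{"issues": {"42": {"labels": []}}}

## Prompt
Label issue #42 as a bug.

## Acceptance Criteria
- [code] issue #42 has label bug
- [model] the agent explained why the label applies
"""


def split_sections(markdown: str) -> dict[str, str]:
    """Group lines under each level-2 heading, keyed by heading text."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def parse_criteria(body: str) -> list[tuple[str, str]]:
    """Return (kind, text) pairs for lines tagged [code] or [model]."""
    pattern = re.compile(r"^\s*[-*]?\s*\[(code|model)\]\s*(.+)$")
    criteria = []
    for line in body.splitlines():
        match = pattern.match(line)
        if match:
            criteria.append((match.group(1), match.group(2).strip()))
    return criteria


sections = split_sections(TASK_MD)
criteria = parse_criteria(sections["Acceptance Criteria"])
```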

Browse and copy bundled tasks:

pome tasks github --copy

Runs

pome run <task.md> boots the matching digital twin, spawns your agent with injected environment variables (POME_<TWIN>_REST_URL, POME_AUTH_TOKEN, etc.), waits for the agent to exit, then scores the result. Runs are hosted by default and record to app.pome.sh. Set POME_LOCAL=1 to run against in-process local digital twins instead; this mode is engineer-only.
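On the agent side, picking up the injected variables might look like the sketch below. It assumes that `<TWIN>` expands to the upper-cased twin name (for example `POME_GITHUB_REST_URL`), which the text above implies but does not state outright. The helper names `twin_env_name` and `read_pome_env` are hypothetical.

```python
import os


def twin_env_name(twin: str) -> str:
    """Build the REST URL variable name, assuming <TWIN> is upper-cased."""
    return f"POME_{twin.upper()}_REST_URL"


def read_pome_env(twin: str, environ=None) -> dict:
    """Collect the variables an agent needs, failing loudly if any is absent.

    `environ` defaults to os.environ; passing a plain dict makes this testable.
    """
    env = os.environ if environ is None else environ
    url_name = twin_env_name(twin)
    required = (url_name, "POME_AUTH_TOKEN", "POME_TASK")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"missing Pome variables: {', '.join(missing)}")
    return {
        "base_url": env[url_name].rstrip("/"),  # normalize trailing slash
        "token": env["POME_AUTH_TOKEN"],
        "task": env["POME_TASK"],
    }
```

An agent would call `read_pome_env("github")` at startup and point its GitHub client at `base_url` rather than at the live API.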

Artifacts

Every run writes a directory at runs/<task>/<run-id>/:

| File | Contents |
| --- | --- |
| `events.jsonl` | Canonical event stream: `TwinHttpEvent`, `LlmCallEvent`, `RunStartedEvent`, etc. |
| `score.json` | Per-criterion verdicts and aggregate satisfaction (0–100). |
| `state_initial.json` / `state_final.json` | Digital twin state before and after the agent. |

tool_calls.jsonl is also written for older clients. New integrations should read events.jsonl.
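Because `events.jsonl` holds one JSON object per line, a quick summary of a run can be produced with a few lines of Python. The sketch below assumes each event carries its kind in a `type` field. That field name is a guess, not a documented schema, so it is exposed as a parameter.

```python
import json
from collections import Counter
from typing import Iterable


def count_event_types(lines: Iterable[str], type_field: str = "type") -> Counter:
    """Tally events by kind from JSONL lines.

    `type_field` is an assumed key name; adjust it to the real event schema.
    """
    counts: Counter = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        event = json.loads(raw)
        counts[event.get(type_field, "<unknown>")] += 1
    return counts
```

To use it on a real run, open the `events.jsonl` file in the run directory and pass the file object directly, since iterating a file yields its lines.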

Hosted runs also get a dashboard URL with the full trace, score breakdown, and judge handoff for failures.

Scoring

[code] deterministic — Pome checks digital twin state or recorded events directly. Example: “issue #42 has label bug” or “no message containing the secret appears in a public channel.”
[model] probabilistic — An LLM judge reads the trace and evaluates judgment criteria. Example: “the agent recognized the label was contextually wrong.”

A run passes when satisfaction ≥ passThreshold (default 100).
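The pass rule itself is a single comparison. The sketch below also shows one possible way to aggregate verdicts, assuming satisfaction is the equally weighted percentage of criteria that passed. Pome's real aggregation may weight criteria differently, so treat `satisfaction` as illustrative.

```python
def satisfaction(verdicts: list[bool]) -> float:
    """Assumed aggregation: percentage of criteria that passed (0 to 100)."""
    if not verdicts:
        raise ValueError("at least one criterion is required")
    return 100.0 * sum(verdicts) / len(verdicts)


def run_passes(satisfaction_score: float, pass_threshold: float = 100) -> bool:
    """A run passes when satisfaction >= passThreshold (default 100)."""
    return satisfaction_score >= pass_threshold


# Three of four criteria met: fails at the default threshold,
# passes if the task sets passThreshold to 75.
score = satisfaction([True, True, False, True])
```

With the default threshold of 100, a single failed criterion fails the run under this aggregation, which is why lowering `passThreshold` is a deliberate per-task choice.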

When to run pome

Run pome before merging changes that affect agent behavior:

Prompt or system-instruction edits
New or modified tools
Model swaps
Digital twin integration changes (new API surface your agent calls)

Compare scores across runs. The trace shows exactly which tool calls changed and which criteria flipped.
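Spotting flipped criteria between two runs can be scripted once the verdicts are loaded. The sketch below assumes you have already reduced each `score.json` to a mapping from criterion text to a pass/fail boolean. The layout of `score.json` is not specified here, so that extraction step is left out and `flipped_criteria` is a hypothetical helper.

```python
def flipped_criteria(
    before: dict[str, bool], after: dict[str, bool]
) -> dict[str, tuple[bool, bool]]:
    """Return criteria present in both runs whose verdict changed.

    Values are (before, after) pairs; criteria in only one run are ignored.
    """
    shared = sorted(before.keys() & after.keys())
    return {
        name: (before[name], after[name])
        for name in shared
        if before[name] != after[name]
    }


# Hypothetical verdicts from a baseline run and a run after a prompt edit.
baseline_run = {"issue #42 has label bug": True, "no secret in public channel": True}
candidate_run = {"issue #42 has label bug": True, "no secret in public channel": False}
regressions = flipped_criteria(baseline_run, candidate_run)
```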

Next

Quickstart

Install and complete your first run.

Dashboard

Where runs, sessions, and judge feedback live.

pome run

CLI reference for running tasks.

/test-with-pome

Run tasks from your coding agent with /pome-test.
