Scenarios: define agent tests in Markdown

A Pome scenario is a plain Markdown file. It is human-readable enough for a product manager to review and machine-parseable enough for the Pome CLI to execute without any configuration beyond the file itself. Each scenario defines the world state your agent starts from, the task it should perform, and the criteria Pome uses to decide whether it passed.

Anatomy of a scenario file

# Triage open issues in acme/api

## Setup
A repository acme/api with labels bug, feature, and question. Collaborators
alice (owns orders) and bob (owns auth). Open issue #1: "500 error on
POST /orders after deploy".

```json
{
  "repositories": [
    {
      "owner": "acme",
      "name": "api",
      "labels": [
        { "name": "bug",      "color": "d73a4a" },
        { "name": "feature",  "color": "0e8a16" },
        { "name": "question", "color": "cccccc" }
      ],
      "collaborators": ["alice", "bob"],
      "issues": [
        {
          "number": 1,
          "title": "500 error on POST /orders after deploy",
          "body": "Started failing right after the 14:00 deploy. Stack trace points to OrderController#create.",
          "state": "open",
          "labels": [],
          "assignees": []
        }
      ]
    }
  ]
}
```

## Prompt
Triage issue #1 in acme/api.

## Expected Behavior
The agent reads the issue, recognizes it as a bug, applies the bug label,
assigns alice (orders area), and stops.

## Success Criteria
- [D] Issue #1 has the `bug` label applied
- [D] Issue #1 is assigned to `alice`
- [D] No new labels were created
- [P] The agent's final summary mentions "bug" and "alice"

## Config
twins: github
timeout: 60
runs: 1
judge: claude-haiku-4-5

Sections

| Section | Required | Purpose |
| --- | --- | --- |
| `# Title` | Yes | Becomes the scenario name in run artifacts and the dashboard. |
| `## Setup` | No | Seeds the twin’s SQLite state before the agent runs. The JSON block is what the parser loads; the surrounding prose is for humans only. |
| `## Prompt` | Yes | The task handed to your agent via the `POME_TASK` environment variable. |
| `## Expected Behavior` | No | Human-readable description for reviewers. Never sent to the agent; available to the evaluator as context. |
| `## Success Criteria` | Yes (≥ 1) | One criterion per line, prefixed with `[D]` or `[P]`. |
| `## Config` | No | Per-scenario settings as bare `key: value` pairs or a fenced YAML block. |
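
To make the section layout concrete, here is a minimal sketch of how a scenario file could be split into its title and `##` sections, with the required ones checked. It is an illustration only, not the Pome CLI's actual parser; the function names `split_sections`, `parse_criteria`, and `validate` are hypothetical.

```python
import re

# Matches a criterion line such as "- [D] Issue #1 is assigned to `alice`".
CRITERION = re.compile(r"^- \[(D|P)\]\s+(.+)$")


def split_sections(markdown: str) -> dict:
    """Split a scenario file into its title and its `## ` sections."""
    title = None
    sections = {}
    current = None
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith("```"):
            in_fence = not in_fence  # ignore heading-like lines inside fences
        if not in_fence and title is None and line.startswith("# "):
            title = line[2:].strip()
            continue
        if not in_fence and line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
            continue
        if current is not None:
            sections[current].append(line)
    bodies = {name: "\n".join(lines).strip() for name, lines in sections.items()}
    return {"title": title, "sections": bodies}


def parse_criteria(body: str) -> list:
    """Return (kind, text) pairs, where kind is "D" or "P"."""
    pairs = []
    for line in body.splitlines():
        match = CRITERION.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def validate(parsed: dict) -> list:
    """List the problems that would make a scenario file incomplete."""
    problems = []
    if not parsed["title"]:
        problems.append("missing # Title")
    if not parsed["sections"].get("Prompt"):
        problems.append("missing ## Prompt")
    if not parse_criteria(parsed["sections"].get("Success Criteria", "")):
        problems.append("## Success Criteria needs at least one [D] or [P] line")
    return problems
```

Calling `validate(split_sections(text))` on the example file above should return an empty list, since it has a title, a prompt, and four criteria.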

Criteria types

`[D]` — Deterministic

Deterministic criteria are evaluated by querying the twin’s final SQLite state directly. No LLM call is made. The CLI matches the criterion text against a library of SQL predicates:

- `Issue #N has the <label> label applied` → queries `issue_labels`
- `Issue #N is assigned to <user>` → queries `issue_assignees`
- `No new labels were created` → diffs the `labels` table between seed and final state

Deterministic checks make no LLM call, so they add no model cost and return the same result every time for the same final state.
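
The pattern-to-query idea can be sketched as follows. This is a hedged illustration, not Pome's implementation: the table names `issue_labels` and `issue_assignees` come from the list above, but the column names (`issue_number`, `label`, `login`), the regular expressions, and the `check` helper are assumptions made for the example.

```python
import re
import sqlite3

# Each entry pairs a criterion pattern with a SQL query returning a row count.
# Column names are assumed for illustration; the real twin schema may differ.
PREDICATES = [
    (re.compile(r"^Issue #(\d+) has the `?([\w-]+)`? label applied$"),
     "SELECT COUNT(*) FROM issue_labels WHERE issue_number = ? AND label = ?"),
    (re.compile(r"^Issue #(\d+) is assigned to `?([\w-]+)`?$"),
     "SELECT COUNT(*) FROM issue_assignees WHERE issue_number = ? AND login = ?"),
]


def check(conn: sqlite3.Connection, criterion: str):
    """Return True or False for a matched pattern, None when nothing matches."""
    for pattern, sql in PREDICATES:
        match = pattern.match(criterion.strip())
        if match:
            number, value = int(match.group(1)), match.group(2)
            (count,) = conn.execute(sql, (number, value)).fetchone()
            return count > 0
    return None


# A tiny in-memory stand-in for the twin's final state.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE issue_labels (issue_number INTEGER, label TEXT);
CREATE TABLE issue_assignees (issue_number INTEGER, login TEXT);
INSERT INTO issue_labels VALUES (1, 'bug');
INSERT INTO issue_assignees VALUES (1, 'alice');
""")

print(check(conn, "Issue #1 has the `bug` label applied"))  # True
print(check(conn, "Issue #1 is assigned to `bob`"))         # False
print(check(conn, "The summary is polite"))                 # None
```

Returning `None` for unmatched text keeps "no pattern applies" distinct from "the check failed", which is the distinction a caller needs before deciding how else to evaluate the criterion.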

`[P]` — Probabilistic

Probabilistic criteria are evaluated by an LLM judge. You supply your own key (BYOK); Pome never holds or proxies your LLM credentials. The judge receives the criterion text, the final twin state, and the agent’s stdout, and returns a verdict with three parts: passed, confidence, and reasoning.

If a `[D]` criterion’s text doesn’t match any known SQL pattern, the CLI automatically falls back to `[P]` mode and logs a notice so you can sharpen the wording.
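
The judge's inputs and verdict can be pictured with a small sketch. Everything about the wire format here is an assumption: the JSON field names, the 0 to 1 range for confidence, and the helpers `build_judge_input` and `parse_verdict` are hypothetical, and no LLM is actually called.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    confidence: float  # assumed to lie in [0, 1]
    reasoning: str


def build_judge_input(criterion: str, twin_state: dict, stdout: str) -> str:
    """Bundle the three inputs the judge receives into one JSON payload."""
    payload = {
        "criterion": criterion,
        "final_state": twin_state,
        "agent_stdout": stdout,
    }
    return json.dumps(payload, indent=2)


def parse_verdict(raw: str) -> Verdict:
    """Parse and sanity-check a judge reply (the JSON shape is assumed)."""
    data = json.loads(raw)
    if not isinstance(data["passed"], bool):
        raise ValueError("passed must be true or false")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return Verdict(data["passed"], confidence, str(data["reasoning"]))


reply = '{"passed": true, "confidence": 0.9, "reasoning": "Summary names bug and alice."}'
print(parse_verdict(reply))
```

Validating the reply before trusting it matters more for `[P]` criteria than for `[D]` ones, because a model can return malformed or out-of-range values.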

Config keys

| Key | Type | Default | Notes |
| --- | --- | --- | --- |
| `twins` | string or string[] | inferred from `.pome.json` | Which twin(s) this scenario requires. |
| `timeout` | number (seconds) | `180` | Per-run timeout before the agent is killed. |
| `runs` | number | `1` | How many times to repeat the scenario. Useful for building confidence in `[P]` criteria. |
| `seed` | string | `"default"` | Named seed variant. `"default"` uses the JSON in `## Setup`. |
| `judge` | string | env default | Judge model identifier passed as-is to your configured LLM endpoint (e.g., `gpt-4o-mini`, `claude-haiku-4-5`, `anthropic/claude-haiku-4-5` for OpenRouter). |
| `tags` | string[] | `[]` | Arbitrary tags for filtering runs with `--tag`. |
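
As a rough sketch of how bare `key: value` pairs might be read and merged with the defaults in the table, consider the following. It is not the CLI's parser: `parse_config` is a hypothetical helper, the comma-separated syntax for list values is an assumption, and the `twins` and `judge` defaults are left out because they come from `.pome.json` and the environment.

```python
# Defaults taken from the table above; twins and judge are resolved elsewhere.
DEFAULTS = {"timeout": 180, "runs": 1, "seed": "default", "tags": []}

NUMERIC_KEYS = ("timeout", "runs")
LIST_KEYS = ("twins", "tags")


def parse_config(body: str) -> dict:
    """Parse bare `key: value` lines and fill in defaults for missing keys."""
    # Copy list defaults so callers never share a mutable default.
    config = {k: (list(v) if isinstance(v, list) else v) for k, v in DEFAULTS.items()}
    for line in body.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key in NUMERIC_KEYS:
            config[key] = int(value)
        elif key in LIST_KEYS:
            # Assumed syntax: a single name or a comma-separated list.
            config[key] = [item.strip() for item in value.split(",") if item.strip()]
        else:
            config[key] = value
    return config


example = """twins: github
timeout: 60
runs: 1
judge: claude-haiku-4-5"""
print(parse_config(example))
```

With the `## Config` block from the example scenario, the result carries `timeout` 60 from the file and `seed` `"default"` from the defaults.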

Use `[D]` criteria wherever the success condition can be expressed as a fact about the final state. They are faster and cheaper than `[P]` criteria, and a `[D]` check never flips between runs that end in the same final state.

Next steps

Write your first scenario: a step-by-step guide with working examples
