
Introduction

Pome is simulation-testing infrastructure for AI agents and ML models. It lets you test and validate agent behavior safely before deploying to real users. Through deterministic simulations, Pome provides the visibility needed to move autonomous agents from experimental prototypes to production-ready tools.

Simulated Evaluations

To evaluate how agents interact with real-world ecosystems, Pome uses stateful digital twins: sandboxed emulations of live third-party APIs. These twins answer the same REST and MCP calls as the production systems, so you can run end-to-end tests, evaluate agent behavior, or train ML models without touching live services. During these evaluations, Pome automatically runs scenarios to identify where agents fail and how they can be improved. A scenario is defined by three inputs:
  • Starting world state — The twin’s seeded baseline configuration (e.g., existing repositories, charges, or communication channels).
  • Task to complete — The specific prompt or objective the agent receives.
  • Connected services — The external API twins (GitHub, Stripe, Slack, etc.) the agent must interact with.
By varying these inputs, Pome surfaces edge cases that traditional testing frameworks miss. Deterministic runs and detailed observability make agent failures easier to reproduce and diagnose. Monitor live runs, traces, and evaluation scores on the dashboard at pome.sh.
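
To make the three scenario inputs concrete, here is a minimal Python sketch of how they might be grouped and varied. It is illustrative only and is not Pome's API: the `Scenario` class, the `vary` helper, the field names, and the example state and task values are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Scenario:
    """Hypothetical container for the three scenario inputs (not Pome's API)."""
    starting_state: dict  # seeded baseline for the twins
    task: str             # prompt or objective handed to the agent
    services: tuple       # names of the twins the agent may call


def vary(states, tasks, service_sets):
    """Build one scenario per combination of the three inputs."""
    return [
        Scenario(starting_state=s, task=t, services=tuple(v))
        for s, t, v in product(states, tasks, service_sets)
    ]


# Illustrative values only: two starting states, one task, two service sets.
scenarios = vary(
    states=[{"repos": ["demo"]}, {"repos": []}],
    tasks=["Open an issue describing the failed charge"],
    service_sets=[("github",), ("github", "stripe")],
)
print(len(scenarios))  # 4 combinations: 2 states x 1 task x 2 service sets
```

Changing one input at a time, such as starting from an empty repository list, is the kind of variation that can expose behavior a single fixed test case would not exercise.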

How to start

Quickstart

Install, log in, install skills, and complete a first scored run in minutes.

How Pome works

Twins, scenarios, runs, scoring, and artifacts explained end to end.

/setup

Wire your coding agent to Pome with /pome-setup.