Determinism in AI Agents: Why cache_id Is the Missing Feature
Stochastic tools make evals, retries and audit nearly impossible. How cache_id turns an agent run into something you can replay, test and defend.
The eval problem nobody talks about
You’ve written an eval suite for your agent. It passes on Monday, fails on Tuesday, passes on Wednesday, and you haven’t changed a line of code. Welcome to the silent crisis of production agents: the tools underneath them are non-deterministic, and most teams haven’t noticed because they blame the model.
The fix isn’t a better eval framework. It’s making your tools deterministic, and exposing that determinism through a cache_id the agent and the harness can both see. It’s a small API decision with outsized consequences for testing, debugging and compliance.
Why stochastic tools make evals nearly impossible
Consider a lead-scoring agent that calls three tools: company lookup, intent signal, and news/risk. Each tool hits a different upstream source. On run A, the company lookup returns 12 signals. On run B, ten minutes later, it returns 13 — a new press release dropped.
The agent’s final score changes. Your eval harness flags a regression. A week of engineer time goes into finding a bug that doesn’t exist. The only real change was upstream.
This is the default state of scraping-based and search-based agents: every run is a new experiment. You cannot bisect a regression because the independent variable isn’t under your control. You cannot compare two prompts because they never saw the same data. You cannot replay a customer’s failure because the world has moved on.
The people who build LLM evals professionally — Braintrust, LangSmith, the Anthropic evals team — all arrive at the same conclusion: freeze the tool outputs, or don’t bother evaluating the agent.
What cache_id is, mechanically
A cache_id is a stable hash of the tuple (tool_name, canonical_inputs, data_window). For FreshGeo, it looks like this:
prc_gb_lnd_gas_2026w17_a81f3c
│   │      │   │       │
│   │      │   │       └── hash of the underlying data snapshot
│   │      │   └────────── ISO week of the data window
│   │      └────────────── subject (gas)
│   └───────────────────── region (GB-LND)
└───────────────────────── tool family (pricing)
Within a data window — typically one ISO week for pricing, fifteen minutes for news, a day for jobs — identical inputs produce an identical cache_id and a byte-identical response. When the window rolls or the underlying data changes, the hash changes, and you know something moved.
The important property is not the caching. It’s the identity. Two agents, two weeks apart, asking the same question can compare cache_ids and know instantly whether they’re reasoning over the same facts.
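The identity property can be sketched in a few lines. Everything here is illustrative (FreshGeo's real hashing scheme is not public): `fnv1a` is a tiny stand-in for a proper cryptographic hash, and `canonicalise` shows why key order must not change the id.

```typescript
// Sketch only: a cache_id as a stable function of (tool, canonical inputs, window).
// FNV-1a stands in for a real cryptographic hash; the id layout is an assumption.
function fnv1a(s: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, '0');
}

// Canonicalise recursively so {a, b} and {b, a} produce the same bytes.
function canonicalise(v: unknown): string {
  if (v === null || typeof v !== 'object') return JSON.stringify(v);
  if (Array.isArray(v)) return `[${v.map(canonicalise).join(',')}]`;
  const keys = Object.keys(v as object).sort();
  return `{${keys.map(k => `${JSON.stringify(k)}:${canonicalise((v as any)[k])}`).join(',')}}`;
}

function cacheId(tool: string, inputs: Record<string, unknown>, window: string): string {
  const hash = fnv1a(`${tool}|${canonicalise(inputs)}|${window}`);
  return `${tool}_${window}_${hash.slice(0, 6)}`;
}

// Same inputs, same window: identical id. Roll the window: the id changes.
const a = cacheId('prc', { region: 'GB-LND', subject: 'gas' }, '2026w17');
const b = cacheId('prc', { subject: 'gas', region: 'GB-LND' }, '2026w17');
```

Here `a === b` even though the input keys arrived in a different order, which is exactly the property that lets two agents compare ids instead of payloads.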
What this unlocks
Three things, in order of how much they’ll change your life.
Replayable runs
Log the cache_id of every tool call alongside the agent’s reasoning trace. To replay a run — in an eval, in a postmortem, in a regulator’s office — you re-execute against the same cache_ids and get byte-identical tool outputs. The only variable left is the model, which is exactly where you want your variance.
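A useful corollary for the eval harness: before diffing two runs, check they were grounded on the same cache_ids. A minimal sketch, assuming a logged trace shaped like the one above:

```typescript
// Sketch (trace shape assumed): if two eval runs did not see the same
// cache_ids, a score difference may be the data moving, not a regression.
type ToolCall = { tool: string; cache_id: string };

function sameFacts(a: ToolCall[], b: ToolCall[]): boolean {
  const ids = (calls: ToolCall[]) =>
    calls.map(c => `${c.tool}:${c.cache_id}`).sort().join('|');
  return ids(a) === ids(b);
}
```

A harness can gate score comparison on this check, and re-pin or skip the case when it returns false.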
Production retries without flapping
When a downstream step fails and you retry the agent, you can either re-use the cached tool outputs (for speed) or force a refresh (for freshness). Without cache_id, a retry is a roll of the dice — you might get a different answer purely because a scraper ran again.
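That reuse-or-refresh choice can be made explicit as a retry policy. The sketch below is synchronous for brevity (real clients are async), and the `refresh` flag and callback shape are assumptions about what a replay-aware client might look like, not a documented API:

```typescript
// Sketch: retry a step under an explicit policy instead of rolling the dice.
type Envelope = { cache_id: string; data: unknown };

function retryStep(
  callTool: (opts: { refresh: boolean }) => Envelope, // real clients are async
  lastCacheId: string | null,
  policy: 'reuse' | 'refresh',
): Envelope {
  const out = callTool({ refresh: policy === 'refresh' });
  if (policy === 'reuse' && lastCacheId !== null && out.cache_id !== lastCacheId) {
    // The window rolled under us: this retry is no longer comparing like with like.
    throw new Error(`data moved during retry: ${lastCacheId} -> ${out.cache_id}`);
  }
  return out;
}
```

The point of the thrown error is that a silent id change during a `reuse` retry is exactly the flapping the section describes; surfacing it turns a mystery into a log line.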
Compliance-grade audit
Regulated customers — financial services, healthcare, legal — need to prove what an agent saw at the moment it made a decision. A cache_id plus a signed evidence URL gives you that, years later, with a one-line lookup.
A worked replay
Here’s a real pattern from one of our customers, lightly anonymised. Their underwriting agent flagged a UK SME as high-risk. The customer disputed. Three weeks later, the team needs to show exactly what the agent saw.
// 1. Pull the agent's reasoning trace from the audit log
const run = await auditLog.get('run_2026_04_03_17a3b');
// run.tool_calls => [
//   { tool: 'freshgeo.company.lookup', cache_id: 'cmp_12345678_2026w14_9f2a' },
//   { tool: 'freshgeo.news.risk', cache_id: 'nrk_12345678_2026w14_d41b' },
// ]

// 2. Replay each tool call by cache_id — returns the exact bytes the agent saw
const replayed = await Promise.all(
  run.tool_calls.map(tc => freshgeo.replay(tc.cache_id))
);

// 3. Feed the replayed outputs back into the same model + prompt.
//    Model output is now the only variable.
In this specific case, the replay showed the agent had seen a county court judgment filed three days before the decision — which had since been set aside. The agent wasn’t wrong given what it saw. The customer accepted that.
FreshGeo’s response envelope
Every tool, every call, returns the same envelope. This is the contract we care about most.
type Response<T> = {
  cache_id: string;        // stable replay handle
  as_of: string;           // ISO timestamp of the data window
  data_window: string;     // e.g. '2026-W17', 'PT15M', 'P1D'
  data: T;                 // typed, tool-specific payload
  evidence_url: string;    // signed URL, valid 90 days
  sources: Source[];       // publisher + URL + fetched_at per field
  confidence: number;      // aggregate, 0..1
  replay_endpoint: string; // GET this with cache_id to replay
};
Two fields do the heavy lifting. cache_id is what you log. replay_endpoint is what your eval harness or postmortem tool hits. Everything else is the answer.
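To make "cache_id is what you log" concrete, here is a minimal logging-side sketch. The `Response` type mirrors the envelope above; `auditEntry` and its output shape are illustrative assumptions, not part of the API:

```typescript
// Sketch: record only what replay needs from each envelope,
// the stable handle and where to redeem it.
type Source = { publisher: string; url: string; fetched_at: string };
type Response<T> = {
  cache_id: string;
  as_of: string;
  data_window: string;
  data: T;
  evidence_url: string;
  sources: Source[];
  confidence: number;
  replay_endpoint: string;
};

function auditEntry<T>(tool: string, res: Response<T>) {
  return { tool, cache_id: res.cache_id, replay_endpoint: res.replay_endpoint };
}
```

Persist one such entry per tool call alongside the reasoning trace and the replay in the previous section becomes a lookup, not an archaeology project.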
What to do on Monday
If you’re building an agent and your tools don’t return a cache key of some kind, you have two options. Add one — wrap your tools in a layer that hashes inputs and freezes outputs for a sensible window. Or use a grounding provider that does this natively.
Either way, the test is simple: can you re-run last week’s failing trace and get the same tool outputs? If the answer is no, you don’t have an eval problem. You have a determinism problem wearing an eval problem’s clothes.
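The "wrap your tools" option can be sketched in a few lines. This is a minimal in-memory version under stated assumptions (inputs already canonicalised to a string, a naive wall-clock window, no persistence), not a production implementation:

```typescript
// Sketch: freeze each tool's output per (input, window), so repeated calls
// inside a window return the same bytes and the same cache_id.
function currentWindow(windowMs: number): string {
  return String(Math.floor(Date.now() / windowMs));
}

function deterministic<T>(
  tool: (input: string) => T,
  windowMs: number,
  cache: Map<string, T> = new Map(),
): (input: string) => { cache_id: string; data: T } {
  return (input: string) => {
    const cache_id = `${currentWindow(windowMs)}:${input}`;
    if (!cache.has(cache_id)) cache.set(cache_id, tool(input)); // first call freezes the output
    return { cache_id, data: cache.get(cache_id)! };
  };
}
```

In production you would hash the key, persist the cache, and align windows to calendar boundaries (ISO weeks, fifteen-minute buckets) rather than raw wall-clock division, but the contract is the same: one window, one answer.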
We think this is the single most underrated piece of agent infrastructure in 2026. It’s unglamorous. It’s not on any model card. And it’s the difference between an agent you can ship to regulated customers and one you can’t. See how it fits into the wider FreshGeo surface, or look at pricing if you’re ready to plug it in.