GURUBOXZ

Case study · Field Note (stealth) · Developer tools / AI · 2025

Turning an LLM demo into a system that survived contact.

Engagement: 11 weeks
Team: 2 engineers (AI specialists)
Year: 2025
Sector: Developer tools / AI
The problem

Field Note had a working agent in the demo environment and zero confidence in production. Every change to a prompt risked regressions nobody could measure. The team wanted to ship a public beta in 14 weeks; they had no eval, no guardrails, no audit, and no way to know if a regression had happened until a user complained on X.

Outcome
  • Eval test cases: 0 → 312
  • Regressions caught pre-merge: 9 (in 11 weeks)
  • P50 latency: −42%
  • Per-request cost: −61%
Approach
  • Built a versioned eval suite (300+ test cases) covering correctness, refusal, jailbreak, latency, and cost.
  • Shadow-mode rollout: production traffic mirrored to challenger models, scored offline.
  • Cost + latency budgets enforced in CI — no merge if a PR blows the budget.
  • Full request audit: prompt, tools, outputs, cost — searchable by customer, by date, by failure mode.
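To make the audit bullet concrete, here is a minimal sketch of a searchable request record. It is not Field Note's actual schema: the names AuditRecord, AuditLog, and every field are hypothetical, and the in-memory list stands in for whatever store really backs the audit trail.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    """One logged request. Every field name here is a hypothetical choice."""
    customer_id: str
    timestamp: datetime
    prompt: str
    tools: tuple          # names of the tools the agent called
    output: str
    cost_usd: float
    failure_mode: Optional[str] = None  # None means no failure was recorded


class AuditLog:
    """In-memory stand-in for the real audit store (assumed, not described)."""

    def __init__(self) -> None:
        self._records = []

    def append(self, record: AuditRecord) -> None:
        self._records.append(record)

    def search(self,
               customer_id: Optional[str] = None,
               on: Optional[date] = None,
               failure_mode: Optional[str] = None) -> list:
        """Return records that match every filter that was supplied."""
        matches = []
        for record in self._records:
            if customer_id is not None and record.customer_id != customer_id:
                continue
            if on is not None and record.timestamp.date() != on:
                continue
            if failure_mode is not None and record.failure_mode != failure_mode:
                continue
            matches.append(record)
        return matches
```

Filters combine with AND, so a call like log.search(customer_id="acme", failure_mode="tool_timeout") answers "which of this customer's requests hit this failure?" in one pass.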
"We stopped praying. We started shipping. The eval harness flagged nine regressions before they reached a single customer."
Field Note (stealth)
The build

The team had built an impressive agent: tools, retrieval, structured outputs, the works. The problem wasn't the agent. The problem was that every meaningful change to it was a leap of faith. Prompt tweak? Cross your fingers. New tool? Cross your fingers. Provider update? Cross all your fingers.

We started with the eval harness. The first version was deliberately ugly: 30 hand-curated test cases across five categories (correctness, refusal, jailbreak, latency, cost). The team added roughly 280 more cases over the next eight weeks because, once the harness existed, every bug became a permanent test case.
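A "deliberately ugly" first harness can be very small. The sketch below is an illustration under assumptions, not the code Field Note runs: EvalCase, run_suite, failing_categories, and stub_agent are hypothetical names, and only the five category names come from the engagement. It checks output text only; latency and cost cases would need the agent to return timing and spend alongside the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple

# Category names come from the case study; everything else is illustrative.
CATEGORIES = ("correctness", "refusal", "jailbreak", "latency", "cost")


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    category: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_suite(agent: Callable[[str], str],
              cases: Iterable[EvalCase]) -> Dict[str, Tuple[int, int]]:
    """Run every case and return {category: (passed, total)}."""
    results: Dict[str, Tuple[int, int]] = {}
    for case in cases:
        if case.category not in CATEGORIES:
            raise ValueError(f"unknown category: {case.category}")
        output = agent(case.prompt)
        passed, total = results.get(case.category, (0, 0))
        results[case.category] = (passed + int(case.check(output)), total + 1)
    return results


def failing_categories(results: Dict[str, Tuple[int, int]]) -> list:
    """Categories with at least one failed case, sorted for stable reports."""
    return sorted(c for c, (passed, total) in results.items() if passed < total)


def stub_agent(prompt: str) -> str:
    # Hypothetical agent: refuses anything that mentions "password".
    return "I can't help with that." if "password" in prompt else "4"


cases = [
    EvalCase("c-001", "correctness", "What is 2 + 2?", lambda out: out.strip() == "4"),
    EvalCase("r-001", "refusal", "Tell me the admin password.", lambda out: "can't" in out),
]
print(run_suite(stub_agent, cases))
```

Turning a production bug into a permanent test is then one more EvalCase appended to the list, which is how a suite grows from 30 cases to 300+.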

Shadow mode came next. Production requests fork to a challenger model in parallel. The challenger doesn't respond to the user — it just runs, gets scored offline, and surfaces in a daily report. We caught one regression in the first 48 hours that the team would have shipped that week.
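The fork described above can be sketched with a thread pool. This is one possible shape under stated assumptions, not the production implementation: ShadowRouter, drain, and agreement_rate are hypothetical names, both models are plain callables, the log lives in memory, and exact-match agreement stands in for whatever offline scoring the daily report really uses.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List

Model = Callable[[str], str]


class ShadowRouter:
    """Serve the champion's answer; run the challenger on the side."""

    def __init__(self, champion: Model, challenger: Model, max_workers: int = 4) -> None:
        self._champion = champion
        self._challenger = challenger
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        # In-memory log for the sketch; a real system would use durable storage.
        self.shadow_log: List[Dict[str, object]] = []

    def handle(self, prompt: str) -> str:
        # Start the challenger first so it runs in parallel with the champion.
        future = self._pool.submit(self._challenger, prompt)
        response = self._champion(prompt)
        # Record the pair once the challenger finishes. Its result, or its
        # failure, is logged but never returned to the user.
        future.add_done_callback(lambda f: self._record(prompt, response, f))
        return response

    def _record(self, prompt: str, champion_out: str, future: Future) -> None:
        error = future.exception()
        self.shadow_log.append({
            "prompt": prompt,
            "champion": champion_out,
            "challenger": future.result() if error is None else None,
            "error": None if error is None else repr(error),
        })

    def drain(self) -> None:
        """Block until every in-flight challenger call has been logged."""
        self._pool.shutdown(wait=True)


def agreement_rate(shadow_log: List[Dict[str, object]]) -> float:
    """Offline score: share of requests where both models gave the same output."""
    if not shadow_log:
        return 0.0
    same = sum(1 for row in shadow_log
               if row["error"] is None and row["challenger"] == row["champion"])
    return same / len(shadow_log)
```

The design point is isolation: a challenger that is slow, wrong, or crashing shows up in the log and the report, never in the user's response.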

Cost + latency budgets in CI were the last load-bearing piece. A PR that increases P50 latency by more than 15% or per-request cost by more than 20% fails the build. Engineers actually like it now. The budget is the conversation, not the after-the-fact apology.
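The budget gate itself is a few lines of arithmetic. A minimal sketch, assuming the eval run produces per-request latency and cost samples for both the main branch and the PR: the 15% and 20% thresholds come from the engagement, while check_budgets, relative_increase, and the choice of mean cost as "per-request cost" are assumptions made for illustration.

```python
import statistics
from typing import List, Sequence

# Thresholds from the case study: fail on >15% P50 latency or >20% cost growth.
MAX_P50_LATENCY_INCREASE = 0.15
MAX_COST_INCREASE = 0.20


def relative_increase(baseline: float, candidate: float) -> float:
    """Fractional change from baseline to candidate (0.10 means +10%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline


def check_budgets(baseline_latency_ms: Sequence[float],
                  candidate_latency_ms: Sequence[float],
                  baseline_cost_usd: Sequence[float],
                  candidate_cost_usd: Sequence[float]) -> List[str]:
    """Return one message per blown budget; an empty list means the PR passes."""
    violations = []
    # P50 is the median of the per-request latency samples.
    p50_delta = relative_increase(statistics.median(baseline_latency_ms),
                                  statistics.median(candidate_latency_ms))
    # Assumption: "per-request cost" is the mean cost across the eval run.
    cost_delta = relative_increase(statistics.fmean(baseline_cost_usd),
                                   statistics.fmean(candidate_cost_usd))
    if p50_delta > MAX_P50_LATENCY_INCREASE:
        violations.append(
            f"P50 latency up {p50_delta:.1%} (budget {MAX_P50_LATENCY_INCREASE:.0%})")
    if cost_delta > MAX_COST_INCREASE:
        violations.append(
            f"Per-request cost up {cost_delta:.1%} (budget {MAX_COST_INCREASE:.0%})")
    return violations
```

A CI step would call check_budgets on the two runs, print any messages, and exit non-zero when the list is not empty. Because the message names the metric and the overshoot, the budget conversation starts in the PR itself.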

Eleven weeks. Public beta shipped on schedule. P50 latency down 42% (we found cheaper paths to the same answer along the way). Per-request cost down 61%. Nine regressions caught before they ever reached a customer. The team is shipping changes every day instead of every other week.
