AI Agents · 8 min · Agents in Production / Ep. 1

I Run 46 AI Agents in Production. Here's What Broke.

Adam Boudjemaa

46 AGENTS IN PRODUCTION
5 EXPENSIVE FAILURES and counting

What Are AI Agents, and Why 46?

An AI agent is a program that can make decisions and take actions on its own. Think of it like a very focused employee that never sleeps. You give it a goal, some rules, and access to tools. It figures out the rest.

I didn't plan to build 46 of them. I had problems. Writing LinkedIn posts took 3 hours a week. Researching prediction markets was inconsistent. I kept missing opportunities because I wasn't watching enough signals. So I built agents to handle each problem, one at a time.

They're organized into teams, like departments in a small company:

Content Team (8 agents): Write LinkedIn posts, score engagement, optimize my profile, humanize text
SEO Team (5 agents): Make sure my content ranks on Google and AI search engines
Trading Team (6 agents): Research prediction markets, score opportunities, manage risk
Research Team (5 agents): Deep research on any topic, synthesize findings, track patterns
Game Team (6 agents): Design game levels, audio, UI, and monetization strategies
Engineering Team (16 agents): Plan projects, write code, review PRs, catch bugs, triage issues

These aren't demos. They run every day, unsupervised. And that's where things get interesting, because unsupervised software can fail in ways you don't expect.

Here are the 5 failures that taught me the most.

The $500 Infinite Loop

$500 WASTED in 90 minutes

What happened: My Content Team has an agent that writes LinkedIn post hooks (the opening line that makes people stop scrolling). It generates 10 options, ranks them by predicted engagement, then tries to improve the best ones. The problem? I forgot to tell it when to stop.

Why it matters: The agent kept finding tiny improvements and kept rewriting. By the time I woke up, it had generated 50,000 hook variations for a single post. Each AI call costs about a penny. 50,000 pennies is $500.

This is called an infinite loop: when a program keeps repeating a step forever because nobody told it when to quit. It's like asking someone to "keep improving this essay" without saying "stop after 3 drafts."

The fix: Every agent now has a spending limit, which is a hard cap on how many times it can run and how much money it can spend per day. Think of it like a prepaid debit card instead of an unlimited credit card.

spending-limits.ts
// Every agent gets a spending limit before it can run
interface SpendingLimit {
  maxRetries: number      // "stop after 3 attempts"
  dailyBudget: number     // "you can spend $5 today, max"
  callsSoFar: number      // tracks how many times it's run
}

function canAgentContinue(limit: SpendingLimit): boolean {
  if (limit.callsSoFar >= limit.maxRetries) return false
  if (getTodaysSpend() >= limit.dailyBudget) return false
  return true
}

The hook generator is now capped at 3 improvement rounds and $5/day. When it hits either limit, it returns whatever it has.
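To make the two stop conditions concrete, here's a hedged sketch of what a capped improvement loop looks like. The names (`improveHook`, `LoopLimits`) and the limits are illustrative, not the production agent's actual code:

```typescript
// Illustrative sketch: an improvement loop with two hard stop conditions.
interface LoopLimits {
  maxRounds: number    // hard cap on improvement rounds (e.g. 3)
  dailyBudget: number  // hard cap on spend in dollars (e.g. $5)
}

function improveHook(
  initial: string,
  improve: (hook: string) => { hook: string; cost: number },
  limits: LoopLimits
): string {
  let best = initial
  let spent = 0
  for (let round = 0; round < limits.maxRounds; round++) {
    const candidate = improve(best)
    spent += candidate.cost
    if (spent > limits.dailyBudget) break  // budget cap wins over quality
    best = candidate.hook
  }
  return best  // always terminates with *something* usable
}
```

Whichever limit hits first ends the loop, and the function always returns the best result so far instead of spinning forever.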

When 6 Minutes Cost Me $200

Before: No freshness check

Agent found an opportunity at 11:55 PM. Executed at 12:01 AM. Market had already moved. $200 lost on 6-minute-old data.

After: Mandatory freshness check

Every opportunity has an expiration timer. Fast markets: 5 minutes. Slow signals: 30 minutes. If the timer expires, the agent re-checks before acting.

What happened: My Research Team found a trading opportunity at 11:55 PM. Score: 95 out of 100. Looked great. The agent queued it for execution at 12:01 AM to avoid rate limits. Six minutes later, the market had already moved. The agent executed anyway.

Why it matters: The data wasn't wrong. It was right 6 minutes ago. But in fast-moving markets, 6 minutes is a lifetime. This is called stale data: information that was accurate when collected but is outdated by the time you act on it. It's like driving with a GPS that updates every 10 minutes in a city where roads close every 5.

The fix: Every opportunity now has a freshness timer. If too much time passes between scoring and acting, the agent re-checks the data before proceeding.

freshness-check.ts
// Before acting, check: is this data still fresh?
interface Opportunity {
  score: number
  scoredAt: Date
  maxAge: number  // in milliseconds: 5 min for fast markets, 30 min for slow
}

function isFresh(opp: Opportunity): boolean {
  const ageMs = Date.now() - opp.scoredAt.getTime()
  return ageMs <= opp.maxAge
}

// The rule: never act on stale data
async function execute(opp: Opportunity) {
  if (!isFresh(opp)) {
    return rescore(opp)  // re-check, don't blindly execute
  }
  return act(opp)
}
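Plugging the numbers from the $200 incident into this check makes it concrete. This is a self-contained restatement (I've renamed `maxAge` to `maxAgeMs` to make the unit explicit, and made "now" injectable; the timestamps are illustrative):

```typescript
interface Opportunity {
  score: number
  scoredAt: Date
  maxAgeMs: number  // freshness window in milliseconds
}

// Injectable "now" makes the check testable without waiting in real time
function isFresh(opp: Opportunity, now: number = Date.now()): boolean {
  return now - opp.scoredAt.getTime() <= opp.maxAgeMs
}

// Scored at 11:55 PM with a 5-minute window, checked at 12:01 AM
const scoredAt = new Date("2026-02-01T23:55:00Z")
const opp: Opportunity = { score: 95, scoredAt, maxAgeMs: 5 * 60 * 1000 }
const checkedAt = scoredAt.getTime() + 6 * 60 * 1000

console.log(isFresh(opp, checkedAt))  // false: 6 minutes old, 5-minute cap
```

A 6-minute-old score fails a 5-minute window, so the agent re-scores instead of executing on stale data.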

98% Accurate and Still Losing Money

Before: Trusting raw accuracy

98% accuracy sounds amazing. But the 2% wrong calls were all big bets. Right on $1 trades. Wrong on $1,000 trades.

After: Weighted trust score

Trust = (accuracy x 40%) + (big-bet accuracy x 40%) + (confidence calibration x 20%). Big bets only go to agents with high trust.

What happened: My Trading Team agent hit 98% accuracy on prediction market calls over 3 months. Impressive, right? Then the 2% wrong calls wiped out a huge chunk of the gains.

Why it matters: The agent was right almost every time on small, obvious bets. But on the big bets where it was most confident, it was often wrong. An agent that's right 98% of the time on $1 bets and wrong every time on $1,000 bets is a terrible system. This is the difference between accuracy (how often you're right) and calibration (whether your confidence matches reality).

Imagine a weather app that's right 98% of the time about sunny days but wrong every time it predicts rain. You'd still get soaked.

The fix: I replaced simple accuracy with a trust score that weighs how much money was on the line when the agent was right or wrong.

trust-score.ts
// Trust isn't just "how often are you right?"
// It's "are you right when it matters most?"
function computeTrustScore(agent: AgentRecord): number {
  const rawAccuracy = agent.wins / agent.totalTrades
  const bigBetAccuracy = agent.bigBetWins / agent.bigBetTotal
  const calibration = agent.confidenceCalibration

  // Weight big-bet accuracy as heavily as raw accuracy
  return (rawAccuracy * 0.4)
       + (bigBetAccuracy * 0.4)
       + (calibration * 0.2)
}

// Agents with low trust get smaller bets, period
function getMaxBetSize(trustScore: number): number {
  if (trustScore < 0.6) return 10   // $10 max
  if (trustScore < 0.8) return 50   // $50 max
  return 200                         // full allocation
}

Now, agents with a poor trust score get smaller bets regardless of how confident they feel about a specific opportunity. Trust is earned across all bets, not claimed on individual ones.
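Running a 98%-accurate-but-miscalibrated agent through this formula shows why it gets benched. The specific numbers here are illustrative, not my agent's actual record:

```typescript
interface AgentRecord {
  wins: number
  totalTrades: number
  bigBetWins: number
  bigBetTotal: number
  confidenceCalibration: number  // 0..1: does confidence match reality?
}

function computeTrustScore(agent: AgentRecord): number {
  const rawAccuracy = agent.wins / agent.totalTrades
  const bigBetAccuracy = agent.bigBetWins / agent.bigBetTotal
  return rawAccuracy * 0.4
       + bigBetAccuracy * 0.4
       + agent.confidenceCalibration * 0.2
}

function getMaxBetSize(trustScore: number): number {
  if (trustScore < 0.6) return 10   // $10 max
  if (trustScore < 0.8) return 50   // $50 max
  return 200                         // full allocation
}

// 98% accurate overall, but 1 win in 10 big bets, poorly calibrated
const agent: AgentRecord = {
  wins: 98, totalTrades: 100,
  bigBetWins: 1, bigBetTotal: 10,
  confidenceCalibration: 0.3,
}
const trust = computeTrustScore(agent)
// 0.98 * 0.4 + 0.1 * 0.4 + 0.3 * 0.2 = 0.492 → capped at $10 bets
```

Despite the headline 98%, the trust score lands under 0.6 and the agent is limited to $10 bets until its big-bet record improves.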

When All Your Agents Think Alike

What happened: In February 2026, about $400M in positions got liquidated in a single cascade. The cause? Roughly 15,000 autonomous agents across various platforms had similar strategies and similar exit triggers. When the first ones started selling, it pushed prices down, which triggered more agents to sell, which pushed prices down further. A domino effect.

My Trading Team agents weren't directly in that cascade. But I looked at my own system and saw the same pattern: multiple agents reading the same data sources, reaching the same conclusions, and taking the same positions. If something went wrong, they'd all react identically.

The fix: I added diversity rules. Before any agent takes a position, the system checks: "How many of our agents are already betting in this direction?" If too many agree, the new bet gets blocked.

diversity-rules.ts
// If everyone agrees, that's not conviction. That's a blind spot.
const MAX_DIRECTIONAL_EXPOSURE = 0.4  // at most 40% of capital one way

function canTakePosition(
  newBet: Trade,
  existingBets: Trade[],
  bankroll: number
): boolean {
  // Find bets that point in the same direction
  const similarBets = existingBets.filter(
    bet => isSimilar(bet, newBet)
  )
  // If >40% of our money would be going this way, block it
  const totalExposure = sum(similarBets.map(b => b.size))
  return (totalExposure + newBet.size) / bankroll < MAX_DIRECTIONAL_EXPOSURE
}
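Here's a self-contained worked example of that rule. The `Trade` shape, `isSimilar` logic, and market names are stand-ins for illustration, not the real system's types:

```typescript
// Illustrative stand-ins for the real Trade type and similarity check
interface Trade { market: string; direction: "yes" | "no"; size: number }

const isSimilar = (a: Trade, b: Trade) =>
  a.market === b.market && a.direction === b.direction

const MAX_DIRECTIONAL_EXPOSURE = 0.4  // at most 40% of capital one way

function canTakePosition(
  newBet: Trade,
  existingBets: Trade[],
  bankroll: number
): boolean {
  const similar = existingBets.filter(b => isSimilar(b, newBet))
  const exposure = similar.reduce((acc, b) => acc + b.size, 0)
  return (exposure + newBet.size) / bankroll < MAX_DIRECTIONAL_EXPOSURE
}

// $1,000 bankroll, $350 already betting "yes" in the same market
const existing: Trade[] = [
  { market: "rate-cut-march", direction: "yes", size: 200 },
  { market: "rate-cut-march", direction: "yes", size: 150 },
]
const newBet: Trade = { market: "rate-cut-march", direction: "yes", size: 100 }

console.log(canTakePosition(newBet, existing, 1000))  // false: 45% > 40% cap
```

A $100 bet would push directional exposure to 45% of the bankroll, so it's blocked; a $40 bet (39%) would still pass.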

The Failures Nobody Notices

Problem 1: Ghost agents

An agent ran, did nothing useful, reported "success." Published empty LinkedIn comments for 2 weeks. Nobody noticed.

Problem 2: Slow bleed

46 agents each spending a few dollars a day. No single one is expensive. Together: $50+/day on autopilot. Small costs compound quietly.

Ghost agents are the scariest failure mode. The agent runs. It reports "success." But it didn't actually do anything useful. My Content Team had a comment formatter that hit a bug with special characters. Instead of crashing (which I'd have noticed), it quietly returned empty text. The rest of the pipeline kept going and published blank comments for 2 weeks.

The fix for ghost agents: Every agent now files an execution receipt, like a delivery confirmation. If the receipt says "success" but the actual output is empty, that's a contradiction, and the system flags it immediately.

execution-receipt.ts
// Every agent must prove it actually did something
interface ExecutionReceipt {
  agentId: string
  status: "success" | "failure" | "timeout"
  outputHash: string | null  // fingerprint of what was produced
}

// The catch: "success" + no output = something's wrong
function validateReceipt(receipt: ExecutionReceipt): boolean {
  if (receipt.status === "success" && !receipt.outputHash) {
    flagForReview(receipt)  // ghost agent detected
    return false
  }
  return true
}

Slow bleed is the other invisible problem. No single agent is expensive. But 46 agents each spending a few dollars a day adds up fast. The fix is the same spending limit system from Failure #1, but applied at the team level too, not just individual agents.

team-budget.ts
// Individual limits aren't enough. Teams need budgets too.
interface TeamBudget {
  teamName: string
  dailyLimit: number    // the whole team's budget
  perAgentLimit: number // no single agent dominates
  expiresAt: Date       // forces regular review
}

// Example: Content Team gets $15/day across its 8 agents
// Each agent maxes out at $5/day
// Budget expires monthly — forces me to review costs
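A hedged sketch of how both caps could be checked before each run. The function name and the spend ledger are illustrative, not the real system's API:

```typescript
interface TeamBudget {
  teamName: string
  dailyLimit: number    // whole-team daily cap in dollars
  perAgentLimit: number // per-agent daily cap in dollars
}

function canTeamAgentSpend(
  budget: TeamBudget,
  spendByAgent: Map<string, number>,  // agentId -> dollars spent today
  agentId: string,
  cost: number
): boolean {
  const agentSpend = spendByAgent.get(agentId) ?? 0
  let teamSpend = 0
  for (const spent of spendByAgent.values()) teamSpend += spent
  // Both caps must hold: the agent's own AND the whole team's
  return agentSpend + cost <= budget.perAgentLimit
      && teamSpend + cost <= budget.dailyLimit
}

// Content Team at $15/day; three agents have already spent $14 total
const budget: TeamBudget = { teamName: "content", dailyLimit: 15, perAgentLimit: 5 }
const ledger = new Map([["writer", 5], ["scorer", 5], ["humanizer", 4]])

console.log(canTeamAgentSpend(budget, ledger, "optimizer", 2))  // false: team cap
console.log(canTeamAgentSpend(budget, ledger, "optimizer", 1))  // true
```

The team cap catches the slow bleed even when every individual agent is under its own limit.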

What Actually Works

After all these failures, three patterns survived, and they're what I'd use from day one on any new project.

Spending limits. Problem solved: agents that spend without boundaries. Result: per-agent + per-team daily caps, max $5/agent/day.

Execution receipts. Problem solved: ghost agents that silently fail. Result: every run must prove it did real work; empty output = flagged.

Trust scores. Problem solved: accuracy metrics that hide losses. Result: big-bet accuracy weighted equally with raw accuracy; low trust = small bets.

Notice what these have in common. They're all boring. Spending limits are just prepaid budgets. Execution receipts are just delivery confirmations. Trust scores are just track records. None of this is cutting-edge AI research. It's basic risk management applied to software.

1. Give every agent an ID

You can't track what you can't identify.

2. Set spending limits

Per-agent AND per-team daily budgets. No exceptions.

3. Require execution receipts

Every run proves it did real work.

4. Track trust over time

Weight results by how much was at stake.

5. Enforce diversity

If most agents agree, block new bets in the same direction.

If you're building agents, start here. Not with the fancy stuff. With the guardrails.

The Takeaway

I've spent real money learning these lessons. $500 on a feedback loop. $200 on stale data. Losses from a poorly calibrated trust system. The patterns that keep my 46 agents running aren't clever. They're borrowed from decades of financial risk management: budget limits, delivery confirmations, track records, and diversification.

The agent space is growing fast. About 30% of Polymarket trades are now agent-driven. Over 550 agent projects exist with a combined $4.34B market cap. Most of them don't have these guardrails yet.

30% OF POLYMARKET TRADES are agent-driven
$4.34B AGENT MARKET CAP across 550+ projects

Build the guardrails before you need them. Don't wait for a $500 bill to make the point.

Agents in Production: Episode 1 of 8

Adam Boudjemaa

CTO at Integra. Co-author of ERC-3643, ERC-6960, ERC-7410. Building at the intersection of AI and Web3.

Enjoyed this post?

Get more like it in your inbox every Tuesday.