Your new colleague forgets the brief. Won't change the plan. Once ran up $47,000 in API costs while two agents talked to each other in a loop. For eleven days. Nobody noticed.
We were promised software that thinks, plans, and acts. What we got: agents stuck on pop-ups they can't close. Infinite loops burning five figures. Confident execution of wrong processes.
The fix isn't a smarter model. It's architecture. And knowing your own process before you hand it to a machine.
---
CMU built a fake company called TheAgentCompany. Real office tasks. Best AI agents available. Same tasks, same environment, over and over.
```
$ benchmark --model claude-3.5-sonnet --tasks office
Tasks completed: 24%
Tasks failed: 76%

$ benchmark --model gemini --tasks office
Tasks completed: 11%

$ benchmark --model gpt-4o --tasks office
Tasks completed: 8.6%
```

The top agent failed three out of four times on standard office work.
One agent couldn't close a pop-up on a website. It gave up. Another couldn't find someone in the company chat, so it renamed a different user to match the name it needed. The researchers called it "creating fake shortcuts."
For tech that's supposed to replace human work, that's not a small bug. That is the product.
[CMU TheAgentCompany](https://www.cs.cmu.edu/news/2025/agent-company) / [Paper](https://arxiv.org/abs/2412.14161)
---
Most automation is a chain. Read request. Find customer. Check history. Update CRM. Send response.
If every step works, you're fine. One step wrong? Everything downstream breaks.
Patronus AI ran the numbers. A 1% error rate per step—one wrong move in a hundred—turns into a 63% chance of failure by step 100.
```
$ simulate --error-rate 0.01 --steps 100
Cumulative failure probability: 63.4%
Status: unreliable
```

The more steps your agent takes, the more likely the whole run is garbage. Another benchmark across 34 tasks and three agent frameworks landed at about 50% task completion. Half the time, they don't even finish.
Great in demos. Falls apart when the task gets long and messy.
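The arithmetic behind that 63% is plain compounding: a chain succeeds only if every step does. A minimal sketch of the calculation (function name is mine, not from the Patronus report, and it assumes independent steps with a uniform error rate):

```javascript
// Probability that at least one step in a chain fails, assuming
// independent steps that each fail with the same probability.
function chainFailureProbability(perStepErrorRate, steps) {
  return 1 - Math.pow(1 - perStepErrorRate, steps);
}

console.log(chainFailureProbability(0.01, 100).toFixed(3)); // "0.634"
console.log(chainFailureProbability(0.01, 10).toFixed(3));  // "0.096"
```

Ten steps at 1% error is survivable. A hundred is a coin flip you usually lose.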
[VentureBeat / Patronus](https://venturebeat.com/infrastructure/ai-agents-fail-63-of-the-time-on-complex-tasks-patronus-ai-says-its-new) / [34-task benchmark](https://arxiv.org/abs/2508.13143)
---
Humans hit a wall and rethink. Agents don't.
They make a plan once and execute it. Even when the plan is wrong. Even when every signal says stop.
McKinsey's assessment: LLMs are "fundamentally passive." They struggle with multi-step, branching workflows. 90% of vertical use cases are still stuck in pilot.
Not edge cases. Most of what companies want to do with agents.
```javascript
// What a human does
function humanWorkflow(task) {
  const plan = createPlan(task);
  for (const step of plan.steps) {
    const result = execute(step);
    if (result.failed) {
      return replan(task, result.context); // rethink
    }
  }
}

// What an agent does
function agentWorkflow(task) {
  const plan = createPlan(task);
  for (const step of plan.steps) {
    execute(step); // no checking, no replanning, keep going
  }
}
```

They keep running a bad plan instead of fixing it. And there's a deeper problem: even when they have a plan, they forget it.
[McKinsey - Seizing the agentic AI advantage](https://www.mckinsey.com/capabilities/quantumblack/our-insights/seizing-the-agentic-ai-advantage)
---
Long tasks break agents for one reason. Context windows.
As the conversation grows, the model has to "remember" everything in that window. It doesn't.
Anthropic calls it context rot. The more tokens you stuff in, the worse the model gets at recalling what matters. By step 7, the agent contradicts what it decided in step 2. Early context gets pushed out or drowned in noise.
Imagine a project manager who forgets half the project while working on it.
That's not a metaphor. That's what happens.
```
$ agent --task "multi-step workflow" --observe
Step 1: Read requirements ✓
Step 2: Design schema ✓
Step 3: Implement logic ✓
...
Step 7: Deploy changes
Warning: contradicts schema from step 2
Warning: references deleted requirements
Status: context rot detected
```

And when the tools themselves break? Agents don't ask for help. They loop.
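There's a mechanical reason the step-2 decision vanishes by step 7: most agent loops trim conversation history to fit a token budget, and the naive strategy drops the oldest messages first. A toy sketch of that failure mode (the budget and message shapes are illustrative, not any specific framework's API):

```javascript
// Naive context management: append every turn, then keep only the
// most recent messages when over budget. Early decisions go first.
function trimToFit(messages, maxMessages) {
  return messages.slice(-maxMessages);
}

let history = [];
for (let step = 1; step <= 7; step++) {
  history.push(`step ${step}: decision`);
  history = trimToFit(history, 4); // budget of 4 messages
}

// By step 7, the schema decision from step 2 is no longer in context.
console.log(history.includes("step 2: decision")); // false
```

The model isn't being forgetful; the decision literally isn't in front of it anymore.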
[Anthropic - Effective context engineering](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
---
Agents talk to databases, APIs, search engines, internal tools. When a tool call fails, they rarely ask for help. They loop, output wrong data, or fail silently.
One team learned this the hard way. Four LangChain agents coordinating on market research.
Week 1: $127
Week 2: $891
Week 3: $6,240
Week 4: $18,400
Total: $47,000.
Two agents got stuck in an infinite conversation loop. For 11 days. Nobody noticed until the bill arrived.
So much for "autonomous automation."
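An 11-day loop is preventable, but the guardrail has to live outside the agents themselves. A sketch of the idea: a circuit breaker that halts the run when a turn cap or spend cap is crossed (names and structure are mine, not LangChain's API):

```javascript
// Circuit breaker for agent runs: call record() after every turn and
// abort when either the turn count or cumulative cost exceeds its cap.
function makeLoopGuard({ maxTurns, maxCostUsd }) {
  let turns = 0;
  let costUsd = 0;
  return {
    record(turnCostUsd) {
      turns += 1;
      costUsd += turnCostUsd;
      if (turns > maxTurns) {
        throw new Error(`halted: ${turns} turns exceeds cap of ${maxTurns}`);
      }
      if (costUsd > maxCostUsd) {
        throw new Error(`halted: $${costUsd.toFixed(2)} exceeds budget`);
      }
    },
  };
}

// Usage: one guard per run, shared by all cooperating agents.
const guard = makeLoopGuard({ maxTurns: 50, maxCostUsd: 100 });
```

Two lines of bookkeeping would have turned a $47,000 month into a $100 alert.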
[Youssef Hosni - We spent $47,000 running AI agents](https://youssefh.substack.com/p/we-spent-47000-running-ai-agents)
---
Deloitte's 2026 State of AI report: 75% of companies plan to invest in agentic AI.
How many have agents actually running in production? 11%.
MIT Media Lab looked at 300+ AI initiatives. 95% of enterprise AI pilots delivered zero measurable return. Only 5% made it to production with real impact.
Gartner says over 40% of agentic AI projects will be cancelled by end of 2027. Costs too high. Value unclear. Risk too real.
The current wave isn't "revolutionary." It's experimental. And most of it won't ship.
Why? One thing. We're automating chaos.
[Deloitte State of AI 2026](https://www.deloitte.com/us/en/what-we-do/capabilities/applied-artificial-intelligence/content/state-of-ai-in-the-enterprise.html) / [MIT NANDA report](https://mlq.ai/media/quarterlydecks/v0.1StateofAIinBusiness2025Report.pdf) / [Gartner via Reuters](https://www.reuters.com/business/over-40-agentic-ai-projects-will-be-scrapped-by-2027-gartner-says-2025-06-25)
---
Someone studied 20 companies deploying AI agents over five months. Fourteen were trying to automate processes that were never documented, never stable, and in many cases never understood by the people doing the work.
A wealth management firm spent two months training an agent on client onboarding.
The official process had 12 steps.
They watched three analysts do it in real life. The real process had 47 steps.
Three informal Slack pings to compliance. Two Excel sheets "everyone just knows about." A monthly check-in with a vendor whose contract had technically expired.
The agent followed the 12-step manual. It confidently did the wrong thing.
The agent wasn't broken. The process was.
Most companies don't know their own workflows well enough to automate them. You can't hand a machine a process you don't understand yourself.
[Abdul Tayyeb Datarwala - I studied 20 companies using AI agents](https://medium.com/@tayyeb.datar/i-studied-20-companies-using-ai-agents-heres-why-most-will-fail-68c7413bce03)
---
Researchers showed you can attack agents with "malfunction amplification." You mislead them into repetitive or useless actions.
Failure rates went over 80%. Those attacks are hard to catch with LLMs alone.
Unsupervised agents in finance or infrastructure aren't just brittle. They're a security risk.
This isn't "models aren't smart enough yet." It's an architecture problem.
[Breaking Agents - arXiv 2407.20859](https://arxiv.org/abs/2407.20859)
---
Most agents today: prompt goes in, LLM reasons, makes a tool call, spits out output.
Reliable automation needs something different:
```
$ describe --architecture reliable-agent
Components:
1. Intent layer → what the user actually wants
2. Planner → breaks intent into steps
3. Executor → runs each step
4. State manager → tracks what happened
5. Memory → persists decisions across sessions
6. Verifier → checks output against intent
Status: mostly missing from current systems
```

McKinsey said it after a year of deployment work: getting real value from agentic AI means changing whole workflows, not just dropping in an agent. The architecture is missing. Bigger context windows and smarter models won't fix that alone.
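Wired together, those components form a loop rather than a straight line: plan, execute, track state, verify against intent, and replan on failure. A sketch of that control flow (every function here is a placeholder to be supplied by a real implementation, not a working agent):

```javascript
// Execute-verify-replan loop: the verifier closes the loop that
// straight-line agent pipelines leave open.
function runReliably(intent, { plan, execute, verify, replan }, maxReplans = 3) {
  let steps = plan(intent);
  for (let attempt = 0; attempt <= maxReplans; attempt++) {
    const state = []; // state manager: what actually happened this attempt
    for (const step of steps) {
      state.push(execute(step));
    }
    if (verify(intent, state)) {
      return { ok: true, state };
    }
    steps = replan(intent, state); // feed observed state back into planning
  }
  return { ok: false, reason: "exhausted replan budget" };
}
```

The point is the shape, not the details: failure is a normal input to planning, and the run ends with an explicit verdict instead of trailing off.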
[McKinsey - One year of agentic AI: six lessons](https://www.mckinsey.com/capabilities/quantumblack/our-insights/one-year-of-agentic-ai-six-lessons-from-the-people-doing-the-work)
---
They're not useless. They're early.
They work when the task is narrow and the path is clear. CMU found agents handle structured work like data analysis but struggle with anything requiring real reasoning. Salesforce's CRMArena-Pro benchmark: 58% success in single-turn scenarios, about 35% in multi-turn.
Single shot, clear task: okay. Multi-step, lots of decisions: not yet.
The companies winning with agents aren't the ones that moved fastest or spent the most. They're the ones that understood their own processes first.
[Salesforce CRMArena-Pro](https://arxiv.org/abs/2505.18878)
---
Every failure in this piece—forgetting, looping, wrong plans, broken processes—traces back to the same gap. Agents have no context engineering.
Context engineering isn't "dump everything into the prompt." It's deciding exactly what information gets into the model's limited attention at each step. What it sees. What it keeps. What it drops.
Without it, agents forget what they did three steps ago, lose track of which tools worked, can't carry decisions across sessions, and treat every task like the first time. The context window fills with noise. Coherence disappears.
That's not an intelligence problem. It's an infrastructure problem.
The solution: instead of stuffing the whole world into the context window and hoping the model pays attention, you put agent memory in a structured layer and retrieve only what's relevant at each step.
```
$ context-engine --describe
Branches:
  tool_knowledge → what tools exist, when to use them
  project_context → what's been observed and decided
  session_memory → what happened this run
  user_preferences → how things should be done
Strategy: smallest high-signal set per turn
Result: old noise fades, important decisions stick
```

Separate knowledge into branches. Do context engineering automatically every turn. Smallest high-signal set for the current task, injected into the agent's working memory.
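The selection step is the heart of the idea: score every memory item against the current task and inject only the top few, instead of the full history. A toy sketch (keyword overlap stands in for real relevance ranking; this is an illustration of the strategy, not LocusGraph's actual mechanism):

```javascript
// Pick the smallest high-signal slice of memory for the current turn.
// Relevance here is naive word overlap; a real system would use
// embeddings or a learned ranker.
function selectContext(memory, task, limit = 3) {
  const taskWords = new Set(task.toLowerCase().split(/\s+/));
  return memory
    .map((item) => ({
      item,
      score: item.text.toLowerCase().split(/\s+/)
        .filter((w) => taskWords.has(w)).length,
    }))
    .filter((m) => m.score > 0)        // old noise fades
    .sort((a, b) => b.score - a.score) // important decisions stick
    .slice(0, limit)
    .map((m) => m.item);
}

const memory = [
  { branch: "project_context", text: "schema uses customer_id as primary key" },
  { branch: "session_memory", text: "retried the billing export twice" },
  { branch: "user_preferences", text: "summaries should be one paragraph" },
];

const picked = selectContext(memory, "update schema primary key");
console.log(picked.map((m) => m.branch)); // [ 'project_context' ]
```

One relevant decision in the window beats three branches of noise fighting for the model's attention.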
That's what we built [LocusGraph](https://locusgraph.com) to do. A context engineering layer between your agent and its memory. Agents that learn, remember, and improve—without context rot, token overflow, or repeating the same mistakes.
If you're building agents that need to work in the real world—not just on stage—the first thing to fix is their memory.
[locusgraph.com](https://locusgraph.com)
---
1. CMU TheAgentCompany - [CMU News](https://www.cs.cmu.edu/news/2025/agent-company) / [Paper](https://arxiv.org/abs/2412.14161)
2. Error compounding (1% → 63%) - [VentureBeat](https://venturebeat.com/infrastructure/ai-agents-fail-63-of-the-time-on-complex-tasks-patronus-ai-says-its-new) / [Business Insider](https://www.businessinsider.com/ai-agents-errors-hallucinations-compound-risk-2025-4)
3. 34-task benchmark (~50%) - [Quantum Zeitgeist](https://quantumzeitgeist.com/ai-agents-fail-half-the-time-new-benchmark-reveals-weaknesses/) / [Paper](https://arxiv.org/abs/2508.13143)
4. McKinsey - [Seizing the agentic AI advantage](https://www.mckinsey.com/capabilities/quantumblack/our-insights/seizing-the-agentic-ai-advantage)
5. Anthropic - [Effective context engineering](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
6. Deloitte - [State of AI 2026](https://www.deloitte.com/us/en/what-we-do/capabilities/applied-artificial-intelligence/content/state-of-ai-in-the-enterprise.html)
7. MIT Media Lab - [State of AI in Business 2025](https://mlq.ai/media/quarterlydecks/v0.1StateofAIinBusiness2025Report.pdf)
8. Gartner - [40% agent projects scrapped by 2027](https://www.reuters.com/business/over-40-agentic-ai-projects-will-be-scrapped-by-2027-gartner-says-2025-06-25)
9. 20 companies study - [Medium](https://medium.com/@tayyeb.datar/i-studied-20-companies-using-ai-agents-heres-why-most-will-fail-68c7413bce03)
10. Breaking Agents - [arXiv 2407.20859](https://arxiv.org/abs/2407.20859)
11. McKinsey - [One year of agentic AI](https://www.mckinsey.com/capabilities/quantumblack/our-insights/one-year-of-agentic-ai-six-lessons-from-the-people-doing-the-work)
12. Salesforce - [CRMArena-Pro](https://arxiv.org/abs/2505.18878)
13. $47k agent loop - [Youssef Hosni](https://youssefh.substack.com/p/we-spent-47000-running-ai-agents)