There is a gold rush happening right now in the AI tooling space. Everyone is building plugins, connectors, and tool integrations for AI agents. The idea is simple: give the agent more capabilities, and it will do more for you.
File access. Web search. Database queries. Code execution. Calendar management. The list grows every week. And the assumption is always the same: if we just give the agent the right tools, it will figure out how to use them in the right order, at the right time, for the right reasons.
This assumption is wrong. And I'm not the only one saying that.
Imagine you hand someone a hammer, a saw, a drill, some nails, screws, and a pile of wood. Then you say: "Build me a bookshelf." A skilled carpenter will figure it out. But an apprentice? They might nail boards together in the wrong order, skip measurements, or build something that wobbles.
This is what we do with AI agents today. We stuff their toolbox full of capabilities and then write a prompt that says "do this complex multi-step task" and hope they pick the right tool at each step, in the right sequence, with the right inputs.
Sometimes it works. Often it doesn't. And when it doesn't, the common reaction is: "We need a better prompt" or "We need more tools." Rarely does anyone say: "Maybe we shouldn't let the agent decide the order of operations."
The MCP (Model Context Protocol) ecosystem makes this even more tempting. You can plug in dozens of tool servers and suddenly your agent can browse the web, query databases, manage files, call APIs, and more. It feels like progress. But more tools in the hands of a single autonomous agent doesn't solve the core problem: who decides what to do, and in what order?
The data backs this up. A blog post by Anthropic on building effective agents draws a clear line between workflows - where code orchestrates LLMs through predefined paths - and agents - where the LLM decides its own process. Their recommendation: "find the simplest solution possible, and only increase complexity when needed." Use workflows when the process is deterministic and clear. Use agents when flexibility and reasoning are essential.
Andrew Ng identified four design patterns for agentic workflows: Reflection, Tool Use, Planning, and Multi-Agent Collaboration. The key insight is that even in "agentic" systems, the workflow prompts the LLM multiple times in a structured way - giving it opportunities to build step by step toward higher-quality output. That's orchestration, not autonomy.
And in the real world? An IntuitionLabs analysis found that AI workflows dominate in production, while fully autonomous agents remain largely experimental. Gartner's research shows less than 5% of organizations have fully integrated AI agents in operations today.
There is another way. Instead of building one super-agent with a dozen tools and a long prompt, you write a simple script that controls the entire flow. The script decides what happens first, what happens next, and what to do when something goes wrong. The AI is called at each step to do one specific, well-defined job.
Think of it like a factory assembly line versus a freelancer. The freelancer decides their own workflow. The assembly line doesn't. Each station has one job. The line decides the order. The worker at each station is great at their one task, but they never have to decide what comes next.
Picture a manager delegating tasks. The manager doesn't do the work, but they decide what needs to happen and in what order. The AI is the specialist who gets called in for each job.
At no point does the AI decide what step comes next. It answers its question, hands control back, and the script picks the next move.
In pseudo code, the whole thing looks like this:
```
for each requirement on the checklist:
    answer = ask AI("Is this implemented?")
    if answer is no:
        ask AI("Implement it.")

    answer = ask AI("Do tests exist?")
    if answer is no:
        ask AI("Write the tests.")

    result = run the tests
    attempts = 0
    while result is failing and attempts < 3:
        ask AI("The tests failed. Here's the output. Fix it.")
        result = run the tests again
        attempts = attempts + 1

    if result is passing:
        check it off the list
        save progress
```
No fancy framework. No plugin system. Just a loop, some conditions, and an AI that gets called with a clear task each time.
The "ask AI" above is a real command you can run in a terminal. This example uses Claude Code's headless mode, but the concept applies to any AI that supports non-interactive usage.
Asking a question (read-only). The AI can look at your code, but it can't change anything. You tell it exactly what shape the answer should have - a simple yes/no with a reason:
```shell
claude --print --permission-mode default \
  --json-schema '{ "implemented": boolean, "reason": string }' \
  "Is this feature implemented in the codebase?"
```
The script gets back something like { "implemented": false, "reason": "No token expiration check found." } and branches on it with a normal if/else. No guessing. No parsing free text.
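That branching lives in ordinary code. Here's a minimal Python sketch, assuming a claude CLI on the PATH with the flags shown above; ask_ai and next_step are hypothetical helper names, not part of any library:

```python
import json
import subprocess

def ask_ai(prompt, schema, permission_mode="default"):
    """Run one non-interactive claude call and parse its JSON answer.

    Assumes the CLI flags shown above. A real script would also
    handle timeouts and non-zero exit codes.
    """
    result = subprocess.run(
        ["claude", "--print", "--permission-mode", permission_mode,
         "--json-schema", schema, prompt],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

def next_step(answer):
    """Plain if/else on the structured answer - no free-text parsing."""
    if answer["implemented"]:
        return "check_tests"
    return "implement"
```

The orchestrating loop then dispatches on next_step(...) with normal control flow, exactly like any other function return value.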
Making changes. Now the AI is allowed to edit files, but still answers in a fixed format so the script knows what happened:
```shell
claude --print --permission-mode acceptEdits \
  --json-schema '{ "filesChanged": string[], "summary": string }' \
  "Implement this requirement: ..."
```
Simple one-off tasks. Sometimes you just need one small thing done, no structured answer needed:
```shell
claude --print --permission-mode acceptEdits \
  "Find the unchecked line in the report and check it off."
```
Three flags make this whole approach work:
- --print - Non-interactive mode. One question in, one answer out. No conversation. The AI processes the prompt and exits. This is what makes it scriptable. See headless mode in the docs.
- --permission-mode - Controls what the AI is allowed to do. default means it needs to ask before making changes (effectively read-only in a script). acceptEdits means it can modify files freely. The script decides which permissions each step gets. See permission modes.
- --json-schema - Forces the answer into a fixed structure. This is what lets you write normal if/else logic around AI responses instead of hoping the output is parseable. See structured output.

Each call is self-contained. The AI doesn't remember the previous step. It doesn't know what comes next. It does its one job and hands back a clean answer.
You can also combine this with --output-format stream-json to get real-time streaming of the AI's thinking, tool usage, and results. This is useful for logging and monitoring - your script can watch exactly what the AI is doing at each step and write it to a log file for later debugging.
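A log tail over that stream is a few lines of Python. This sketch assumes stream-json emits one JSON object per line, each carrying a type field; the event names in the test are illustrative, not a spec:

```python
import json

def log_stream(lines, sink):
    """Turn streamed JSON events into human-readable log entries.

    `lines` is any iterable of raw output lines (e.g. a subprocess's
    stdout); `sink` is a list or any object with append(). The exact
    event shapes are an assumption - adjust to what your CLI emits.
    """
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        sink.append(f"[{event.get('type', 'unknown')}] {raw}")
```

In a real pipeline you would append to a timestamped log file per step instead of an in-memory list, so each step's activity can be replayed during debugging.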
Predictability. You know exactly what will happen in what order. You can log every step. You can see where things went wrong. You can reproduce issues. With an autonomous agent, the path is different every time, and debugging becomes guesswork. When something breaks at step 4, you know it's step 4. You can re-run just that step. You can inspect the exact input and output. Try doing that with an agent that made 47 tool calls in an unpredictable order.
Scoped responsibility. Each AI call has one job. "Is this feature implemented? Answer yes or no, and explain why." That's it. The AI doesn't need to know about the other steps. It doesn't need to decide whether to write tests or fix code. It just answers its question. Smaller scope means fewer mistakes. It also means you can use different models for different steps - a cheaper, faster model for simple yes/no checks, and a more capable one for complex implementation tasks.
Structured output. When you tell the AI what shape its answer should take, you can write normal code around it. If the answer is "not implemented," the script branches to the implementation step. No parsing of free-text responses. No hoping the agent mentioned the right thing in the right format. No regex hacks to extract a yes or no from three paragraphs of explanation.
Retry logic that actually works. If tests fail, a script can say: "Try re-writing the tests first. If they still fail, fix the implementation. If it still fails after three rounds, give up and move on." That's a deliberate escalation strategy. An autonomous agent would have to figure this out on its own - and it usually won't. It might retry the same thing five times, or it might give up immediately, or it might start refactoring code that has nothing to do with the failure.
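That escalation strategy is ordinary code. A hedged sketch, where run_tests and the repair actions are stand-ins for the real test runner and the AI calls:

```python
def escalate(run_tests, repair_actions, max_rounds=3):
    """Try each repair action in order, re-running tests after each.

    `repair_actions` is an ordered list of callables - e.g. first
    "rewrite the tests", then "fix the implementation". After
    max_rounds full passes with no green run, give up deliberately.
    """
    if run_tests():
        return True
    for _ in range(max_rounds):
        for repair in repair_actions:
            repair()
            if run_tests():
                return True
    return False  # deliberate, logged surrender - not an agent stall
```

The ordering of repair_actions is the escalation policy, written down once and applied identically to every requirement.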
Cost control. Each call is a focused, minimal request. You're not paying for an agent that wanders through your codebase trying to figure out what to do. You're paying for a direct question and a direct answer, repeated in a loop you control. You can measure exactly how much each step costs, and optimize the expensive ones.
Permissions as a feature. With the permission mode flag, you can enforce that certain steps are read-only. The validation step can look at everything but touch nothing. Only the implementation step gets write access. This is a safety net you don't get when a single autonomous agent has all permissions at all times.
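One way to make that explicit is a step-to-permission table in the orchestrating script. The step names and the table are hypothetical; the flags are the ones described above:

```python
# Hypothetical mapping: read-only steps never get write access.
PERMISSION_BY_STEP = {
    "classify": "default",       # look, don't touch
    "implement": "acceptEdits",  # the only step that may edit files
    "validate": "default",
}

def build_command(step, prompt):
    """Build the claude invocation for one pipeline step.

    The permission mode comes from the table, never from the prompt,
    so a validation step can't be talked into editing files.
    """
    return [
        "claude", "--print",
        "--permission-mode", PERMISSION_BY_STEP[step],
        prompt,
    ]
```

Because the table lives in code, a reviewer can audit exactly which steps can write, without reading a single prompt.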
The industry is focused on making agents more autonomous. More tools, longer context windows, better reasoning. And those things help. But for repeatable production workflows, the biggest gains often come from going the other direction: less autonomy, more structure.
Don't build a smarter agent. Build a smarter process that uses a focused agent at each step.
This isn't about dumbing down AI. It's about recognizing that control flow is a solved problem. We've been writing scripts, pipelines, and state machines for decades. They're predictable, debuggable, and composable. AI is great at understanding context, generating content, and making judgment calls. Let each do what it's good at.
The irony is that this approach makes the AI more effective, not less. When you give an agent ten tools and a vague goal, it spends half its effort figuring out what to do. When you give it one clear task with a defined output format, it can focus entirely on doing that task well.
Autonomous agents are great for open-ended, exploratory tasks. "Research this topic." "Help me brainstorm." "Answer my questions as I think through a problem." These are conversations, not pipelines. The unpredictability is a feature, not a bug.
But when you have a repeatable, multi-step process with clear success criteria - don't hand it to an agent and hope. Write the flow yourself. Call the AI where you need intelligence. Keep the decisions in code.
Your future self, debugging at 2 AM, will thank you.
OAuth 2.0 compliance is not glamorous work. You have a stack of RFCs, a list of requirements, and a codebase that may or may not already handle them. Working through that list manually - reading specs, finding relevant code, writing tests, updating docs - is exactly the kind of repetitive, judgment-heavy work where a single broad agent prompt falls apart.
I tried the naive approach first: hand Claude a requirement and tell it to "handle it." The results were inconsistent. Sometimes it wrote tests, sometimes it didn't. Sometimes it updated the docs, sometimes it ignored them. It made decisions I didn't ask it to make and skipped ones I cared about.
So I wrote a script instead. For each unchecked requirement in a compliance report, it runs a 17-step pipeline: check whether the requirement is already met, review that assessment, implement it if needed, verify with tests, document it, and commit. Claude handles each step. The script handles the order.
The pipeline is grouped into phases:

- Classification and review: three calls that search the codebase and determine what actually needs work - does this belong to the library, to the host application, or is it deprecated?
- Implementation: if code is missing, Claude writes it. Deprecated features like the Implicit Grant flow get gated behind a config flag named dangerously_enable_* so the risk can't be missed.
- Testing: Claude checks for existing tests, reviews their quality, writes new ones if needed, and gets up to three attempts to fix failures.
- Documentation: Claude writes bundle docs, filtered to avoid documenting the obvious - only non-obvious behavior, integration points, and deprecation warnings make the cut.
Each call has scoped file access. The implementation agent can only touch src/. The test agent can only touch tests/. The docs agent can only touch docs/. This is enforced at the call level, not left to the AI to decide.
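The script can also double-check that enforcement after each call, using the filesChanged field from the structured answer shown earlier. This is a defensive sketch, a belt-and-suspenders addition rather than the enforcement mechanism itself:

```python
def assert_in_scope(files_changed, allowed_prefix):
    """Fail loudly if the AI reported edits outside its assigned directory.

    `files_changed` is the filesChanged list from the structured
    answer; `allowed_prefix` is the directory this step owns.
    """
    out_of_scope = [f for f in files_changed if not f.startswith(allowed_prefix)]
    if out_of_scope:
        raise RuntimeError(f"Edits outside {allowed_prefix}: {out_of_scope}")
    return True
```

A scope violation stops the pipeline at the exact step that caused it, with the offending paths in the error message.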
The result is a pipeline that processes dozens of RFC requirements consistently, with full logs of what happened at every step, without me sitting there supervising each one. It's not impressive because of what Claude does. It's impressive because the script never lets Claude do the wrong thing at the wrong time.
The tools will change. The principle won't.