The System Design Newsletter


What Makes an AI Agent Different From ChatGPT?

#111: AI Agents Explained: How They Go From Instructions to Action

Sairam Sundaresan and Neo Kim
Jan 05, 2026

Get my system design playbook for FREE on newsletter signup:

  • Share this post & I'll send you some rewards for the referrals.


Imagine you maintain a scraper that checks flight prices every morning before your team’s standup.

It worked fine yesterday: navigate to the site, click the “Cheapest” toggle, collect results, and post to Slack. Simple and predictable.

Today, it fails… silently.

You dig in.

The airline redesigned its UI overnight. The button that said “Cheapest” now says “Lowest fare”, so your CSS selector points at nothing. The logic of your task (finding a cheap flight) remains valid. But your instructions assumed a world that no longer exists.

You fix it.

A week later, something else changes. You fix that, too. This is brittle automation; it works perfectly in narrow conditions but breaks if anything shifts.

Consider a different approach:

Instead of writing code that clicks a specific element, you give an AI agent a goal: “Find the cheapest flight to Tokyo this month”.

In the best case, the agent can infer that “Lowest fare” is equivalent to “Cheapest” and choose the right control. Semantic robustness doesn’t happen automatically. Without explicit fallbacks and label-matching strategies, agents fail when UIs change. Even minor layout shifts, pop-ups, or accessibility features break them.
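
Here is a minimal sketch of that fallback idea, assuming a hypothetical browser-driver object called page: instead of one hard-coded selector, the automation tries a list of semantically equivalent labels before giving up.

CHEAPEST_LABELS = ["Cheapest", "Lowest fare", "Lowest price", "Best price"]

def click_sort_by_cheapest(page) -> bool:
    # Try each known label for the "cheapest" control before giving up.
    for label in CHEAPEST_LABELS:
        element = page.find_by_text(label)  # hypothetical lookup by visible text
        if element is not None:
            element.click()
            return True
    return False  # nothing matched: surface the failure instead of breaking silently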

Comparing Agents against Robotic Process Automation (RPA) helps…

RPA scripts follow predetermined steps, while agents try to understand what they see. But agents still depend on page structure and labels, not real human understanding, so they break on messy pages.

This is the shift:

Instead of telling software how to do something step by step, you tell it what you want. The agent figures out how.

That’s not magic.

Behind the scenes, the agent breaks your goal into steps, tries them out, watches the results, and adjusts. When something fails, it can retry if you’ve set up error handling and backup plans. Without those, agents often loop on the same failure or stop at a step limit.

The difference from traditional code isn’t that agents are smarter. It’s that they’re designed to navigate ambiguity rather than demand certainty.

In this newsletter, we’ll open the hood on how agents actually work.

We’ll follow one example: booking a flight.

This will show how agents work, how they think, what tools they use, how they remember, and what breaks in real use. By the end, the mystique should be gone.

You’ll see the machine for what it is.

Onward.


🎁 Giveaway Alert (Valid for 24 hours only)!

Win a copy of the software engineering classic book—The Pragmatic Programmer.

To get it for FREE:

  1. Sign up for CodeRabbit by clicking here

  2. Fill out this tiny form

(Three winners will be selected at random.)


Cut Code Review Time & Bugs in Half (Sponsor)

Code reviews are critical but time-consuming.

CodeRabbit acts as your AI co-pilot, providing instant code review comments and the potential impacts of every pull request.

Beyond just flagging issues, CodeRabbit provides one-click fix suggestions and lets you define custom code quality rules using AST Grep patterns, catching subtle issues that traditional static analysis tools might miss.

CodeRabbit has so far reviewed more than 10 million PRs, has been installed in 2 million repositories, and has been used by 100 thousand open-source projects. CodeRabbit is free for all open-source repos.

GET STARTED NOW


I want to introduce Sairam Sundaresan as a guest author.

He’s an AI engineering leader with 15+ years of experience building AI and ML systems in industry. He likes to teach AI through illustrations and stories, and he turned the same style into a published Bloomsbury book, AI for the Rest of Us.

Check out his book and writing:

  • AI for the Rest of Us (Book)

  • Gradient Ascent (Substack)

  • LinkedIn

  • X

These days, he leads AI work, advises teams on practical systems, and writes and illustrates to make AI feel less like magic and more like engineering.


What Is an Agent?

Start with the simplest distinction: a chatbot talks; an agent does.

Ask a ‘chatbot’ to book you a flight, and it explains how: what to search, what to compare, maybe a link. Helpful, but nothing changes.

Ask an ‘agent’ to book a flight. It opens a browser, calls APIs, compares prices, and filters results based on your constraints. It checks whether seats are available. If a payment error happens, it tries again. Once successful, it confirms the booking and sends you the receipt. When it’s done, you'll have a ticket.

That side effect (changing the world) is what separates agents.

The Digital Intern

A helpful way to think about agents: treat them like a smart but inexperienced intern.

You wouldn’t micromanage an intern’s keystrokes. Instead, you’d say:

“Book me a flight to Tokyo, under $500, in the next month. Use the corporate card. Email me the confirmation.”

You set the goal and give the intern what they need to work: access to the travel site, payment details, and an email address. You also set limits, such as your budget, airline preferences, and a rule against red-eye flights. They figure out the rest…

An agent works the same way.

You don’t script every click. You specify what success looks like and what’s off-limits. The agent handles the intermediate decisions.

But here’s the caveat the analogy misses: an intern makes mistakes at human speed. One task at a time, with natural pauses, is easy to catch.

An agent’s mistakes scale instantly. Give it access to your email with a vague instruction like “follow up with leads,” and it might send 200 bizarre messages before anyone notices.

Agents need real boundaries enforced in code: rate limits, allow lists, and verification gates. Telling them in prompts isn’t enough.
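
Here is a minimal sketch of what “enforced in code” can look like, with illustrative names (guarded_send_email, ALLOWED_DOMAINS): an allow list and a rate limit wrapped around every outgoing email, independent of anything the prompt says.

import time

ALLOWED_DOMAINS = {"example.com"}   # allow list: who the agent may email
MAX_EMAILS_PER_HOUR = 10            # rate limit, enforced outside the prompt
_sent_timestamps: list[float] = []

def guarded_send_email(to: str, subject: str, body: str) -> None:
    # Reject the action in code; never rely on the prompt alone to say no.
    domain = to.rsplit("@", 1)[-1]
    if domain not in ALLOWED_DOMAINS:
        raise PermissionError(f"{to} is outside the allow list")
    now = time.time()
    recent = [t for t in _sent_timestamps if now - t < 3600]
    if len(recent) >= MAX_EMAILS_PER_HOUR:
        raise RuntimeError("hourly email rate limit reached")
    _sent_timestamps.append(now)
    print(f"[stub] would send '{subject}' to {to}")  # real delivery goes here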

What Makes Something an Agent

Three characteristics define the difference:

1. Autonomy

It means the agent figures out intermediate steps on its own. You said, “Book a flight under $500.”

The agent determines what needs to happen.

Things fail: airline sites time out, fares disappear. The agent can adapt if you’ve given it recovery strategies. Otherwise, it gets stuck or keeps trying the same thing.

2. Proactiveness

It means taking the initiative when information is missing.

Imagine the agent finds two flights at the same price with different layovers:

Can it check your calendar or ask you? Only if you’ve permitted it and told it to ask questions. Otherwise, it picks one at random. A proactive agent asks instead of guessing and hoping.

Recognition of missing input is a design feature (e.g., uncertainty thresholds and “ask‑for‑help” triggers), not an innate property.

3. Action

Taking action through tools distinguishes agents from chatbots.

An agent calls APIs, automates browsers, executes code, and sends emails, all of it subject to strict scoping and validation. These capabilities are mediated through defined interfaces: tools with clear inputs and outputs.

The agent decides what to do; the tools make it happen in the real world.

Specifying Your Agent: A Checklist

Before you build, you need to answer four questions…

There’s a framework from classical AI called PEAS that makes this concrete.

Using our flight booker example:

1. Performance:

How do you know it worked?

“Round-trip to Tokyo, under $500, departing within 30 days, confirmation email received.”

2. Environment:

Where does it operate? Public airline sites, a flight aggregator API, and your corporate email system.

What’s in bounds, what’s not? Can it access your calendar? Your credit card? Define the playing field.

3. Actuators:

What can it do? Search flights, click buttons, fill forms, make purchases, and send emails.

Each capability is a tool you’re handing over. More tools mean more power, and more risk.

4. Sensors:

How does it perceive the results? Reading page content, parsing API responses, and detecting confirmation numbers in emails.

If it can’t observe the outcome of an action, it can’t reason about what to do next.

Get these wrong, and the agent wanders! Get them right, and you have a system you can actually build and debug.
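
One practical way to force these answers is to write them down as data before writing any agent code. A sketch, with invented field names:

from dataclasses import dataclass

@dataclass
class AgentSpec:
    # PEAS captured as a plain config object (illustrative, not a framework API).
    performance: list[str]  # how success is measured
    environment: list[str]  # where the agent is allowed to operate
    actuators: list[str]    # tools it may use to act
    sensors: list[str]      # what it can observe

flight_booker = AgentSpec(
    performance=["round-trip to Tokyo", "under $500", "departs within 30 days",
                 "confirmation email received"],
    environment=["airline sites", "flight aggregator API", "corporate email"],
    actuators=["search_flights", "fill_forms", "purchase", "send_email"],
    sensors=["page content", "API responses", "confirmation emails"],
)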


The Brain

A common mistake: treating the Large Language Model (LLM) as a knowledge base.

Think of the LLM as a processor, not as a hard drive. It takes input, reasons, and produces output. Unlike a CPU, it’s probabilistic: good with messy language, imperfect with precise facts.

For our flight booking agent, this means: don’t expect the LLM to “know” which airlines fly to Tokyo or current prices. It doesn’t.

Instead, it can propose next steps: search for flights, compare prices, and check seat availability. It can try to interpret what it sees.

But this breaks easily on complex pages. You fix this by using APIs instead of raw HTML, adding accessibility features, and setting guardrails.

The practical implication: the LLM is the reasoning engine, not the source of truth.

When your agent needs facts (current prices, seat availability, your company’s travel policy), it gets them from tools and memory. The LLM’s job is to figure out what to look up, how to interpret results, and what to do next. Reasoning, not retrieval.

Living with Non-Determinism

Ask an LLM the same question twice, and you might get different answers.

Not wrong answers, just different phrasings, different emphasis, occasionally different conclusions. This isn’t a bug. It’s how probabilistic systems work.

You can dial this down.

A parameter called “temperature” controls randomness: lower means more predictable. But you’ll never get the perfect repeatability of traditional code, and that’s okay.

This means you design differently:

  • For ‘low-stakes’ decisions (which search results to show first), variance is fine.

  • For ‘high-stakes’ actions (actually purchasing a flight), you add verification.

Before the agent charges your card, it should confirm key details: flight, price, and total. Then proceed. The fuzzy reasoning can propose actions. The final execution needs checkpoints.
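
Here is a hedged sketch of that checkpoint, where approve is whatever confirmation channel you have (a human prompt, a Slack approval, a policy rule) and the flight details are made up:

def confirm_purchase(flight: dict, approve) -> bool:
    # Checkpoint before the irreversible step: something outside the LLM must approve.
    summary = f"{flight['route']} on {flight['date']} for ${flight['price']}"
    return bool(approve(summary))

# Usage sketch: the LLM proposes, the checkpoint decides.
proposed = {"route": "SFO-TYO round trip", "date": "2026-01-15", "price": 450}
if confirm_purchase(proposed, approve=lambda s: input(f"Buy {s}? [y/N] ").lower() == "y"):
    print("proceed to charge the card")
else:
    print("stopped: not confirmed")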

The Context Window: Working Memory

If the LLM is the processor, the context window is its RAM.

This is everything the agent can ‘see’ at once. It includes your original request, the conversation so far, recent tool outputs, and any instructions you’ve given it.

Current models range from around 128,000 to over 1 million tokens of context. Running out of space isn’t the main concern anymore.

The real limit is attention:

As the context grows, the agent has trouble finding and using specific details buried in the middle.

Your flight agent searches three airlines. Each returns a page of results: flight times, prices, layovers, seat classes. Add the original instructions, the reasoning so far, and suddenly you’re managing memory.

When the context window fills up, the agent effectively forgets. Old information drops off. If the agent searched Airline A ten steps ago and now needs to compare it to Airline C, those Airline A results might be gone.

Context collapse is a major problem…

The agent loses track of earlier constraints, such as budget caps or airline preferences.

Fix this by summarizing information, saving details separately, and repeating your key rules at each step.

Here’s how it works with flights:

  • Compress 47 results into the top 3 that fit your budget.

  • Save full details separately.

  • Restate your constraints before the agent completes the purchase.
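
A small sketch of those three moves, with invented helper names: compress results before they re-enter the prompt, stash the full data outside the context, and restate the constraints on every step.

CONSTRAINTS = "Round-trip to Tokyo, under $500, departing within 30 days, no red-eyes."

full_results_store: dict[str, list[dict]] = {}  # full details live outside the prompt

def compress_results(search_id: str, results: list[dict], budget: int) -> str:
    # Keep only the top few in-budget options in context; save everything else.
    full_results_store[search_id] = results
    in_budget = sorted((r for r in results if r["price"] <= budget),
                       key=lambda r: r["price"])[:3]
    lines = [f"{r['airline']}: ${r['price']}, {r['hours']}h total" for r in in_budget]
    return "Top options:\n" + "\n".join(lines)

def build_step_prompt(step_note: str, compressed: str) -> str:
    # Restate the hard constraints at every step so they never fall out of context.
    return f"{CONSTRAINTS}\n\n{compressed}\n\nNext: {step_note}"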


The Loop

The engine that makes agents work is called ReAct: Reason plus Act.

It’s simpler than it sounds.

Let’s trace it through our flight booking:

You’ve told the agent: “Book me a round-trip flight to Tokyo, under $500, in the next month.”

The agent doesn’t immediately start clicking around. First, it reasons: “Tokyo, under $500, flexible dates. I’ll search flights first.”

Then it acts: it calls the flight search API with the following parameters (destination: Tokyo, dates flexible, max price: $500).

The API returns results. But these results don’t go straight to you. They go back to the agent for an observation. The agent now sees: “12 flights found. Prices range from $420 to $890. Six are within budget.”

Now it reasons again: “I have options within budget. But I should check which ones have good layover times. Let me filter for flights under 20 hours’ total travel time.”

It acts again: filters the results, or maybe calls another tool to get more details on the top candidates.

This cycle (reason, act, observe, reason again) continues until the agent reaches the goal:

  • Search flights.

  • Observe results.

  • Filter options.

  • Observe the filtered list.

  • Check seat availability on the top choice.

  • Observe that seats exist.

  • Initiate purchase.

  • Observe confirmation.

  • Send email.

  • Done.

The key insight:

The agent isn’t following a pre-written script; each step depends on prior observations.

ReAct helps agents adapt, but it’s not reliable.

Early mistakes (misread prices, missed constraints) pile up and break later steps. Mitigate with checkpoints, structured plans for known phases, and verification before irreversible actions.
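
Here is a stripped-down sketch of the loop itself, with the step budget that keeps it from running forever. The llm callable and the tools are stand-ins; the shape (reason, act, observe, repeat) is the part that matters.

def react_loop(goal: str, llm, tools: dict, max_steps: int = 10) -> str:
    # Reason -> act -> observe until the model signals it is finished.
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):  # hard step budget: no infinite loops
        decision = llm(transcript)  # stand-in: returns {"thought", "tool", "args"} or {"final": ...}
        if "final" in decision:
            return decision["final"]
        transcript += f"Thought: {decision['thought']}\n"
        tool = tools.get(decision["tool"])
        if tool is None:
            observation = f"error: unknown tool {decision['tool']!r}"
        else:
            try:
                observation = tool(**decision["args"])  # act
            except Exception as exc:  # surface failures as observations, not crashes
                observation = f"error: {exc}"
        transcript += (f"Action: {decision['tool']}({decision['args']})\n"
                       f"Observation: {observation}\n")
    return "stopped: step limit reached"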

When the Path Is Known

ReAct shines when things are uncertain:

Will the search find anything?

Are seats available?

What errors will pop up?

You don’t know, and ReAct can adapt. But sometimes you know exactly what needs to happen.

An alternative pattern is called Plan-and-Execute.

The agent writes out a complete plan upfront:

  1. Search Tokyo flights for next month.

  2. Filter for under $500.

  3. Rank by total travel time.

  4. Check seat availability on the top 3.

  5. Book the best available.

  6. Email confirmation.

Then it executes each step in sequence.

This is faster: fewer round-trips to the LLM, less “thinking” between steps. And the plan is reviewable. You can see what the agent intends before it starts.

The downside: rigidity.

If step 3 reveals something unexpected (say, all flights under $500 have 30-hour layovers), the plan doesn’t adapt. The agent isn’t reasoning between steps, so it can’t course-correct without scrapping the plan and starting over.
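
For contrast, a sketch of the Plan-and-Execute shape: the plan is produced once, is reviewable as plain data, and then runs step by step with no reasoning in between, which is exactly where the rigidity comes from.

def plan_and_execute(goal: str, llm, tools: dict) -> list:
    # Ask for the whole plan up front (stand-in: the llm returns an ordered
    # list of (tool_name, args) pairs), then run it in order.
    plan = llm(f"Write an ordered list of tool calls to achieve: {goal}")
    results = []
    for tool_name, args in plan:  # no reasoning between steps: fast but rigid
        results.append(tools[tool_name](**args))
    return results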

The Hybrid Approach

In practice, most systems combine both:

You create a plan with clear checkpoints and ways to handle failures. Within each checkpoint, you let the agent think through problems, but with limits.

  • “Search and filter” is one phase; within that phase, the agent uses ReAct to handle the messiness of search results.

  • “Book and confirm” is another phase; again, ReAct handles the payment flow, error recovery, and confirmation.

You get the structure of a plan with the adaptability of dynamic reasoning.

The flight agent knows the overall shape of what it’s doing, but can navigate surprises within each segment.

Checkpoints are up to you. You might add one between search and purchase, where a human or system verifies the choice before buying. Without it, runaway errors are common.
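
A sketch of the hybrid shape, reusing the react_loop sketch from earlier: the outer structure is a fixed sequence of phases with a checkpoint between them, and each phase runs an adaptive loop internally (all names are illustrative).

def run_hybrid(llm, tools: dict, confirm) -> None:
    # Phase 1: adaptive search-and-filter (ReAct inside the phase).
    shortlist = react_loop("Find Tokyo round-trips under $500 and shortlist the best 3",
                           llm, tools)
    # Checkpoint: a human or a rule verifies the choice before anything irreversible.
    if not confirm(shortlist):
        return
    # Phase 2: adaptive booking flow (payment, retries, confirmation email).
    react_loop(f"Book the approved option and email the confirmation: {shortlist}",
               llm, tools)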


Tools and Memory

An LLM by itself can only produce text.

To actually book a flight (to search, click, purchase, email), it needs tools.

Here’s what actually happens when an agent uses a “tool”:

The LLM doesn’t browse the web or execute code directly. Instead, it uses native tool calling. You define a set of allowed functions in your API request, and the model switches into a specialized mode to “call” them.

  • It stops writing and makes a structured request.

  • It asks for a specific function and provides the exact information it needs.

  • External code then takes that JSON, validates it, executes the real operation, and returns the result.

For our flight agent, you might define a tool like this:

{
  "name": "search_flights",
  "description": "Search for available flights",
  "parameters": {
    "origin": "string",
    "destination": "string",
    "departure_date": "string",
    "return_date": "string",
    "max_price": "number"
  }
}

When the agent decides to search for flights, it doesn’t somehow access an airline’s website. It outputs:

{
  "name": "search_flights",
  "parameters": {
    "origin": "SFO",
    "destination": "TYO",
    "departure_date": "2026-01-15",
    "return_date": "2026-01-22",
    "max_price": 500
  }
}

Your code receives this:

  • Validates it

    • Are all required fields present?

    • Are the types correct?

  • Calls the actual flight API.

  • And returns results to the agent.

Sometimes the agent makes up parameters, like asking for ‘seat preference: window’ when that option doesn’t exist. Good schema validation catches this and returns an error.

The system should then try a different approach rather than just repeating the same mistake.
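
A sketch of that validation layer between the model’s JSON and the real API. The field names match the search_flights schema above; the rest is illustrative.

REQUIRED_FIELDS = {"origin": str, "destination": str, "departure_date": str,
                   "return_date": str, "max_price": (int, float)}

def validate_search_call(params: dict) -> list[str]:
    # Return a list of problems; an empty list means the call is safe to execute.
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in params]
    for name, expected in REQUIRED_FIELDS.items():
        if name in params and not isinstance(params[name], expected):
            errors.append(f"wrong type for {name}")
    errors += [f"unknown field: {name}" for name in params if name not in REQUIRED_FIELDS]
    return errors

def handle_search_call(params: dict, flight_api) -> dict:
    problems = validate_search_call(params)
    if problems:
        # Feed the errors back as the observation so the agent can correct itself.
        return {"error": "; ".join(problems)}
    return flight_api(**params)  # flight_api is a stand-in for the real integration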

Memory: Beyond the Conversation

The context window is the agent’s short-term memory: what it can hold in mind during a single task.

But what about information that spans sessions or exceeds what fits in context?

This is where “external memory” comes in.

The dominant approach is called RAG (Retrieval-Augmented Generation). Don’t load everything into the context window at once. Store information in a database and fetch only what’s relevant for what the agent is doing right now.

For our flight agent, imagine it needs to check your company’s travel policy:

The full policy is 50 pages, way too much for context. Instead, the policy is stored in a vector database. When the agent needs to know the maximum flight budget for Asia, it searches the database.

The database returns the relevant section: “Asia-Pacific travel: economy class, maximum $600 per segment, advance booking required.”

That excerpt gets injected into the context. The agent reasons with it, then moves on.

The interesting part is how retrieval works:

Traditional databases match keywords. Vector databases match meaning: they convert text into numerical vectors, and similar concepts end up close together in that space. This means ‘budget limit’ and ‘maximum cost’ are treated as similar ideas.

When you search for ‘budget’, you’ll find sections about ‘maximum cost’, even if they don’t use the word ‘budget’.

This approach lets agents handle lots of information without overloading the system.

Company policies, past bookings, and user preferences are all stored separately and retrieved only when needed. The agent knows how to ask for what it needs. The retrieval system finds it.
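
A toy sketch of meaning-based retrieval: policy chunks and the query are embedded, and the closest chunk by cosine similarity is what gets injected into the context. Here embed is a stand-in for whatever embedding model you use.

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query: str, chunks: list[str], embed) -> str:
    # Return the policy chunk whose embedding sits closest to the query's.
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    return max(scored)[1]  # best-scoring chunk gets injected into the context

# 'budget' can match 'maximum $600 per segment' even without shared words,
# because the embeddings put the two phrases close together.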

For flight booking, memory might include:

  • Short-term: current search constraints, options found so far, steps taken

  • Long-term: your airline preferences (aisle seat, no red-eyes), corporate travel policy, past trips, and their costs

The agent accesses long-term memory as needed. Short-term memory stays focused on the current task.

When the context window fills up, the agent summarizes or discards older information.


When Things Get Complex

One agent can handle a lot. But push the scope far enough, and you hit a ceiling.

Ask a single agent to handle everything at once:

  • Research flight options.

  • Check your calendar for flexibility.

  • Review the corporate travel policy.

  • Compare loyalty programs.

  • Book the best flight.

  • Draft the expense report.

All from one prompt. The agent starts to stumble…

Instructions get forgotten. Reasoning gets muddled. It might book a flight that violates policy because by the time it got to booking, it had lost track of the policy constraints.

This is cognitive overload: prompts too long, responsibilities too varied, context too fragmented.

Without structured state and role separation, agents forget constraints and misapply tools.

The solution: specialization.

Instead of one agent doing everything, you use multiple agents with a narrow focus.

Manager and Workers

The pattern that works: a manager agent that plans and delegates, connected to worker agents that execute.

For our flight booking:

The manager receives the goal: “Book an optimal flight to Tokyo under $500, compliant with policy, with expense documentation.”

It doesn’t do the work itself. Instead, it breaks the task into pieces and assigns them:

  • Research agent: find flight options matching these constraints.

  • Policy agent: check the top options against our travel policy.

  • Booking agent: purchase the compliant option and send confirmation.

Each worker has a focused job and limited tools…

  • Research agent has access to flight search APIs but can’t make purchases.

  • Booking agent can purchase, but doesn’t need to search.

  • Policy agent has access to the company policy database, but nothing else.

The manager coordinates handoffs with explicit schemas:

research → policy → booking

When something fails (no compliant flights found), the manager follows defined backup plans. It might expand the date range or relax constraints. It doesn’t just improvise.
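
A sketch of that manager-and-workers shape: each worker is just an agent with a narrow prompt and a small tool set, and the manager wires their outputs together with a defined fallback (all names are illustrative).

def run_trip_booking(goal: str, research, policy, booking) -> dict:
    # Manager: decompose, delegate with explicit handoffs, apply fallbacks.
    options = research(constraints=goal)          # worker: search tools only
    compliant = policy(options=options)           # worker: policy database only
    if not compliant:
        # Defined fallback, not improvisation: widen the date range and retry once.
        options = research(constraints=goal + " (flexible dates, +/- 7 days)")
        compliant = policy(options=options)
        if not compliant:
            return {"status": "escalate to human", "reason": "no compliant flights"}
    return booking(choice=compliant[0])           # worker: purchase and email only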

This mirrors how human organizations work.

The manager doesn’t need to know how to query the flight API. They need to know what information is needed and who can get it.

Each specialist goes deep on their piece without being distracted by the rest.


When Not to Use Agents

Agents are powerful, but they’re not always the right tool.

Sometimes they’re overkill… Sometimes they’re the wrong shape entirely.

Deterministic, unchanging workflows:

If the flight API is stable and you use the same search, filters, and booking flow every time, stick with a regular script. It’s faster, cheaper, and more reliable.

Agents shine in ambiguity. No ambiguity, no advantage!

Hard real-time requirements:

Agents are slow.

Multiple LLM calls, tool execution, reasoning loops: this takes seconds, sometimes tens of seconds. If you need a response in 100 milliseconds, agents aren’t the architecture.

When you can’t afford errors:

Agents make mistakes.

They hallucinate, misinterpret, and take wrong turns. For most tasks, you catch and correct. For some (financial transactions with no undo, medical decisions, legal filings), the error tolerance is zero. Agents can assist humans in these domains.

They shouldn’t act autonomously.

Trivial tasks:

“Convert this temperature from Celsius to Fahrenheit.”

You don’t need an agent. You just need a function.

Ask one question:

Will this task require handling ambiguity, adapting to changes, or making judgment calls?

If yes, agents make sense. If not, you probably want something simpler.


The Reality Check

Everything so far works in demos. Production is harder.

Three problems dominate:

1. Latency

Agents are slow.

A single ReAct cycle might involve three or four calls to an LLM, each taking one to three seconds.

Your flight agent searches, reasons, filters, reasons, checks availability, reasons, books. That’s potentially 20+ seconds before the user sees a result.

The mitigation is transparency…

Stream the agent’s progress: “Searching flights...” “Found 12 options, filtering by price...” “Checking availability on top choices...” “Booking confirmed.”

The user sees motion, not a frozen screen.

This doesn’t make the agent faster. It makes the latency tolerable.

2. Cost

Every reasoning step burns tokens.

Every tool call has overhead. Say an agent explores every option thoroughly. It compares airlines, tries different date combinations, and checks policy compliance. That single task could use 50,000 tokens. At current API prices, that’s real money. Multiply by thousands of users, and the costs compound.

The mitigation is caching.

Imagine someone searched for flights to Tokyo yesterday. Today, another person searches with similar requirements.

Does the agent need to think through everything again? No. A semantic cache recognizes that “flights to Tokyo under $500” and “Tokyo flights, budget $500” are the same query. It returns the cached result. The expensive agent never runs.

For information that changes (such as flight prices), you set expiration times. For information that doesn’t change (such as your company’s travel policy), the cache lives longer.

The goal: avoid redundant reasoning without serving stale results.
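
A sketch of a semantic cache along those lines: queries are matched by embedding similarity rather than exact text, and each entry expires based on how volatile the underlying data is. Both embed and similarity are stand-ins for whatever you use.

import time

class SemanticCache:
    # Match queries by meaning, not exact wording, with per-entry expiry.
    def __init__(self, embed, similarity, threshold: float = 0.9):
        self.embed, self.similarity, self.threshold = embed, similarity, threshold
        self.entries = []  # list of (vector, value, expires_at)

    def get(self, query: str):
        vec, now = self.embed(query), time.time()
        for stored_vec, value, expires_at in self.entries:
            if now < expires_at and self.similarity(vec, stored_vec) >= self.threshold:
                return value  # cache hit: the expensive agent never runs
        return None

    def put(self, query: str, value, ttl_seconds: float) -> None:
        # Short TTL for volatile data (fares), long TTL for stable data (policy).
        self.entries.append((self.embed(query), value, time.time() + ttl_seconds))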

3. Reliability

Agents fail in ways that don’t show up in stack traces.

Hallucination:

The agent confidently asserts something false.

  • It “finds” a $300 flight that doesn’t exist.

  • It “confirms” a booking that never went through.

It’s generating plausible outputs based on patterns, and sometimes, plausible isn’t accurate.

Loops:

The agent gets stuck.

It tries a tool, sees an error, tries the same tool the same way, and sees the same error. Without intervention, it loops until context fills up or the budget is exhausted.

Context collapse:

Midway through a complex task, the agent loses track.

It forgets the budget constraint and books a $900 flight. It forgets it already searched Delta and searches again. The context window filled up; earlier information dropped off.

The mitigation for these problems is operational:

Trace everything.

Not just inputs and outputs, but the full chain of reasoning.

Every thought, every tool call, every observation. When something fails, you need to see where the logic derailed. “The agent booked the wrong flight” isn’t debuggable. “The agent correctly filtered to three options, then hallucinated that Option 2 was cheapest when Option 1 was” tells you what to fix.
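
A sketch of what “trace everything” can mean in practice: one structured record per step, appended as the loop runs, so a failed run can be replayed thought by thought (the field names are illustrative).

import json
import time

def record_step(trace_file: str, step: int, thought: str, action: str,
                args: dict, observation: str) -> None:
    # Append one reasoning step as a JSON line; cheap to write, easy to replay.
    entry = {"ts": time.time(), "step": step, "thought": thought,
             "action": action, "args": args, "observation": observation}
    with open(trace_file, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")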

Force grounding.

When the agent makes a factual claim, require it to cite the source:

“This flight is $450 (source: search result #3, timestamp 10:42 am).”

If it can’t point to where it got information, flag the response as unverified. This doesn’t eliminate hallucination, but it makes it detectable.

Add checkpoints.

Before irreversible actions (such as charging a card or sending an email), pause for confirmation.

The agent proposes; a human or verification system approves. This can catch errors before they become problems, but only if you implement it; it’s not a default feature.

These practices (strict validation, grounded citations, confirmation checkpoints) are the baseline safety stack. But these protections aren’t enough when the stakes are high. You also need defenses against prompt injection, tool misuse, and logic errors. And you must strictly limit what the agent can access and do.

Even with ReAct and these mitigations, complex multi‑step tasks still fail frequently because of error propagation. Design for containment with timeouts, budgets, and circuit breakers. Build in observability. Require human review for irreversible steps.

None of this is magic.

It’s engineering: logging, validation, caching, and confirmation steps.

The agent’s probabilistic nature means you can’t test your way to certainty. You build systems that handle uncertainty gracefully.


Closing

We started with a broken scraper: a script that failed the moment a button changed its name. The agent approach doesn’t eliminate complexity... It relocates it.

Instead of hard-coding every possible path, you define goals, provide tools, and set boundaries. Then you let the agent figure out the rest.

The flight booker we traced isn’t magic. It combines four components: a reasoning loop (ReAct), tools for taking action, memory for context, and constraints such as validation and budgets.

The LLM isn’t a database but a reasoning engine: fuzzy, probabilistic, sometimes wrong. You pair it with retrieval for facts, validation for safety, and tracing for debugging. You design for the variance instead of pretending it doesn’t exist.

That’s the mental model.

The specific frameworks will evolve: new tools, better models, different patterns. But the core architecture won’t: a reasoning engine that thinks, acts, observes, and adapts. The mystique is gone.

Now you see how the machine works.


👋 I’d like to thank Sairam for writing this newsletter!

Plus, check out his book and writing:

  • AI for the Rest of Us (Book)

  • Gradient Ascent (Substack)

These days, he leads AI work, advises teams on practical systems, and writes and illustrates to make AI feel less like magic and more like engineering.


I launched Design, Build, Scale (newsletter series exclusive to PAID subscribers).

When you upgrade, you’ll get:

  • High-level architecture of real-world systems.

  • Deep dive into how popular real-world systems actually work.

  • How real-world systems handle scale, reliability, and performance.

10x the results you currently get with 1/10th of your time, energy, and effort.

👉 CLICK HERE TO ACCESS DESIGN, BUILD, SCALE!


If you find this newsletter valuable, share it with a friend, and subscribe if you haven’t already. There are group discounts, gift options, and referral rewards available.


Author Neo Kim; System design case studies
👋 Find me on LinkedIn | Twitter | Threads | Instagram

Want to reach 200K+ tech professionals at scale? 📰

If your company wants to reach 200K+ tech professionals, advertise with me.


Thank you for supporting this newsletter.

You are now 200,001+ readers strong, very close to 201k. Let’s try to reach 201k readers by 11 January. Consider sharing this post with your friends to earn rewards.

Y’all are the best.
