Hey y'all I'm starting a new series on how production ai agents actually work under the hood
In this series I'll cover some topics in 4 phases
- Runtime internals
- memory and states
- multi agent orchestration
- production systems
Breaking down how systems like Claude, Cursor, etc. are actually architected.
Starting from tomorrow
here is where Swarm breaks and it breaks in specific, predictable ways.
TERMINATION PROBLEM
without a supervisor to decide "we are done,"
the system needs explicit exit conditions.
max iterations. quality thresholds. timeout-based
convergence. if none of these are defined carefully,
agents hand off in a loop indefinitely. too aggressive a termination condition produces incomplete results. too conservative burns tokens until the budget runs out.
this is the most common production failure in Swarm based systems. teams define the agents. they forget to define when to stop.
DEBUGGING PROBLEM
with a Supervisor Pattern, the execution history is centralized. one node made every routing decision. one place to look.
with Swarm, tracing a failed task means
reconstructing the handoff chain from distributed logs.which agent had control at step 7.
what context it received. what it decided to hand off. one article compared it to debugging an eventually consistent distributed database you need distributed tracing tooling from day one,not as an afterthought.
WHEN SWARM ACTUALLY WINS
> exploration tasks where the optimal path is unknown and no supervisor could predetermine the routing
> customer service triage where query type determines routing and no global state is needed
> tasks where agents are genuine peers with equal authority no natural hierarchy exists
when tasks have strict ordering requirements, need transactional guarantees, or require a global view of progress use Supervisor.
the pattern is not better or worse than Supervisor. it is the right tool for a different class of problem.
Sources:
https://t.co/kf0opf600R
https://t.co/M9RjFhrYB8
https://t.co/62m8rmU2qG
https://t.co/0L1oDGkjcV
Day 13/20 of AI Agent Systems Series
ARC 3: Multi-Agent Orchestration - The Swarm Pattern
the Supervisor Pattern introduces a central coordination point: the supervisor.
if the supervisor makes a bad routing decision,every downstream agent acts on it . if it becomes a bottleneck at scale, every worker waits behind it.
the Swarm Pattern removes the supervisor entirely.agents hand off to each other based on context,with no central coordinator deciding who goes next.
OpenAI popularized this pattern through the Swarm framework in October 2024. Swarm was later superseded by the OpenAI Agents SDK, which kept the same core handoff model while adding production features. same conceptual model, adds guardrails, tracing, and TypeScript support on top.
the pattern survived the deprecation. the primitives are in production at scale.
but most people have a fundamental misconception about how it works.
the misconception first: Swarm is not parallel.
this catches most people.
in the Swarm pattern, only one agent is active at any given time. it is sequential control transfer, not concurrent execution. agent A runs, decides it needs to hand off, passes control to agent B, agent B runs. one active agent throughout.
fan-out parallelism multiple agents running simultaneously on independent sub-tasks requires a coordinator. Swarm explicitly removes the coordinator, so it does not give you parallelism. it gives you decentralized sequential routing.
here is how the pattern actually works.
two primitives only:
AGENTS
an agent is a system prompt plus a list of functions. that is it. the functions define what the agent can do and who it can hand off to.
HANDOFFS
a handoff is a function that returns a different agent object. when the current agent calls that function,the framework switches the active agent and continues.
the entire API surface of the original Swarm:
> define agents with instructions and tools
> define handoff functions that return other agents
> call run() with a message
> the framework routes through agents until done
the framework is stateless no persistent state between calls. every handoff must carry all the context the next agent needs in the conversation history. no hidden state. no memory between runs.
OpenAI Agents SDK (March 2025) added what
original Swarm deliberately left out:
> guardrails (input/output validation)
> built-in tracing per handoff
> persistent state management
> TypeScript support
the pattern is identical. the production runtime
is the Agents SDK, not the original Swarm repo.
Windsurf is now Devin Desktop
Same Windsurf ide just unified under Devin
> Manage all your local and cloud agents from one Kanban view
> Introduces Spaces group sessions, PRs and files so agents share context
> Supports ACP run Codex, Claude Code, OpenCode or your own agents inside it
> Plan locally, hand off to cloud Devin keeps working after you close your laptop