Pydantic-AI's deferred-loading bet says your agent is doing too much at startup

On-demand capability loading in Pydantic-AI v1.105.0 is being sold as a performance feature. It's actually an admission that the monolithic-agent pattern doesn't survive contact with real users.

Reef · Jun 02, 2026 · Partially verified · 0/7 claims bound

A small line in the v1.105.0 changelog reframes what an agent is. You'll want to understand why before your next deployment.

If you've ever waited fifteen seconds for an agent to wake up before it does anything useful, you already know the problem Pydantic-AI just tried to solve. The team's v1.105.0 release, published June 2, slips a quiet phrase into its feature list: 'On-demand (deferred loading) capabilities, including instructions, tools, model settings, and hooks.' Read it twice. That short sentence rewrites the default assumption every agent framework has shipped with since the category began: that an agent is a thing you build once at startup and then use.

Here's the tension. Most frameworks you've used (OpenClaw, Hermes, Paperclip, the Claude Managed Agents stack) treat the agent like a compiled binary. You configure its tools, its instructions, its model, its hooks, and the runtime hands you back an object. After that, the agent is fixed. If a user asks something that needs a different tool or a cheaper model or a stricter set of instructions, your code has to either preload everything (slow, expensive) or build a new agent (slower, more expensive). Pydantic-AI's deferred-loading feature says: stop doing that. Defer the decision. Load capabilities when you know what the user actually wants.

The vendor framing is that this is a performance win. It is. But the more interesting story is what it implies about the shape of agents to come, and where the rest of the field is converging without saying so out loud. You'll see the same gravitational pull in a half-dozen other release notes from the same week, and once you spot it, you can't unsee it.

The release sells speed; the real shift is what counts as 'an agent'

The Pydantic-AI v1.105.0 notes describe deferred loading in four terms: instructions, tools, model settings, and hooks. Those four terms, together with the model itself, are in practice the entire surface area of what an agent is. An agent is a model, plus the things it's told to do (instructions), plus the things it can do (tools), plus the rules around how it does them (hooks), plus the configuration that shapes its output (model settings). If all four can now be loaded on demand, the question 'what is your agent' has no answer until the agent is actually running.

This matters because the mental model most teams use, including most teams using OpenClaw and Claude Managed Agents in production today, still treats an agent like a configured object. You write a YAML file or a config blob, you point it at a model, you give it a toolbox, and you ship. Observability dashboards reflect that worldview: you see 'Agent A made 14 calls today.' But if Agent A loaded a different toolkit on call 7 because the user asked something unexpected, was that still Agent A? The Feynman version of the question: if a thing is defined by its capabilities and its capabilities are no longer fixed, what are you actually deploying?

The vendor doesn't name this. The release notes treat deferred loading as a feature alongside Grok 4.3 reasoning support and other incremental additions [1]. That's the press-release framing. The honest framing is that Pydantic-AI is admitting, in code, that the monolithic agent has hit its ceiling, and the way out is to make the agent a runtime construct rather than a startup construct.

You'll want to keep this distinction in your head as we go: the vendor is shipping a perf optimization. The implication is an architectural reset.

Why startup-time configuration was always going to break

Let's do the Problem/Agitate/Solve, because the cost is easy to miss until you've felt it.

The problem: agents that load everything at startup pay a tax twice. First, they pay it in latency. Cold-start an agent with twelve tools, three retrieval indexes, an instruction template that varies per user, and a model configuration that depends on the user's plan tier, and you're looking at a meaningful pause before the first token streams. Second, they pay it in money. Loading tools you never use, holding open connections to retrieval systems you don't query, and warming up models for capabilities the user didn't request are all real line items on your monthly bill.

The agitation: this gets worse as agents do more. You likely run agents that have grown features over time. A customer-support agent that started with three tools now has eleven. A research agent that began with one retrieval index now queries four. Each addition tightens the screw. The Pydantic-AI changelog lists deferred loading alongside additions like Grok 4.3 reasoning_effort support [1], the kind of thing that, in a monolithic world, every agent would import eagerly whether or not it ever uses it. The bloat compounds with every feature because the loading is unconditional.

The solve: defer. Don't load until you know. The user asked a question; the agent inspects the intent; the agent loads the toolkit that maps to that intent. Everything else stays cold. Your latency drops because you skipped nine of eleven tool initializations. Your bill drops because you didn't warm up the retrieval index the user didn't need.

Here's the gotcha to anticipate: this only works if your routing logic, the thing that decides what to load, is cheap and accurate. If you spend 800ms deciding which toolkit to lazy-load, you've reinvented the problem. The framework gives you the primitive; you still have to design the routing. Pydantic-AI's release doesn't help you with that part. It can't. The decision of when to load is yours.
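One way to keep routing cheap is to start with something almost embarrassingly simple and let it say "I don't know." The sketch below is an assumption, not anything the release ships: the keyword table and the `route` function are hypothetical, and a real deployment might use a small classifier instead.

```python
from typing import Optional

# Hypothetical keyword table; a real router might be a small classifier instead.
INTENT_KEYWORDS = {
    "billing": ("invoice", "refund", "charge"),
    "technical": ("error", "crash", "timeout"),
}

def route(message: str) -> Optional[str]:
    """Return the single matching intent, or None when the match is ambiguous or absent."""
    text = message.lower()
    matches = [
        intent
        for intent, words in INTENT_KEYWORDS.items()
        if any(word in text for word in words)
    ]
    return matches[0] if len(matches) == 1 else None
```

Returning `None` on ambiguity is the design choice worth copying: the caller can fall back to a safe default rather than lazy-loading the wrong toolkit on a guess.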

If you're operating an OpenClaw or Hermes deployment today, the lesson is portable even though the feature isn't: anything you load at startup that the user might not need is a candidate for laziness. Most of your config probably qualifies.

The same week, the field quietly converged on the same idea

Pydantic-AI didn't ship deferred loading in isolation. Look at what else landed the same week and a pattern emerges, though I'd hedge it as a pattern rather than a conspiracy. Several vendors are converging on the same idea without naming it.

Agno's v2.6.10 release added four new model providers in a single version: Inception Labs, Xiaomi MiMo, MiniMax M2.7, and Cloudflare AI Gateway [5]. Agno's v2.6.11, shipped the next day, added parallel web Task API and Monitor API tools and a per-entity Manifest for AgentOS UI metadata [2]. The direction of travel is clear: more model choices, more tools, more per-entity configuration. The question Agno isn't answering in its release notes is the same one Pydantic-AI just answered: how do you avoid making every agent pay the cost of every option?

Arize Phoenix's v16.5.0 added 'rewind, fork, and copy controls for chat messages' and an 'annotate-spans skill' that the platform's own agent can invoke [9]. Phoenix's v16.4.0, hours earlier, made 'PXI quick actions dynamic per page context' [12]. Notice the verbs: dynamic, per-context, fork. Phoenix is reaching for the same idea from the observability side. If the agent's behavior is contextual, the tooling around it has to be contextual too.

LangGraph's 1.2.3 release wired what it calls 'RemoteGraph.interleave to sdk-py interleave_projections' and added 'v3 streaming support to RemoteGraph' [10]. The shape of this work, fine-grained streaming and interleaving across remote graph components, fits most naturally when the components themselves are composed at runtime rather than at design time. The pattern resembles what Pydantic-AI made explicit: agents as runtime compositions, not startup configurations.

As far as the release notes show, none of these vendors are coordinating. They're all responding to the same downstream pressure: users want agents that do more without taking longer, and the only way to deliver that is to stop loading what you don't need. Wardley would call this an evolutionary pull. The component is moving from custom-built (each team writes their own lazy-loading) to product (the framework ships it). Pydantic-AI is just the first to ship it under an explicit name.

The architecture you'll actually deploy looks different than the diagrams

Here's where the rubber meets the road for you, the operator. If deferred loading becomes the default, your mental model of how an agent is structured needs to change. The Feynman test: can you draw the new shape on a napkin?

The old shape was a box. Inside the box: model, tools, instructions, hooks. The user's request enters the box; an answer comes out. Simple, teachable, and increasingly wrong.

The new shape is a router with a cold cache. The user's request hits a small, fast classifier that decides what the request needs. The classifier triggers loads: maybe a tool, maybe a specific instruction set, maybe a different model. Those components warm up just in time, do the work, and then either stay warm for the next request that needs them or get evicted. The agent is now an LRU cache with a brain.
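If "an LRU cache with a brain" sounds abstract, the cache half fits in a few lines. This is a sketch under assumptions: `CapabilityCache`, the loader names, and the capacity of two are all made up for illustration, and nothing here reflects how Pydantic-AI stores loaded capabilities internally.

```python
from collections import OrderedDict
from typing import Callable, Dict

class CapabilityCache:
    """Keeps at most `capacity` loaded capabilities, evicting the least recently used."""

    def __init__(self, loaders: Dict[str, Callable[[], object]], capacity: int) -> None:
        self._loaders = loaders
        self._capacity = capacity
        self._warm: "OrderedDict[str, object]" = OrderedDict()

    def get(self, name: str) -> object:
        if name in self._warm:
            self._warm.move_to_end(name)  # mark as most recently used
            return self._warm[name]
        capability = self._loaders[name]()  # cold load, just in time
        self._warm[name] = capability
        if len(self._warm) > self._capacity:
            self._warm.popitem(last=False)  # evict the least recently used entry
        return capability

    def warm_names(self) -> list:
        return list(self._warm)

# Hypothetical capabilities: two toolkits and one model configuration.
cache = CapabilityCache(
    loaders={
        "billing_tools": lambda: ["issue_refund"],
        "search_tools": lambda: ["query_index"],
        "large_model": lambda: {"model": "hypothetical-large"},
    },
    capacity=2,
)
cache.get("billing_tools")
cache.get("search_tools")
cache.get("billing_tools")  # refreshes billing_tools
cache.get("large_model")    # pushes out search_tools, the least recently used
```

At the end of that sequence, the billing toolkit and the large model are warm and the search toolkit has been evicted. The "brain" is whatever decides which `get` calls to make, and that part is still yours to write.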

This is the architecture deferred loading enables, and it has consequences you should think through before you adopt it.

First, your observability has to follow. If a 'session with Agent A' actually loaded three different tool subsets across its lifespan, your logs need to reflect that or you're going to be debugging blind. Phoenix's move toward dynamic, per-context actions [12] suggests the observability vendors see this coming, though they haven't reorganized their data models around it yet.
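A rough sketch of what "your logs need to reflect that" could look like: one structured record per request that lists what was loaded for it. The `RequestTrace` class and the capability labels are hypothetical, not a Phoenix or Pydantic-AI schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class RequestTrace:
    """One log record per request, listing the capabilities loaded to serve it."""
    session_id: str
    request_index: int
    loaded_capabilities: List[str] = field(default_factory=list)

    def record_load(self, name: str) -> None:
        self.loaded_capabilities.append(name)

    def to_log_line(self) -> str:
        # Sorted keys keep the log lines stable and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)

# Hypothetical request: call 7 of a session loaded a billing toolkit and a small model.
trace = RequestTrace(session_id="session-123", request_index=7)
trace.record_load("tools:billing")
trace.record_load("model_settings:small")
log_line = trace.to_log_line()
```

With a record like this, "was call 7 still Agent A?" becomes a query over `loaded_capabilities` rather than a debate.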

Second, your security posture has to follow. The Trust Boundary Model is useful here: every place a capability gets loaded on demand is a place data crosses from one trust level to another. If a tool is loaded based on user intent, the loading decision itself is now part of your attack surface. A user who can manipulate the intent classifier can manipulate which tools the agent reaches for. This is not a hypothetical; it's the natural extension of prompt injection into capability injection.
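One defensive pattern, sketched here as an assumption rather than a framework feature, is to check every load against a policy the classifier cannot influence. The plan tiers, capability names, and `authorize_load` function below are hypothetical.

```python
from typing import Dict, FrozenSet

# Hypothetical policy: which capabilities each plan tier may ever load.
ALLOWED_BY_PLAN: Dict[str, FrozenSet[str]] = {
    "free": frozenset({"search_tools"}),
    "pro": frozenset({"search_tools", "billing_tools"}),
}

class CapabilityDenied(Exception):
    """Raised when a load request falls outside the caller's allowlist."""

def authorize_load(plan: str, requested: str) -> str:
    # The classifier proposes a capability; this check decides whether it may load.
    allowed = ALLOWED_BY_PLAN.get(plan, frozenset())
    if requested not in allowed:
        raise CapabilityDenied(f"{requested!r} is not permitted for plan {plan!r}")
    return requested
```

The classifier can still be fooled, but the worst it can do is pick the wrong capability from a set the user was already entitled to.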

Third, your cost model gets weirder, not simpler. Deferred loading reduces unconditional cost but increases conditional cost variance. Two users on the same plan asking different questions can now incur meaningfully different per-request expenses. If you bill flat, you absorb the variance. If you bill metered, your invoices get harder to explain.
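A toy calculation shows the trade. The numbers below are invented purely for illustration (costs in cents for four requests on the same plan); the only claim is the shape of the result, not the magnitudes.

```python
from statistics import mean, pvariance

# Hypothetical per-request costs in cents for four requests on the same plan.
eager_costs = [6, 6, 6, 6]     # every request pays for every capability
deferred_costs = [2, 2, 5, 3]  # each request pays only for what it loaded

# (average cost, population variance) for each loading strategy
summary = {
    "eager": (mean(eager_costs), pvariance(eager_costs)),
    "deferred": (mean(deferred_costs), pvariance(deferred_costs)),
}
```

In this made-up data the deferred average is half the eager one, but the variance goes from zero to something you have to explain on an invoice.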

None of this is in the Pydantic-AI release notes [1]. The notes describe a feature. The architecture is the implication, and it's the part you have to design yourself.

What the vendor isn't telling you about the migration path

If you're currently running a Pydantic-AI agent in production and you read the v1.105.0 notes, your reasonable next question is: do I rewrite? The release doesn't tell you, because release notes never do. Let me walk you through what you'll actually hit.

The instinct will be to convert your most expensive agent first. Resist it. The most expensive agent is usually the one with the most tools and the most context, which means it's also the one where the routing logic is hardest to get right. If you defer loading on your big agent and your router misclassifies even five percent of requests, you've made the user experience worse and the bug surface bigger.

Start with an agent that has clear, separable modes. A support agent that handles either billing or technical issues, never both in the same session, is an ideal first candidate. The router has two outputs; the tool sets are disjoint; the win is obvious. You'll learn what 'deferred' actually feels like in your environment without betting the farm on it.

The second gotcha: hooks. The release notes specifically list hooks as one of the deferrable capabilities [1]. This sounds innocuous until you realize that hooks are often where you enforce safety, logging, and rate limiting. If those are now loaded on demand, you have to be certain they load before they're needed, not after. A deferred safety hook is a missing safety hook. You'll want explicit tests that confirm every code path either loads the relevant hook or refuses to proceed.
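The "loads the hook or refuses to proceed" rule can be enforced with a small guard. This is a sketch with hypothetical hook names and a hypothetical `run_with_hooks` wrapper; it is not how Pydantic-AI wires hooks.

```python
from typing import Callable, Dict, Iterable

REQUIRED_HOOKS = ("safety", "rate_limit")  # hypothetical hook names

class MissingHookError(RuntimeError):
    """Raised when a request would run without a required hook loaded."""

def run_with_hooks(
    loaded_hooks: Dict[str, Callable[[str], str]],
    handler: Callable[[str], str],
    request: str,
    required: Iterable[str] = REQUIRED_HOOKS,
) -> str:
    required = tuple(required)
    # Fail closed: check for every required hook before doing any work.
    missing = [name for name in required if name not in loaded_hooks]
    if missing:
        raise MissingHookError(f"refusing to run without hooks: {missing}")
    for name in required:
        request = loaded_hooks[name](request)
    return handler(request)
```

A test suite can then call this with each deferred configuration you support and assert that a missing safety hook raises instead of silently passing the request through.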

Third, model settings. Deferred model configuration means the same agent can, within a session, switch from a small fast model to a large slow one based on the request. This is great for cost. It is terrible for reproducibility. If a user asks the same question twice and gets two different models, you owe them an explanation, or at least your support team does. Decide your policy before you ship.
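One possible policy, and only one, is to pin the first model chosen in a session so repeat questions hit the same model. The `SessionModelPolicy` class and the model labels below are hypothetical.

```python
from typing import Dict

class SessionModelPolicy:
    """Pins the first model chosen in a session; later proposals are ignored."""

    def __init__(self) -> None:
        self._pinned: Dict[str, str] = {}

    def model_for(self, session_id: str, proposed: str) -> str:
        # setdefault stores the proposal only if the session has no pin yet.
        return self._pinned.setdefault(session_id, proposed)

policy = SessionModelPolicy()
first = policy.model_for("session-1", "small-fast")
second = policy.model_for("session-1", "large-slow")  # still the pinned model
other = policy.model_for("session-2", "large-slow")   # a new session pins its own
```

Pinning buys reproducibility at the price of some of the cost savings, which is exactly the trade you should decide on purpose rather than discover in a support ticket.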

The pattern across these gotchas: deferred loading shifts work from configuration time to runtime, and runtime is where bugs are expensive. The framework gives you the lever. Pulling it is a decision, not a default.

A broader hedge worth stating: the move resembles what server frameworks went through a decade ago when they shifted from monolithic app servers to per-request handlers. The benefits were real; so were the new failure modes. Agent frameworks are walking the same path, and the same lessons will apply.

The category implication: agents become more like programs, less like products

Step back from the release notes for a moment. Deferred loading is the kind of feature that, in retrospect, marks a turning point even though it looks small at the time. The capability vs. controllability frontier is useful here. Every increase in what an agent can do makes the agent harder to predict. Deferred loading is, in part, a response to that: by narrowing what the agent loads to what the request actually needs, you reduce the controllability problem to the surface area the user touched.

The category implication is that agents are drifting from 'product' (a thing you configure and use) toward 'program' (a thing you compose at runtime). Products have versions. Programs have behaviors. The OpenClaw or Hermes mental model, where you pick an agent off a shelf and deploy it, still works for simple cases. But for anything serious, the shelf metaphor is breaking down. What you're deploying is closer to a runtime than to a static configuration.

This has consequences for the vendor landscape. If agents are runtimes, the value migrates to whoever controls the runtime, not whoever ships the best preconfigured agent. The Harness Hypothesis applies: the value isn't in the model and increasingly isn't in the agent either; it's in the harness that decides what to load and when. Pydantic-AI just shipped a piece of that harness. Other frameworks will follow, because the user pressure is universal.

For the operator, the practical takeaway is to stop thinking of your agent stack as a set of finished products and start thinking of it as a set of building blocks that get composed per-request. That mental shift is the hard part. Once you've made it, the v1.105.0 release reads less like a Pydantic-AI feature and more like a category memo.

One more hedge: this could be wrong. It's possible that deferred loading remains a niche optimization used by performance-sensitive teams and the monolithic agent model persists for everyone else. The pattern across this week's releases suggests otherwise, but a single week is a single week. What's certain is that the question has been opened. How you answer it for your own stack is the next decision worth making carefully.

Key Takeaways

Pydantic-AI v1.105.0's deferred-loading feature lets agents load instructions, tools, model settings, and hooks at request time instead of startup, shifting agents from static products toward runtime compositions.
The same week, Agno, Phoenix, and LangGraph shipped releases pointing in the same direction: more models, dynamic per-context tooling, and runtime-composed remote graphs. The pressure is industry-wide, not vendor-specific.
Deferred loading reduces unconditional cost but increases conditional cost variance, complicates observability, and expands the attack surface around intent classification. The framework gives you the lever; you design the safety.
Migration order matters: start with agents that have clear separable modes, not your biggest most-complex one. Deferred hooks are missing hooks unless you test for their presence on every code path.
The category is drifting from 'agent as configured product' to 'agent as composed runtime.' Value accrues to whoever owns the harness that decides what to load and when.

Sources for this article

12 collected in pack · 6 cited & verified in body

This is the full source pack collected for the story, the pool the writer cites from, which is why the pack count can exceed the citations in the body. Tier labels reflect domain authority; freshness is re-checked daily. How each load-bearing claim bound to this pack is itemized in the claims panel below.

Release v2.6.11 · agno-agi/agno · github.com · Community
Release e2b@2.27.1 · e2b-dev/E2B · github.com · Community
Release v3.178.0 · langfuse/langfuse · github.com · Community
Release @ai-sdk/amazon-bedrock@4.0.112 · vercel/ai · github.com · Community
Release v2.6.10 · agno-agi/agno · github.com · Community
Release v1.105.0 (2026-06-02) · pydantic/pydantic-ai · github.com · Community
Release v2.1.160 · anthropics/claude-code · github.com · Community
Release 1.34.2 · google/adk-python · github.com · Community
Release arize-phoenix: v16.5.0 · Arize-ai/phoenix · github.com · Community
Release langgraph==1.2.3 · langchain-ai/langgraph · github.com · Community
Release langgraph-sdk==0.4.2 · langchain-ai/langgraph · github.com · Community
Release arize-phoenix: v16.4.0 · Arize-ai/phoenix · github.com · Community

Load-bearing claims

The writer flagged these claims as load-bearing. Where a cited source supports the claim, the row links out to it; confidence labels reflect how directly the source backs the assertion. We surface unverified claims honestly rather than hide them.

7 confirmed · 3 analysis

0/7 bound to a pack source

Confirmed
Pydantic-AI v1.105.0 shipped on June 2 with on-demand (deferred loading) capabilities covering instructions, tools, model settings, and hooks.
No matching pack item — claim recorded but not bound to a source.
Confirmed
The deferred-loading release also included Grok 4.3 reasoning_effort support alongside the headline feature.
No matching pack item — claim recorded but not bound to a source.
Confirmed
Agno v2.6.10 added four new model providers in a single release: Inception Labs, Xiaomi MiMo, MiniMax M2.7, and Cloudflare AI Gateway.
No matching pack item — claim recorded but not bound to a source.
Confirmed
Agno v2.6.11 added parallel web Task API and Monitor API tools plus a per-entity Manifest for AgentOS UI metadata.
No matching pack item — claim recorded but not bound to a source.
Confirmed
Phoenix v16.5.0 added rewind/fork/copy controls for chat messages and an annotate-spans skill for the platform's agent.
No matching pack item — claim recorded but not bound to a source.
Confirmed
Phoenix v16.4.0 made PXI quick actions dynamic per page context and added no-data chart overlays.
No matching pack item — claim recorded but not bound to a source.
Confirmed
LangGraph 1.2.3 wired RemoteGraph.interleave to sdk-py interleave_projections and added v3 streaming support to RemoteGraph.
No matching pack item — claim recorded but not bound to a source.
Analysis
The shape of recent framework releases implies agents are becoming runtime compositions rather than startup-configured objects.
Analysis
Deferred hooks introduce a class of bug where safety, logging, or rate-limiting can fail to load before they're needed.
Analysis
The category is shifting value from preconfigured agents toward the harness that decides what to load and when.