Computer Scientist. Principal Architect at Salesforce AI Applications Platform. ECMA TC39 Member, Former Editor of ECMA402. Yahoo! Alumni & Former YUI Core Team
i've been posting about what agent-composed UI looks like, but you have to see it and feel it in practice. it's very much experimental, but give it a shot and tell me what you see.
the setup: you take any MCP server you already have wired up in your host (a catalog, a CRM connector, a ticketing system, whatever returns structured data) and you add `@mcp-lens/server` alongside it as an npx-launched connector. this gives the agent a tool for composing visual responses and some lightweight guidance on how to use it. that's it. you don't change anything about your existing server. no new code, no widget authoring, no renderer to build.
this is the part worth experiencing firsthand. ask the same questions you normally would, and let the agent pull data from your existing server the same way it always did. if it answers in plain text, tell it "use show_lens when possible" and it picks up the pattern for the rest of the session. from there it should compose a bounded visual response sized for the conversational moment. a card when you asked about one thing. a list with per-row buttons when you're browsing. a comparison table when you're choosing between two. next-turn affordances on every response so you advance by clicking instead of typing. and it shifts shape turn by turn because the agent is reading the full conversation and deciding what fits each moment.
that's it, just npx it into your config `{ "mcpServers": { "lens": { "command": "npx", "args": ["-y", "@mcp-lens/server"] } } }` next to something you already use and see what comes out.
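if it helps to see "next to something you already use" spelled out, here's a minimal sketch. only the `lens` entry comes from the config above; the `catalog` server, its package name, and the `HostConfig` type are made up for illustration, and your host's config file may have more fields than this.

```typescript
// hypothetical shape of a host config; real hosts may carry more fields
type ServerEntry = { command: string; args: string[] };
type HostConfig = { mcpServers: Record<string, ServerEntry> };

// the lens entry, exactly as given in the post
const lensEntry: ServerEntry = { command: "npx", args: ["-y", "@mcp-lens/server"] };

// add lens next to whatever is already configured, leaving existing entries untouched
function addLens(config: HostConfig): HostConfig {
  return { mcpServers: { ...config.mcpServers, lens: lensEntry } };
}

// "catalog" and "example-catalog-server" are placeholders, not real packages
const existing: HostConfig = {
  mcpServers: {
    catalog: { command: "npx", args: ["-y", "example-catalog-server"] },
  },
};

const merged = addLens(existing);
console.log(JSON.stringify(merged, null, 2));
```

the existing server entry is carried over as-is, which is the whole point: nothing about it changes.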
i think #MCPUI is going to end up serving a much smaller role than its current positioning suggests. not because it doesn't work, it does, but because the class of conversational moments that genuinely need a custom, tool-authored widget is narrow, and it's about to get narrower.
the majority of what users ask about in agent conversations is structured data: records, lists, comparisons, statuses, confirmations. these are well served by a constrained vocabulary the superagent composes per turn. you don't need a custom widget for "show me the account" or "compare these two incidents" or "what's the status of my order." you need the agent to know what moment the user is in and compose the right bounded affordances from that data. that's what Visualizers and agent-composed approaches do, and they do it without the tool author shipping anything beyond the data itself.
where #MCPUI still earns its place: genuinely interactive experiences that go beyond what bounded affordances can express. complex data visualizations with pan/zoom, custom editors, media players, anything that needs persistent client-side state within a single turn. those exist. they're real use cases. they're also not the common case in most conversational workflows.
i think within a year, agent-composed UI from structured output becomes the default path for most conversational moments, and #MCPUI becomes the escape hatch you reach for when the constrained vocabulary isn't enough. useful, supported, available, but not the primary way most tool responses end up rendered.
here's a small example that exposes a structural limit of #MCPUI widgets. the user says "i don't like emojis" partway through a conversation, or "stop showing me pricing," or "i prefer compact views." from that point forward, every response should honor that preference.
if the response is plain text, the agent just adapts. it heard the preference, it remembers it, it writes differently from now on. done. if the response is an agent-composed UI from a constrained vocabulary, same thing — the agent composes the next view without emojis, without pricing, more compact. it already has the preference in context; nothing else has to change.
but if the response is a pre-built MCPUI widget, that preference has nowhere to go. the widget was authored months ago by someone who never anticipated "i don't like emojis" as a runtime condition. the widget renders what it renders. the user said something, and the UI can't hear it.
the escape hatches people reach for: maybe add an input parameter to the tool ("includeEmojis: false") so the widget can read it. now multiply that by every preference a user might express across a session — compact views, hide certain fields, language tone, level of detail — and the tool's input schema becomes a growing bag of display config that has nothing to do with the tool's actual job. or maybe the superagent passes all accumulated preferences into every tool call as extra context, and the widget reads them from the args. now the superagent has to maintain and serialize a preference blob on every single call, and the widget has to parse and honor an open-ended set of display rules it was never designed for. both escape hatches work in theory. both add complexity that compounds with every new preference the user might express.
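a rough sketch of the first escape hatch, to make the "growing bag of display config" concrete. `includeEmojis` is the example from above; the tool, the other flags, and the helper are hypothetical.

```typescript
// hypothetical tool input: `orderId` is the tool's actual job, the rest is display config
interface OrderStatusInput {
  orderId: string;
  includeEmojis?: boolean; // "i don't like emojis"
  showPricing?: boolean;   // "stop showing me pricing"
  compact?: boolean;       // "i prefer compact views"
}

// keys that belong to the tool's real job (an assumption for this sketch)
const JOB_KEYS = new Set(["orderId"]);

// list the arguments that exist only to steer rendering
function displayOnlyKeys(input: object): string[] {
  return Object.keys(input).filter((key) => !JOB_KEYS.has(key));
}

// widget path: every voiced preference has to ride along on the call
const widgetCall: OrderStatusInput = {
  orderId: "A-1001",
  includeEmojis: false,
  showPricing: false,
  compact: true,
};

// agent-composed path: the call stays about the data
const agentComposedCall: OrderStatusInput = { orderId: "A-1001" };

console.log(displayOnlyKeys(widgetCall));        // ["includeEmojis", "showPricing", "compact"]
console.log(displayOnlyKeys(agentComposedCall)); // []
```

three preferences in, three extra parameters. the second escape hatch (a preference blob) just moves the same list into one untyped argument.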
the architecture where the superagent composes the UI per turn doesn't face this. the agent heard "i don't like emojis" the same way it heard everything else in the conversation. the next time it composes a view, it leaves out the emojis. no config parameter, no preference blob, no widget modification. the preference is just part of the context the agent is already reading when it decides what to render.
i think the cheapest, highest-leverage UX primitive in conversational agents is a small set of bounded affordances at the end of every meaningful response. two to four next-turn buttons, sized for the moment, and the user advances by clicking instead of typing.
mechanically, this is the same job the model is already doing for text. every text response is the model predicting what comes next given the conversation context. bounded affordances are the same prediction, just rendered as a few clickable options instead of a paragraph. the model already knows what the user is likely to want next, it demonstrates that by writing coherent follow-up sentences. making those predictions explicit and clickable is almost free in terms of what the model has to do differently; it's just a different output format for the same underlying work.
the missed-turn failure mode is when the agent answers about structured data in plain prose and sends the user back to the keyboard. the agent had a chance to predict the next two or three moves and make them one click away. instead it wrote a paragraph and left the user to figure out what to type next. multiply that across a session and you get a UX that feels like a search bar dressed up as a chat, technically conversational, practically stalled after every answer.
the rule i've ended up with: if the response was about a specific thing, end with two to four follow-ups that name what the user might do next with that thing. never zero, rarely one. anything more starts feeling like a navigation menu, which is a different shape for a different medium.
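the rule is simple enough to write down as a check. this is only a sketch: the `FollowUp` shape and the verdict names are mine, not part of any spec.

```typescript
// hypothetical shape for a next-turn affordance
type FollowUp = { label: string; prompt: string };
type Verdict = "missing" | "thin" | "ok" | "menu";

function checkFollowUps(followUps: FollowUp[]): Verdict {
  const count = followUps.length;
  if (count === 0) return "missing"; // never zero
  if (count === 1) return "thin";    // rarely one: allowed, but worth a second look
  if (count <= 4) return "ok";       // two to four
  return "menu";                     // more than four starts to feel like navigation
}

// follow-ups that name what the user might do next with the thing just shown
const afterOrderStatus: FollowUp[] = [
  { label: "Track shipment", prompt: "track the shipment for this order" },
  { label: "Change address", prompt: "change the delivery address for this order" },
  { label: "Cancel order", prompt: "cancel this order" },
];

console.log(checkFollowUps(afterOrderStatus)); // "ok"
```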
i've been thinking about what conversational UX does to frontend craft, and the shift i keep landing on is from designing widgets to designing vocabularies. fewer pixels, more structural decisions. the work moves up a level.
what that looks like concretely: instead of choosing a shape for "the order detail page," you choose which node types belong in the spec, which conversational moments earn their own affordance set on a given domain, and what precedent the agent should read to compose the moment well. the agent does the per-turn composition. the vocabulary and the precedent are still authored, but they're authored once and then reused across every moment that fits. the leverage is much higher per hour of design than building one bespoke widget per data shape.
the part of this that keeps me genuinely excited (not in a marketing way, in a "this is interesting work" way) is that it's a real frontend discipline. it's not prompt engineering, it's not back-end work, it's not visual design exactly. it's grammar design. choosing a small set of structural primitives and a precedent library that lets a constrained agent produce something specific to each moment without going off-script. less component, more grammar. less SPA, more conversation. that's where i think the craft is heading IMO.
the test i keep applying to #MCPUI widgets is whether i can ask the agent to do the same thing by typing. often they fail it, and i think that failure is the clearest signal that the widget has stopped being part of the conversation.
here's why it matters. a button that fires a tool call directly bypasses the agent. the user can click it but can't say it. that means the agent is no longer the canonical way to drive the surface; the widget is. the conversation is now a thin shell wrapped around a UI, instead of a UI nested inside a conversation. once that's true, the button isn't an affordance the agent offered, it's a side door. and side doors don't have typeable equivalents because the agent doesn't know how to call that MCP tool with the right values to carry on the same action.
the fix is simple in shape, harder in practice: every button on a widget should emit a prompt to the agent, never a direct tool call. clicking should feel like the user typed it. the agent processes the prompt the same way regardless of whether it came from the keyboard or from a click. I call this a reflective widget: every action it offers, the user could have asked for in words. and that's what keeps it inside the conversation instead of replacing it.
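here's roughly what i mean by reflective, as a sketch. the types and the `send` channel are hypothetical; a real host exposes its own bridge for getting a prompt back to the agent.

```typescript
// a reflective action carries only a prompt the user could have typed;
// there is deliberately no "call this tool with these args" variant
type ReflectiveAction = { label: string; prompt: string };

// hypothetical channel back to the agent
type SendToAgent = (prompt: string) => void;

// clicking sends the action's prompt, nothing else
function click(action: ReflectiveAction, send: SendToAgent): void {
  send(action.prompt);
}

// typing sends whatever the user wrote
function typeMessage(text: string, send: SendToAgent): void {
  send(text);
}

// stand-in for the agent's input queue
const inbox: string[] = [];
const send: SendToAgent = (prompt) => {
  inbox.push(prompt);
};

const cancel: ReflectiveAction = { label: "Cancel order", prompt: "cancel this order" };
click(cancel, send);
typeMessage("cancel this order", send);

// the agent can't tell the click from the keystrokes
console.log(inbox[0] === inbox[1]); // true
```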
I feel that conversational #UX is converging on a single architecture for inline UI: the superagent itself composes the view per turn from a constrained vocabulary, not the tool author and not a secondary agent. that convergence has a structural reason, and once i saw why, i couldn't unsee it.
predicting next-token-given-context is what these models are already best at. they do it in plain text on every turn. that's the entire job. a constrained vocabulary, a small set of node types and shapes the model can compose, lets them do that exact same job to produce a nicer UI that is easier for a human to interpret and interact with. they're not generating novel UI artifacts; they're predicting which next-turn affordances fit this moment (via actionable buttons), given everything the conversation has been about. the constraint is what makes the output bounded; the model's existing strength is what makes the fit right.
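to make "a small set of node types" less abstract, here's a toy vocabulary. every name in it is hypothetical; it is not Visualizer's schema or anyone else's, just the general shape of the idea.

```typescript
// hypothetical node types for a constrained vocabulary
type Button = { type: "button"; label: string; prompt: string };
type Card = { type: "card"; title: string; fields: Record<string, string>; actions: Button[] };
type List = { type: "list"; rows: { title: string; actions: Button[] }[] };
type Comparison = {
  type: "comparison";
  columns: [string, string];
  rows: { field: string; values: [string, string] }[];
  actions: Button[];
};
type View = Card | List | Comparison;

const ALLOWED_TYPES = new Set(["card", "list", "comparison"]);

// model output arrives as untyped JSON; this is a shallow check on the type tag only,
// a real renderer would validate the full shape
function hasAllowedType(value: unknown): boolean {
  if (typeof value !== "object" || value === null) return false;
  const tag = (value as { type?: unknown }).type;
  return typeof tag === "string" && ALLOWED_TYPES.has(tag);
}

const composed: View = {
  type: "card",
  title: "Order A-1001",
  fields: { status: "shipped" },
  actions: [
    { type: "button", label: "Track shipment", prompt: "track this shipment" },
    { type: "button", label: "Cancel order", prompt: "cancel this order" },
  ],
};

console.log(hasAllowedType(composed));           // true
console.log(hasAllowedType({ type: "iframe" })); // false
```

the model picks among a handful of shapes; anything outside the set never reaches the renderer.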
anthropic's Visualizer is the visible early example of this. i'd bet the other superagents ship the same shape within the year. the architecture converges here not because the spec authors are coordinating, but because the architecture plays to what these models are already good at. the alternatives (tool-author shipping #MCPUI pre-composed widgets, secondary agent generating UI without conversation context) ask the system to do something it's structurally weaker at. composing without context is harder than predicting from context. so the design space sorts itself out.
when I put on my #i18n hat and look at #MCPUI, i wonder how inline widgets in conversational agents are supposed to handle non-english users, and the more i look at it the more i think nobody has actually answered the obvious questions.
the part that makes this interesting is what widgets are doing under the hood. when a user clicks a button, makes a selection, or otherwise interacts with one of these inline widgets, the widget must inform the superagent with a sentence describing what happened, and the agent processes that sentence as if the user had typed it, otherwise you can't have a coherent conversation. that's how a click becomes a turn in the conversation. so a widget isn't just a localized rendering problem, it's also a generator of natural-language inputs that the agent has to read.
the obvious questions: the spec carries a locale field through the iframe handshake and/or document.documentElement.lang, so the widget can know what language to render in. fine. but that's the locale of the host app, not necessarily the language of the conversation, most of the time those are the same, but it's a quiet assumption. and what about the prompt the widget sends back to the agent when the user clicks something? it can technically be in any language; the agent will probably understand english prompts inside a spanish conversation. but should it? if the widget renders in spanish and informs the agent in english about interactions because the author only wrote english ones, the conversation gets a strange split-language seam. and what about the case where the widget can't render in the user's language at all because the author didn't translate it? no documented fallback, no convention, no guardrail. the answer is: whatever the widget author decided, if they decided anything.
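a small sketch of how the split-language seam shows up in widget code. the tables and the english fallback are assumptions standing in for "whatever the widget author decided"; nothing here is prescribed by the spec.

```typescript
// what a hypothetical widget author shipped: labels in two languages, prompts in one
const LABELS = new Map<string, string>([
  ["en", "Track order"],
  ["es", "Rastrear pedido"],
]);
const PROMPTS = new Map<string, string>([["en", "track this order"]]);

// look up a string for the locale, falling back to english (the author's choice, not a convention)
function pick(table: Map<string, string>, locale: string): { lang: string; text: string } {
  const hit = table.get(locale);
  if (hit !== undefined) return { lang: locale, text: hit };
  return { lang: "en", text: table.get("en") ?? "" };
}

function renderButton(hostLocale: string): { label: string; prompt: string; splitLanguage: boolean } {
  const label = pick(LABELS, hostLocale);
  const prompt = pick(PROMPTS, hostLocale);
  // the seam: the user reads one language, the agent is told about the click in another
  return { label: label.text, prompt: prompt.text, splitLanguage: label.lang !== prompt.lang };
}

console.log(renderButton("es")); // spanish label, english prompt, splitLanguage: true
console.log(renderButton("fr")); // untranslated: everything silently falls back to english
```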
it's worth pausing on how well plain text responses already handle this. when the agent answers in prose, it reads the conversation, picks the language the user has been speaking, infers the right framing, and writes back. coherent end to end. no translation tables, no locale detection logic, no split-language seams. the entity producing the response is the same entity reading the conversation, so localization happens for free. widgets break this loop by inserting a third author (the widget code, written months ago, by someone who never saw this conversation) in the middle.
the contrast worth drawing is with the architecture where the superagent composes the UI per turn from a constrained vocabulary. the agent has been conversing with the user in their language the entire session. when it composes a button, both the label and the prompt come out in that language naturally. no translation tables, no locale detection, no fallback strategy when the widget doesn't speak the user's language. localization isn't a feature you bolt on, it's a property the architecture has for free when the entity composing the UI is the same entity that has the conversation context.
something that took me a while to internalize: in conversational UX, the right unit of design isn't the data shape, it's the moment. the same data drives different views depending on what the user is actually doing in that turn. "i just asked about this one thing" wants a single anchor with a few next-turn affordances. "i'm browsing a few" wants a list with per-row buttons. "i'm choosing between two" wants a comparison. "i want to confirm" wants a tight dialog. same data, different artifacts.
the consequence is bigger than it looks. the data the superagent fetches to compose each one is also different. the moment doesn't just shape the layout, it shapes the query. a confirmation needs a quantity and a consequence; a browse needs a small page of items; a comparison needs the two specific records the user named, plus the few fields that actually matter for choosing between them. a single widget per data shape — a la #MCPUI — can't handle this flexibility, because the moment isn't in the data. the superagent has to read the conversation to know what to fetch and how to compose it.
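one way to sketch "the moment shapes the query" is a discriminated union, where each moment carries what it needs and maps to both a view and a fetch. the moment names, fields, and fetch descriptions are all hypothetical.

```typescript
// hypothetical moments, each carrying only what that moment needs
type Moment =
  | { kind: "single"; id: string }
  | { kind: "browse"; pageSize: number }
  | { kind: "compare"; ids: [string, string]; fields: string[] }
  | { kind: "confirm"; id: string; quantity: number; consequence: string };

type Plan = { view: "card" | "list" | "comparison" | "dialog"; fetch: string };

// the same underlying data, a different view and a different query per moment
function planFor(moment: Moment): Plan {
  switch (moment.kind) {
    case "single":
      return { view: "card", fetch: `record ${moment.id}` };
    case "browse":
      return { view: "list", fetch: `first ${moment.pageSize} items` };
    case "compare":
      return {
        view: "comparison",
        fetch: `records ${moment.ids[0]} and ${moment.ids[1]}, fields: ${moment.fields.join(", ")}`,
      };
    case "confirm":
      return {
        view: "dialog",
        fetch: `record ${moment.id}, quantity ${moment.quantity}, consequence: ${moment.consequence}`,
      };
  }
}

console.log(planFor({ kind: "browse", pageSize: 5 }));
console.log(planFor({ kind: "compare", ids: ["INC-1", "INC-2"], fields: ["severity", "owner"] }));
```

note that `Moment` is something only a reader of the conversation can produce; it isn't derivable from the records themselves.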
that's where the work goes IMO. not in shipping more tools with widgets, in modeling more conversational moments well.
building widgets on #MCPUI today is rough, and i want to be precise about why because it's easy to lump it together with something it isn't. plain #MCP is fine. the protocol, the SDK, mcp-inspector, the stdio transport, server-side logs, all of that works. you can build and debug a regular MCP server today without much friction. the roughness is in MCP UI specifically, the iframe + bridge + widget layer.
what makes that part rough is partly the spec drift between hosts (different bridge surfaces, validators that disagree with the spec doc, opaque host-side errors), but it's more than that. superapps like Claude and Codex don't ship a developer mode. there's no way that I have found to reliably inspect the iframe's DOM, console, or postMessage traffic the way you would in a browser. when something goes wrong inside your widget, all you have is whatever logs the app decides to surface, which is barely anything. you end up writing your own console-log breadcrumbs into the iframe and hoping the host pipes them somewhere visible.
the second-order observation is what i find interesting. if you can't realistically debug a custom widget in production, building one yourself is a much worse bet than it looks at first. the rational move is to not own the widget at all. let the superagent compose from a constrained, vetted vocabulary the host already understands and renders. the agent has the context to compose well, the host has the rendering pipeline, and you don't end up debugging an opaque iframe in a host that won't tell you why it broke. the DX gap isn't just an inconvenience, it's another reason the architecture is sorting itself toward thin layers and agent-composed UI rather than rich self-built widgets.
to this point, left: roughly what a record page looks like when the tool decides without conversation context. a single record, full schema dump, tabs, related lists. right: a different kind of artifact. multiple data points pulled together for the moment, an anchor, the fields that matter, and a small set of next-turn affordances. only the superagent has the right context to compose this.
the question conversational UX is actually asking is who composes the UI in a conversation. three answers are in flight: the tool author at build time (current MCPUI), a secondary agent at request time (agent as a tool), or the superagent itself (mcp client), per turn, from a constrained vocabulary. the first two are structurally wrong. only the superagent has the conversation context to know what to render. IMO that's where this is going.
the second problem with MCPUI is structural and not containable by discipline. even if every widget is perfectly chat-shaped, one widget per tool can't serve a user who is asking all kinds of questions about the same data. the user might be browsing, comparing, choosing, confirming, drilling into one aspect, all in different turns. the widget doesn't know which. and the superagent, the only thing in the loop with full conversation context, has no role in shaping the response. it calls a tool, gets a static surface back, and the agent's understanding of what the user is actually trying to do is wasted at exactly the layer where dynamism has to come from.
the first trap i've seen frontend devs walk into with MCPUI is reaching for SPA patterns. tabs, persistent state, navigation menus, dashboards. it lands as a familiar shape and feels productive for a few widgets. then you look up and you're building an SPA inside a chat surface. the unit of design in a conversation is a turn, not a page. this trap is real but perhaps it's more manageable with conventions and constrained widgets. it's a discipline problem.
A lot of people think that vibe coding tools are going to let a lot of non-coders write software.
That might happen, but the more interesting story is that these tools will turn a lot of non-coders into badass coders _because_ they could write software.
🤔What happens when you put @marksammiller and @ESYudkowsky in the same room?
A deep dive into AI, existential risk, and whether alignment is even possible!
Thank you to @foresightinst for the video ↓
@tobi @PerplexityComet browsers are the superagents of the future! mcp client + local secure model to support that mcp client, they will be able to do wonders!
@slicknet I call this technique “instruction based tools”! the challenge is that different models behave differently, and there is no way to tell whether or not it is going to have the intended result! Nevertheless, this is extremely powerful!
@dalmaer maybe if you control the full stack and every agent on it you can get a good enough system (i see some similarities with microservices in that sense), but they are no match for a super agent IMO