Well-written piece! Makes me think of Tupaia, the Polynesian navigator who drew Cook a chart of the Pacific in 1769. Europeans spent two centuries misreading it, not because it was crude, but because it wasn't oriented to a pole. Turns out it's several local frames stitched together, each one anchored at a different island, bearings measured from wherever you set out.
Every multi-agent system claiming "one world model" has a version of this. Classical SLAM mostly handles it: rigid transform, loop closure, done. The harder case is the one showing up with learned world models: two agents whose representations live in neural latent spaces that don't line up in any canonical way.
Shared backbone ≠ shared world. There's a translation tax coming. Most people are pricing it at zero.
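For the classical case, the "rigid transform" step can be sketched in a few lines. This is a toy 2D version, assuming both agents have already matched the same landmarks one-to-one; the function names `fit_rigid_2d` and `apply_rigid_2d` are made up for illustration, and real SLAM pipelines add outlier rejection and loop closure on top.

```python
import math

def fit_rigid_2d(src, dst):
    """Least-squares rotation + translation that maps src points onto dst points.

    src and dst are equal-length lists of (x, y) tuples, matched by index.
    Returns (theta, (tx, ty)) with theta in radians.
    """
    n = len(src)
    # Centroids of each agent's landmark set.
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = 0.0
    cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy   # centered source point
        bx, by = qx - dx, qy - dy   # centered target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    # Angle that maximizes the alignment between the centered point sets.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that carries the rotated source centroid onto the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, (tx, ty)

def apply_rigid_2d(theta, t, p):
    """Rotate point p by theta, then translate by t."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + t[0], s * p[0] + c * p[1] + t[1])

# Agent A's landmarks, and the same landmarks as agent B sees them
# (rotated 90 degrees counterclockwise, then shifted by (2, 3)).
frame_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frame_b = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]
theta, t = fit_rigid_2d(frame_a, frame_b)
```

One angle and one offset is all it takes here. Two learned latent spaces generally come with no such guarantee that a rigid map exists, which is where the tax comes from.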
Actually, radiologists have done this for a century. A CT isn't inherently viewable; you pick a window (bone, lung, brain) and the same data reveals different structures. Feels like prompts are becoming the windows.
Makes me wonder if we'll re-derive windowing, multi-planar reconstruction, and structured reporting from scratch, or eventually borrow from how DICOM already thinks about this.
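For anyone who hasn't seen windowing, here is a minimal sketch of the idea. It uses a plain linear ramp rather than the exact DICOM formula, and the preset numbers in `WINDOWS` are typical textbook-style values I'm assuming for illustration (they vary by site and scanner). `apply_window` is a made-up helper name.

```python
def apply_window(hu, center, width):
    """Map one Hounsfield-unit value to a 0-255 gray level for a given window.

    Values below the window clip to black, values above clip to white,
    and everything in between is scaled linearly.
    """
    low = center - width / 2
    high = center + width / 2
    if hu <= low:
        return 0
    if hu >= high:
        return 255
    return round((hu - low) / (high - low) * 255)

# Assumed illustrative presets: name -> (window center, window width) in HU.
WINDOWS = {
    "bone": (400, 1800),
    "lung": (-600, 1500),
    "brain": (40, 80),
}

# The same voxel (roughly soft tissue at 30 HU) rendered through each window.
voxel_hu = 30
views = {name: apply_window(voxel_hu, c, w) for name, (c, w) in WINDOWS.items()}
```

The stored data never changes; only the mapping to what you look at does. One voxel comes out mid-gray in the brain window, near-white in the lung window, and dark in the bone window.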
AGI ≠ Multimodal
"We dissect nature along lines laid down by our native languages. The categories we isolate from the world of phenomena we do not find there because they stare every observer in the face; on the contrary, the world is presented in a kaleidoscopic flux of impressions which has to be organized by our minds." β Benjamin Lee Whorf, Science and Linguistics, MIT Technology Review, 1940
Whorf was writing about language. But he was pointing at an older mistake: we confuse the way we carve up the world with the structure of the world itself. AI is making that mistake again.
Today's frontier models (GPT, Claude, Gemini) are built on the same premise: slice the world into text, image, audio; build an encoder for each; stitch them back together. We call this multimodal.
It isn't the structure of intelligence. It's an engineering convenience. That categories are human artifacts, not features of the world, has been shown everywhere outside AI.
Color isn't a property of the universe; it's a projection of light onto a particular nervous system. Birds see four primaries; mantis shrimp have up to sixteen photoreceptor types. "Red" is a category the human eye brings into the world, not one it finds there. Biologist Jakob von Uexküll made the broader point in 1934: every organism lives inside the world its senses carve out. That world is not a subset of reality; it's a construction.
Philosopher Gilbert Ryle gave this mistake a name, the category error: taking a cognitive tool and treating it as the thing it describes. When engineers say "we have a text modality, a vision modality, an audio modality", they're not describing intelligence. They're describing the grammar humans use to talk about it.
Neuroscience has known this for decades. In people born blind, the visual cortex gets recruited by auditory and tactile input. "Visual area" isn't destiny; it's a neutral computational resource shaped by input. Modality differentiation emerges inside an integrated architecture; it isn't hardwired.
And Phillip Isola's team at MIT (the Platonic Representation Hypothesis) presented evidence that large models across different modalities and architectures quietly converge toward a shared representational structure in their deeper layers.
We cut the signal apart on the outside. The model stitches it back together on the inside. The entire multimodal paradigm is doing something absurdly inefficient. So whether AGI is near isn't a benchmark question. It's an older one: are we building an intelligence whose internal structure is no longer a copy of the grammar we use to describe the world?
Whorf saw how humans mistake their own categories for nature's. We are now building a new kind of intelligence. It could have had the chance to carve the world in its own way.
Instead, we are pressing our cuts into its structure, one by one. The question isn't when AI will surpass human intelligence. It's whether it will inherit the oldest mistake in human cognition: mistaking the way we see the world for the world itself.
Emergence Labs was at the Harvard China Education Symposium this weekend.
Our team talked to 100+ people across education, AI, and everything in between.
Huge thanks to everyone who came by the table.
If we talked and didn't finish the conversation, DM us. Let's pick it back up.
We spent 180 days evaluating whether an AI agent can handle 1 day of human work.
We called it AgentIF-OneDay (https://t.co/eEmTiTENxs), an evaluation framework for end-to-end AI agent workflows.
The result? The best agent today can reliably complete only 65%.
180 days × 24 hours. 4,320 hours to verify 1 day. Sounds absurd. But that's where we are.
✅ AI can pass the bar exam
❌ But fails to compare prices across three apps
✅ AI can solve competition math finals
❌ But plans a weekend trip full of holes
✅ AI can outperform 80% of humans on exams
❌ But struggles with real-world complexity
Hard for humans? Easy for AI.
Easy for humans? Hard for AI.
Because passing exams ≠ getting things done.
This time, our team and industry experts decided:
❌ Stop evaluating chatbots in simulated environments
✅ Start evaluating real Agents that users already use
❌ Stop evaluating exam questions with ground truth
✅ Start evaluating open-ended tasks in the real world
We split a human's day into: Life / Study / Work, and split tasks into 3 types:
1️⃣ You know what to do, but execution is tedious
2️⃣ You don't know how, so you can only give a vague reference
3️⃣ You refine while doing, and figure it out as you go
AgentIF-OneDay is just today's starting line:
1️⃣ 180 days → to evaluate 1 day (AgentIF-OneDay)
2️⃣ 30 days → to evaluate 1 week (AgentIF-OneWeek)
3️⃣ 1 day → to evaluate 1 year (AgentIF-OneYear)
End state → AI evaluates itself (AgentIF-Forever?)
That's the day AI truly clocks in.
🤗 Hugging Face: https://t.co/HJ8niHlB5l
💻 GitHub: https://t.co/i2YGdzjnkE
📰 Paper: https://t.co/eEmTiTENxs