Kolento Hou

@KolentoH

Building the New School for Human & AI @EmergencesLabs | Research in humanity @nyuniversity

Joined November 2023

166 Following

37 Followers

12 Posts

KolentoH retweeted

Fangfu Liu

@fangfu0830

14 days ago

🔥 We release Gamma-World from @nvidia, a generative multi-agent world model that finally goes beyond 2 players.

⚡ 24 FPS real-time streaming
🧩 Simplex Rotary Agent Encoding: permutation-symmetric
🌐 Sparse Hub Attention: O(N²) → O(N)
🎯 2 → more players, zero-shot
🤖 Games → real multi-robot worlds

━━━━━━━━━━━━━━━
💥 THE SINGLE-AGENT ERA IS OVER. 💥
━━━━━━━━━━━━━━━

🔗 https://t.co/TuyX2d2XuT

451

312

402K

Kolento Hou

@KolentoH

16 days ago

https://t.co/81GPxljaXQ

23K

Kolento Hou

@KolentoH

29 days ago

Well-written piece! Makes me think of Tupaia, the Polynesian navigator who drew Cook a chart of the Pacific in 1769. Europeans spent two centuries misreading it, not because it was crude, but because it wasn't oriented to a pole. Turns out it's several local frames stitched together, each one anchored at a different island, bearings measured from wherever you set out.

Every multi-agent system claiming "one world model" has a version of this. Classical SLAM mostly handles it: rigid transform, loop closure, done. The harder case is the one showing up with learned world models: two agents whose representations live in neural latent spaces that don't line up in any canonical way.

Shared backbone ≠ shared world. There's a translation tax coming. Most people are pricing it at zero.
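To make the "rigid transform" step concrete, here is a minimal sketch of aligning two local 2D frames from landmarks both agents have observed. It is an illustration only, not taken from the post or from any particular SLAM library: the function names `fit_rigid_2d` and `apply_rigid_2d` are hypothetical, and it assumes known point correspondences and no noise handling or loop closure.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src points onto dst.

    src and dst are equal-length lists of (x, y) pairs; src[i] and dst[i] are
    assumed to be the same landmark seen in two different local frames.
    """
    n = len(src)
    # Centroids of both point sets.
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n

    # Accumulate dot and cross products of the centered points.
    dot = 0.0
    cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy
        bx, by = qx - dx, qy - dy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx

    # Optimal rotation angle, then the translation that maps centroid to centroid.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, (tx, ty)


def apply_rigid_2d(theta, t, p):
    """Rotate point p by theta, then translate by t."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + t[0], s * p[0] + c * p[1] + t[1])


# Three landmarks in frame A, and the same landmarks as seen from frame B
# (frame B here is frame A rotated by 90 degrees and shifted by (3, 4)).
frame_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
frame_b = [(3.0, 4.0), (3.0, 5.0), (1.0, 4.0)]
theta, t = fit_rigid_2d(frame_a, frame_b)
```

Two numbers for the rotation and translation are enough to reconcile the frames here. The learned-latent case in the post is harder precisely because no such low-dimensional transform is guaranteed to exist.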

Kolento Hou

@KolentoH

about 2 months ago

Actually, radiologists have done this for a century. A CT isn't inherently viewable; you pick a window (bone, lung, brain), and the same data reveals different structures. Feels like prompts are becoming the windows. Makes me wonder if we'll re-derive windowing, multi-planar reconstruction, and structured reporting from scratch, or eventually borrow from how DICOM already thinks about this.
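For readers unfamiliar with windowing, a minimal sketch of the idea: a window is a center and a width in Hounsfield units (HU), and values inside it are stretched across the display's gray range while everything outside is clipped. The function name `apply_window` and the preset numbers below are illustrative assumptions; real presets vary by site and scanner, and this is a simple linear mapping rather than the exact formula in the DICOM standard.

```python
def apply_window(hu, center, width):
    """Map one Hounsfield-unit value to an 8-bit gray level (0 to 255).

    Values at or below the window's lower edge become 0, values at or above
    its upper edge become 255, and values in between are scaled linearly.
    """
    low = center - width / 2
    high = center + width / 2
    if hu <= low:
        return 0
    if hu >= high:
        return 255
    return int((hu - low) / (high - low) * 255)


# Illustrative (center, width) presets in HU; assumed values, not a standard.
WINDOWS = {
    "bone": (400, 1800),
    "lung": (-600, 1500),
    "brain": (40, 80),
}

# The same voxel value rendered under each window.
voxel_hu = -700  # roughly the range of aerated lung tissue
rendered = {name: apply_window(voxel_hu, c, w) for name, (c, w) in WINDOWS.items()}
```

The same stored value comes out mid-gray under the lung window and black under the brain window, which is the point of the analogy: the data does not change, only the view onto it.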

324

Kolento Hou

@KolentoH

about 2 months ago

AGI ≠ Multimodal

"We dissect nature along lines laid down by our native languages. The categories we isolate from the world of phenomena we do not find there because they stare every observer in the face; on the contrary, the world is presented in a kaleidoscopic flux of impressions which has to be organized by our minds." (Benjamin Lee Whorf, Science and Linguistics, MIT Technology Review, 1940)

Whorf was writing about language. But he was pointing at an older mistake: we confuse the way we carve up the world with the structure of the world itself. AI is making that mistake again.

Today's frontier models (GPT, Claude, Gemini) are built on the same premise: slice the world into text, image, audio; build an encoder for each; stitch them back together. We call this multimodal.

It isn't the structure of intelligence. It's an engineering convenience. That categories are human artifacts, not features of the world, has been shown everywhere outside AI.

Color isn't a property of the universe; it's a projection of light onto a particular nervous system. Birds see four primaries, mantis shrimp see sixteen. "Red" is a category the human eye brings into the world, not one it finds there. Biologist Jakob von Uexküll made the broader point in 1934: every organism lives inside the world its senses carve out. That world is not a subset of reality; it's a construction.

Philosopher Gilbert Ryle gave the mistake a name: category error. It means taking a cognitive tool and treating it as the thing it describes. When engineers say "we have a text modality, a vision modality, an audio modality", they're not describing intelligence. They're describing the grammar humans use to talk about it.

Neuroscience has known this for decades. In people born blind, the visual cortex gets recruited by auditory and tactile input. "Visual area" isn't destiny; it's a neutral computational resource shaped by input. Modality differentiation emerges inside an integrated architecture; it isn't hardwired.

And Phillip Isola's team at MIT (Platonic Representation Hypothesis) showed that large models across different modalities and architectures quietly converge to a shared representational structure in their deeper layers.

We cut the signal apart on the outside. The model stitches it back together on the inside. The entire multimodal paradigm is doing something absurdly inefficient. So whether AGI is near isn't a benchmark question. It's an older one: are we building an intelligence whose internal structure is no longer a copy of the grammar we use to describe the world?

Whorf saw how humans mistake their own categories for nature's. We are now building a new kind of intelligence. It could have had the chance to carve the world in its own way.

Instead, we are pressing our cuts into its structure, one by one. The question isn't when AI will surpass human intelligence. It's whether it will inherit the oldest mistake in human cognition: mistaking the way we see the world for the world itself.

229

KolentoH retweeted

Emergences Labs

@EmergencesLabs

about 2 months ago

Emergence Labs was at the Harvard China Education Symposium this weekend. Our team talked to 100+ people across education, AI, and everything in between. Huge thanks to everyone who came by the table. If we talked and didn't finish the conversation, DM us. Let's pick it back up.

242

Kolento Hou

@KolentoH

4 months ago

https://t.co/qz2Sh2zAJx

21K

Kolento Hou

@KolentoH

4 months ago

We spent 180 days evaluating whether the AI Agent can handle 1 day of our human work.

We called it AgentIF-OneDay (https://t.co/eEmTiTENxs), the Evaluation Framework for End-to-End AI Agent Workflows.

The result? The best Agent today can only reliably complete 65%.

180 days → 24 hours. 4,320 hours → to verify 1 day. Sounds absurd. But that's where we are.

✅ AI can pass the bar exam
❌ But fails to compare prices across three apps
✅ AI can solve competition math finals
❌ But plans a weekend trip full of holes
✅ AI can outperform 80% of humans on exams
❌ But struggles with real-world complexity

Hard for humans? Easy for AI.
Easy for humans? Hard for AI.
Because passing exams ≠ getting things done.

This time, our team and industry experts decided:
❌ Stop evaluating chatbots in simulated environments
✅ Start evaluating real Agents that users already use
❌ Stop evaluating exam questions with ground truth
✅ Start evaluating open-ended tasks in the real world

We split a human's day into: Life / Study / Work, and split tasks into 3 types:
1️⃣ You know what to do, but execution is tedious
2️⃣ You don't know how, so only give a vague reference
3️⃣ You refine while doing, then figure it out as you go

AgentIF-OneDay is just today's starting line:
1️⃣ 180 days → to evaluate 1 day (AgentIF-OneDay)
2️⃣ 30 days → to evaluate 1 week (AgentIF-OneWeek)
3️⃣ 1 day → to evaluate 1 year (AgentIF-OneYear)
❓ End state → AI evaluates itself (AgentIF-Forever?)

That's the day AI truly clocks in.

🤗 Hugging Face: https://t.co/HJ8niHlB5l
💻 GitHub: https://t.co/i2YGdzjnkE
📰 Paper: https://t.co/eEmTiTENxs

531

Kolento Hou

@KolentoH

4 months ago

https://t.co/BYy1zGN1JY

60K
