Benchmarks place GPT 5.5 as the best model on SWE, but is it the best at making apps end-to-end?
Turns out Opus 4.8 continues to be the king of vibe coding on both price & performance.
Introducing ViBench: the first benchmark for app creation based on real world tasks
4,226 voices have already been recorded across 147 Humans in AI Week events happening around the world this week.
I do not think I would have believed that three years ago.
In early 2023, a small group of us were sitting in a cozy San Francisco apartment trying to make sense of ChatGPT, AGI, and what this all meant for the world.
The technology felt historic. The thing I remember most was the feeling in the room.
People were thinking out loud. Changing their minds. Admitting confusion. Getting excited. Getting scared. Finding language together.
That night became @AICollectiveCo.
Three years later, this community has grown to 250,000+ members, 200+ chapters, 1,700+ events, and 650+ volunteer organizers around the world.
This week, that same instinct is becoming Humans in AI Week: June 1–7, 2026, across 100+ cities and 50+ countries.
A global time capsule for the AI era.
The question is simple:
What does it mean to be human in the AI era?
The answers are messy. That is the point.
Some people are excited. Some are scared. Some feel behind. Some feel superpowered. Most of us are carrying a few contradictory feelings at once.
That is why I still believe so much in rooms full of real people.
Online, AI discourse collapses into extremes. In person, the temperature changes. People listen longer. The anxious parts get named with more care. The optimistic parts become less abstract. You remember there is a person behind every position.
That has always been the magic of this community.
We can gather frontier builders, curious newcomers, artists, students, founders, policy people, educators, and skeptics into the same conversation, then let the room do what the internet usually cannot: slow people down enough to hear each other.
Humans in AI Week feels like the culmination of 3+ years of learning how to create those rooms.
For me personally, this is also a handoff moment. I stepped back from leading The AI Collective because the community had become bigger than any one person, and because AJ, Catherine, and the team were ready to carry it into its next chapter. Watching them turn the original spark into something this global is SO special.
Deeply grateful to @AJs_AI, @catrosemcmillan, our chapter leads, our volunteers, our partners, and every person showing up or adding their voice this week.
This is what we built the community for.
Onwards and upwards!! 🚀
Something I've seen in many Chinese LLMs is they generate sentences with mixed Chinese and English. First time I've seen this in Claude, which makes me think they've been training on (advertently or inadvertently) Chinese model output.
真=real
My addition to this RLM discourse is this reply from Omar in Jan
At the time the paper came out, some of us were questioning whether RLMs were just what harnesses like Claude Code were already doing, but Omar had an important point: they were missing recursion. With Dynamic Workflows, Claude Code finally has built-in recursion, albeit only one level. That's great news for us practitioners because we can now coin new terms like RLM 2.0, GraphRLM, Hybrid RLM, Agentic RLM and so on.
@yi_ding @alex__mackenzie @a1zhang Totally. The PLAN․md pattern + coding environment + recursion are all you need to have a complete RLM!
Turns out that these three pieces together give you an extremely general and strong inference scaling axis for handling what appear to be arbitrarily long prompts. That's all!
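To make the recursion piece concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a hypothetical `stub_model` with a tiny context limit (`MAX_CONTEXT`, an invented number) standing in for a real LLM call. When the input is too long for one call, the function splits it and calls itself on each half, then combines the partial answers. The PLAN.md and coding-environment pieces are left out on purpose.

```python
# Toy illustration of recursive decomposition over a long input.
# Everything here is hypothetical: stub_model stands in for an LLM call,
# and MAX_CONTEXT is an invented limit, not a real model's window.

MAX_CONTEXT = 50  # max characters the stub "model" accepts in one call


def stub_model(task: str, text: str) -> int:
    """Pretend model call: count how often the word `task` appears in a short text."""
    if len(text) > MAX_CONTEXT:
        raise ValueError("input exceeds the stub model's context limit")
    return text.split().count(task)


def recursive_answer(task: str, text: str, depth: int = 0, max_depth: int = 8) -> int:
    """Answer directly if the text fits in one call, otherwise split and recurse."""
    if len(text) <= MAX_CONTEXT:
        return stub_model(task, text)
    if depth >= max_depth:
        raise RecursionError("text could not be split small enough")
    # Split on a word boundary near the middle so no word is cut in half.
    words = text.split()
    mid = len(words) // 2
    left = " ".join(words[:mid])
    right = " ".join(words[mid:])
    # Each half is handled by a fresh recursive call; the results are summed.
    return (
        recursive_answer(task, left, depth + 1, max_depth)
        + recursive_answer(task, right, depth + 1, max_depth)
    )


long_text = " ".join(["hay"] * 100 + ["needle"] + ["hay"] * 100 + ["needle"])
print(recursive_answer("needle", long_text))
```

The point of the sketch is only that no single call ever sees more than `MAX_CONTEXT` characters, yet the whole input gets covered. A real system would let the model decide how to split and what to ask of each sub-call, rather than hard-coding a halving rule.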
@tomek_builds For sure, but if you look at the trajectory of Deep Research we went from "OMG how did they do that?" to "of course ChatGPT has that" in a year or maybe even less?
So the super impressive thing about dynamic workflows that people are sleeping on is that it isn't "deterministic." It's literally just a prompt, albeit a fairly detailed one, teaching the agent to write a graph-like description in Javascript.
A lot of people thought this kind of thing would be possible 3 years ago, but were too early. The feature currently has a lot of manual tuning (in the same way Deep Research did when it was first released), but it's still super impressive to see the dream become a reality.
someone hit me up about the new "claude dynamic workflows" feature, claiming "see, multi-agent works"
But really, the launch of this feature proves the exact point that I made back in June of 2025, along with @walden_yan, @tobi, @karpathy, and many others:
Deterministic workflows orchestrating small agent loops beat non-deterministic multi-agent or "agent soup" systems every dang time
everything is context engineering
For sure, some of that is super speculative/way too early.
But a year or two ago I couldn't even trust an LLM to output a complicated JSON with 90%+ reliability, so if you'd told me that by mid-2026 you could have agents released to the general public write their own multi-agent workflows and actually accomplish a large proportion of the tasks they're given, I'd have laughed in your face.
So yeah, the overall flow/structure can't be changed once the workflow is sent to the tool, but the whole "dynamic" vs. "regular" part is that the actual agents themselves aren't predetermined when Opus decides to build a dynamic workflow.
And of course the calling LLM and user can create more workflows in response to the results, so it's not complete freedom, but it's a lot more degrees of freedom than pretty much anything I've seen work well in prod before.
I remember hearing @karpathy himself say that autonomous agents would, like self-driving cars, take a decade to work at human-like levels. To see the Claude Code team deliver it in 3 years is both mind-blowing and just a testament to the continued exponential trajectory.
Super excited to finally share Dynamic Workflows in Claude Code!!
We built this a couple months ago, and it has slowly become a daily driver for a bunch of people at Anthropic. A few tips for getting the most out of it 🧵
https://t.co/WtwkSd3JPp
It looks like workflows are more powerful than I thought. Dynamic workflows allow workflows to be loops, etc., which is probably why they're defined as JS rather than just a JSON DAG or something.
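A rough way to see the difference between a fixed list of steps and a workflow written as code is the sketch below. It is an illustration only: it is written in Python rather than JavaScript, it does not use or imitate the actual Claude Code workflow API, and `stub_worker` and `stub_reviewer` are hypothetical stand-ins for sub-agents. The fixed plan runs each step once, while the code version can retry until the reviewer approves or a cap is hit.

```python
# Toy comparison: run-each-step-once plan vs. a workflow with a loop.
# stub_worker and stub_reviewer are hypothetical stand-ins for sub-agents.


def stub_worker(attempt: int) -> int:
    """Pretend sub-agent: returns a quality score that improves with each attempt."""
    return 40 + 20 * attempt


def stub_reviewer(score: int) -> bool:
    """Pretend reviewer sub-agent: approves only high-scoring work."""
    return score >= 90


def run_static_plan() -> dict:
    """What a fixed 'work, then review' plan can express: each step exactly once."""
    score = stub_worker(0)
    return {"attempts": 1, "score": score, "approved": stub_reviewer(score)}


def run_looping_workflow(max_attempts: int = 5) -> dict:
    """Workflow as code: repeat work and review until approved or out of attempts."""
    score = 0
    for attempt in range(max_attempts):
        score = stub_worker(attempt)
        if stub_reviewer(score):
            return {"attempts": attempt + 1, "score": score, "approved": True}
    return {"attempts": max_attempts, "score": score, "approved": False}


print(run_static_plan())
print(run_looping_workflow())
```

The retry loop is trivial to write in a general-purpose language but awkward to express in a static acyclic graph, since the number of iterations is not known until the reviewer's result comes back.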