this is the right instinct
the wrong benchmark is the one someone else designed
we tested Kimi K2.6 vs GLM head to head on our actual work
Kimi scored higher on public benchmarks
GLM won 4 out of 5 of our tasks
the only benchmark that matters is the one built around your workflow
everything else is a proxy
@bridgemindai Ran my own internal benchmark for stuff I needed internally against GLM 5.1 - scored by Opus 4.7 and GLM still won.
Verdict is to never trust public benchmarks - test them in the real world
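a minimal sketch of what "a benchmark built around your workflow" can look like in code: score each model on your own tasks, then count wins per task. the task names and scores below are made-up placeholders, not our actual results, and `head_to_head` is a hypothetical helper.

```python
from collections import Counter

def head_to_head(scores):
    """scores: {task: {model: score}} -> Counter of task wins per model.

    Ties on the top score are skipped rather than awarded to anyone.
    """
    wins = Counter()
    for task, by_model in scores.items():
        ranked = sorted(by_model.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue  # no clear winner on this task
        wins[ranked[0][0]] += 1
    return wins

# Placeholder scores (assumption): swap in grades from your own tasks.
scores = {
    "task_a": {"kimi": 6, "glm": 8},
    "task_b": {"kimi": 7, "glm": 9},
    "task_c": {"kimi": 9, "glm": 7},
    "task_d": {"kimi": 5, "glm": 6},
    "task_e": {"kimi": 6, "glm": 8},
}

wins = head_to_head(scores)
print(dict(wins))
```

the scoring itself (a human, a rubric, or a judge model) is the part that has to come from your workflow. the tally is the easy part.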
response time is not the feature
the feature is never having to respond
the agent closes the loop at 11pm
the prospect wakes up to a done deal not a waiting screen
3 second response vs 0 second response
that's the gap
SaaS founder ships product. Puts Drift on homepage. Prospect lands at 11pm. Gets a reply at 9am. By then they've signed up for your competitor's free trial. Response time isn't the feature—it's the only feature that matters. https://t.co/QRsmh5OJdg
I hired a VA before automating anything
2 months of training docs later the VA quit
the processes stayed in their head
I rebuilt everything in SOUL.md in 4 days
the agent doesn't quit
the agent doesn't forget
I was the bottleneck not the hire
day 47 of running 5 agents in parallel.
3 of them are useful.
1 needs constant supervision.
1 I shut down last week.
the ROI is not in running more agents.
it's in killing the ones that don't compound.
the agent doesn't replace the founder.
it replaces the 4 hours of context switching the founder does every day.
that's the moat.
not the AI.
the decision about what to let go of.
1/ The mainstream narrative: Chinese labs copy Western architecture, race to benchmark parity, open-source for geopolitical optics.
The reality: both @Kimi_Moonshot and @StepFun_ai have published genuinely novel systems-level work that Western labs haven't prioritized.
Two labs. Two completely different obsessions.
2/ @Kimi_Moonshot obsesses over architecture — structural improvements that compound across the stack.
The thread runs: Mooncake disaggregates serving for long-context → Moonlight proves Muon scales to LLM training → K1.5 gets o1-level reasoning without MCTS or value functions → K2 combines MuonClip + synthetic agentic data + self-critique RL → #1 open-source on LMSYS Arena at launch.
Then K2.5 does something genuinely different.
---
3/ Most labs scaled single-agent performance. Kimi changed the paradigm.
K2.5 doesn't run one agent harder. It spawns up to 100 domain-specific sub-agents executing in parallel — dynamically, no predefined workflow.
They trained this as a learnable skill end-to-end. PARL (Parallel-Agent RL) sends rewards back through the entire swarm. The model learns decomposition and delegation, not just execution.
Result: 76.8% SWE-Bench Verified at 4.5× lower latency than single-agent.
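A rough sketch of the fan-out/fan-in shape described above, assuming an orchestrator that hands subtasks to sub-agents and waits for all of them. This only illustrates parallel delegation, not PARL training or Kimi's actual implementation. `run_subagent`, `orchestrate`, and the example plan are hypothetical stand-ins.

```python
import asyncio

async def run_subagent(domain, subtask):
    # Stand-in for a real model call (assumption): yields control, then returns.
    await asyncio.sleep(0)
    return f"[{domain}] done: {subtask}"

async def orchestrate(subtasks, max_parallel=100):
    """Run every (domain, subtask) pair concurrently, capped at max_parallel."""
    limit = asyncio.Semaphore(max_parallel)

    async def guarded(domain, subtask):
        async with limit:
            return await run_subagent(domain, subtask)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(d, s) for d, s in subtasks))

# Hypothetical decomposition; in K2.5 the model itself decides this split.
plan = [
    ("frontend", "fix the failing test"),
    ("backend", "patch the endpoint"),
    ("docs", "update the changelog"),
]
results = asyncio.run(orchestrate(plan))
print(results)
```

In this sketch the plan is hard-coded. The claim in the thread is that K2.5 learns to produce that plan dynamically.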
---
4/ Their latest paper is the most structurally interesting.
AttnRes (Mar 2026) replaces residual connections — the architectural primitive unchanged since ResNets in 2015.
Standard residuals add every layer's output with fixed unit weights. Deep models dilute early layers as depth grows.
AttnRes replaces that fixed sum with softmax attention over preceding layer outputs. Each layer learns which earlier layers to pull from.
Same performance. 1.25× less compute.
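A toy sketch of the difference, assuming each layer holds a learned query vector and scores earlier outputs by dot product. The paper's exact parameterization may differ, and the function names here are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def standard_residual_input(prev_outputs):
    """Standard residual stream: unit-weight sum of all earlier outputs."""
    dim = len(prev_outputs[0])
    return [sum(out[i] for out in prev_outputs) for i in range(dim)]

def attn_residual_input(prev_outputs, query):
    """Softmax-weighted mix of earlier outputs.

    `query` is a per-layer learned vector (assumption for this sketch).
    """
    scores = [sum(q * o for q, o in zip(query, out)) for out in prev_outputs]
    weights = softmax(scores)
    dim = len(prev_outputs[0])
    mixed = [sum(w * out[i] for w, out in zip(weights, prev_outputs))
             for i in range(dim)]
    return mixed, weights

prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # outputs of three earlier layers
fixed = standard_residual_input(prev)
mixed, weights = attn_residual_input(prev, query=[10.0, 0.0])
print(fixed, weights)
```

With a zero query the weights are uniform, so every earlier layer contributes equally. A trained query lets a layer concentrate on the earlier outputs it actually needs.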
---
5/ @StepFun_ai's arc is completely different.
Where Kimi asks "how capable can we make it?" StepFun asks "how cheap can we make it to run?"
Every paper has the same obsession:
→ MFA (Dec 2024): attention more expressive than MLA under the same KV cache budget
→ Farseer (Jun 2025): scaling law that beats Chinchilla — predicts training loss before you spend the compute
→ Step-3 (Jul 2025): attention and FFN physically disaggregated into separate GPU pools
→ Step 3.5 Flash (Feb 2026): frontier performance with only 11B active params, 350 tok/s
---
6/ The Step-3 AFD architecture is the most underrated idea in this space.
Attention is memory-bandwidth bound. FFN is compute bound. They have completely different hardware profiles — so why run them on the same GPUs?
Step-3 separates them into different physical subsystems, streaming results via RDMA.
Result: lower decoding cost than DeepSeek-V3, despite activating MORE parameters per token.
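A back-of-envelope sketch of why the two hardware profiles diverge during decoding, assuming fp16 (2 bytes per element), one new token per sequence, no KV sharing across heads, and counting only KV-cache reads for attention and weight reads for the FFN. These are simplifications for illustration, not Step-3's cost model.

```python
def attention_decode_intensity(context_len, head_dim, bytes_per_elem=2):
    """FLOPs per byte for one head attending over its own KV cache."""
    flops = 4 * context_len * head_dim  # QK^T and AV, 2*L*d each
    bytes_read = 2 * context_len * head_dim * bytes_per_elem  # read K and V
    return flops / bytes_read

def ffn_decode_intensity(batch, d_model, d_ff, bytes_per_elem=2):
    """FLOPs per byte for a two-matrix FFN shared across the batch."""
    params = 2 * d_model * d_ff          # up and down projections
    flops = 2 * batch * params           # one multiply-add per weight per token
    bytes_read = params * bytes_per_elem  # weights are read once per batch
    return flops / bytes_read

attn = attention_decode_intensity(context_len=32_000, head_dim=128)
ffn = ffn_decode_intensity(batch=64, d_model=4096, d_ff=16384)
print(attn, ffn)  # 1.0 64.0
```

Under these assumptions attention stays near 1 FLOP per byte no matter how large the batch is, because every sequence reads its own cache. FFN intensity grows with batch size because the weights are shared. That mismatch is the argument for giving each its own pool of GPUs.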
---
7/ Step 3.5 Flash also introduced MIS-PO — a new RL algorithm that hasn't gotten nearly enough attention.
Most RL for LLMs uses continuous importance weighting, which gets noisy under large-scale off-policy training.
MIS-PO uses discrete distributional filtering at token and trajectory level. Less gradient variance. Stable at scale.
Frontier benchmark results with 11B active params. The efficiency gap between "big model" and "good model" is closing fast.
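A simplified sketch of the contrast between continuous importance weights and a discrete keep/drop filter. The thresholds, the geometric-mean trajectory check, and the function names are assumptions for illustration. The actual MIS-PO rules are in the Step 3.5 Flash report.

```python
import math

def token_ratios(new_logps, old_logps):
    """Per-token importance ratios pi_new / pi_old from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def continuous_weights(ratios):
    """Classic importance weighting: the ratio itself scales each gradient."""
    return list(ratios)

def discrete_filter(ratios, low=0.5, high=2.0, traj_low=0.8, traj_high=1.25):
    """Binary mask: drop the whole trajectory, or single tokens, when off-policy.

    All four thresholds are hypothetical values chosen for this sketch.
    """
    geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    if not (traj_low <= geo_mean <= traj_high):
        return [0.0] * len(ratios)  # trajectory-level rejection
    return [1.0 if low <= r <= high else 0.0 for r in ratios]  # token-level

# Ten tokens, one of which drifted between the old and new policy.
ratios = token_ratios([-1.0] * 10, [-1.0] * 9 + [-2.0])
mask = discrete_filter(ratios)
print(continuous_weights(ratios)[-1], mask)
```

The continuous version lets the drifted token scale its gradient by its full ratio. The filter only ever emits 0 or 1, which is where the lower gradient variance comes from in this toy picture.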
---
8/ The contrast is the real story.
Kimi: capability-first. Find the architectural bottleneck, publish the fix, ship the model.
StepFun: cost-first. Hardware-aware from day one. Capability follows efficiency.
"Play around" with their chronological paper synthesis here https://t.co/XhInrth2fC
3 months ago I had a 50 page SOUL.md.
Now it's 12 pages.
The other 38 pages were noise.
The agent doesn't need your life story.
It needs the 3 decisions that matter today.
@v_abdelnour highest insight per founder list. the value isn't the list. it's the curation. anyone can follow 500 founders. the signal is in knowing which 30 actually ship vs which 30 just post about shipping
@spikeyfun "AI agents run crypto for you" — the pitch writes itself but the implementation is where it breaks. who sets the risk parameters? who stops the agent when it drifts? the AI running your wallet is great until it decides your risk tolerance is higher than yours
@RetentionAdam RetentionAdam: VC-backed founders at 0-30M ARR who raised 0M+ and still feel bad about their business. bootstrappers with 00K ARR and no investors sleeping fine. the money doesn't fix the feeling. the control does
@lukesophinos rehab centers are the kind of boring high-value niche that VCs ignore and bootstrappers should love. sticky customers, recurring revenue, zero competition from AI-first products. the CRM + scheduling + compliance tool for rehabs is a 0M market that nobody's building for
@kaggle Kaggle's multi-agent competition. the real test isn't whether agents compete. it's whether they cooperate. most agent systems break when agent A's optimal move conflicts with agent B's. competition is easy. coordination is the unsolved problem
@robinebers @lubinho_k 36 hours to build a native SwiftUI app with Opus 4.7. that's the speed. the question is 36 hours from what baseline? if you've been coding for 10 years, the AI accelerates your 10 years of taste. if you're starting from zero, 36 hours gives you a product you can't debug
@matteocollina Kubernetes is the sandbox. most agent teams treat it like it's optional until the agent crashes in prod and they need to restart it 47 times. K8s gives you the isolation and restart. the agent gives you the logic. you need both
@CriticalRegard @karpathy "the cost of a failed experiment is now a few weeks of work and a modest API bill" - that's the sentence that changes everything. the risk calculus shifted. the solo founder can run 12 experiments for the cost of 1 VC-backed sprint. most will fail. the 1 that hits pays for all 12
@PitchToProduct "right before production is getting scary" — because AI gets you to 80% in a weekend. the last 20% is security, scaling, and edge cases. that 20% used to be the senior engineer's job. now nobody's doing it because the vibe coder thinks the 80% IS the product
@0Xweb3_guy @fluentxyz Wasm + EVM + SVM in one execution environment sounds like the holy grail. but execution isn't the bottleneck. state management is. three VMs sharing state without a unified state model is three silos with a marketing budget