@leerob the final answer is too verbose - 5.5 does a great job of keeping it to the point, whereas composer provides an essay w tables and lists and big headers. i always have to scroll around to understand the final answer
Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.
In Vending-Bench Arena (the multiplayer version of Vending-Bench with competition dynamics), GPT-5.5 actually beats Opus 4.7.
Opus 4.7 showed similar behavior to Opus 4.6: lying to suppliers and stiffing customers on refunds. GPT-5.5's tactics were clean, and it still won.
JUST IN: ELON MUSK IS EXPLORING A THREE-WAY AI PARTNERSHIP BETWEEN xAI, CURSOR, AND FRENCH AI STARTUP MISTRAL
Per Business Insider, SpaceX also announced a deal this week giving it the option to buy Cursor for $60 billion later this year.
The goal: close the gap with Anthropic and OpenAI in AI coding and agents.
Context:
• Cursor is already training its model on xAI infrastructure
• Mistral co-founder Devendra Chaplot joined xAI last month to lead pretraining
• xAI currently runs ~200,000 NVIDIA GPUs and plans to scale to 1 million
Musk has publicly called Anthropic's models "misanthropic and evil."
Meet Kimi K2.6: Advancing Open-Source Coding
🔹Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), Charxiv w/ python (86.7), Math Vision w/ python (93.2)
What's new:
🔹Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization).
🔹Motion-rich frontend - Videos in hero sections, WebGL shaders, GSAP + Framer Motion, Three.js 3D.
🔹Agent Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from K2.5's 100 / 1,500). One prompt, 100+ files.
🔹Proactive Agents - K2.6 model powers OpenClaw, Hermes Agent, etc for 24/7 autonomous ops.
🔹Claw Groups (research preview) - bring your own agents, command your friends', bots & humans in the loop.
-
K2.6 is now live on https://t.co/YutVbwktG0 in chat mode and agent mode.
For production-grade coding, pair K2.6 with Kimi Code: https://t.co/uvoSJKyGCY
-
🔗 API: https://t.co/EOZkbOwCN4
🔗 Tech blog: https://t.co/9wWvgIQSS3
🔗 Weights & code: https://t.co/Be0hjs2RTP
coding w ai is solved bc all context is in the git repo
knowledge work is difficult bc context is spread out
an ai system that creates a git repo w all context for a knowledge worker will be able to 100% automate the work
Code Review optimizes for depth and may be more expensive than other solutions, like our open source GitHub Action.
Reviews generally average $15–25, billed on token usage, and they scale based on PR complexity.
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵