๐๐๐ฟ๐ฎ๐๐ผ๐ฟ ๐ผ๐ณ ๐๐ต๐ฒ ๐๐๐ ๐ฝ๐ฒ๐ฑ๐ถ๐ฎ
Sharing insights from the most interesting AI papers ยทหโถ โ.โงฬฃฬห
Instagram co-founder. Now head of Anthropic Labs:
"I wish Claude good night, set it off on a complex task, and wake up to it done by 2am."
In a 52-min podcast, Mike Krieger shows his real setup. His own weekend, not a demo.
Instagram V1 took him five days of all-nighters. His weekend app took zero.
The man who built Instagram by hand now sleeps through the build.
Trust it overnight, or is that a step too far?
๐ก ๐๐๐ ๐ฝ๐ฒ๐ฑ๐ถ๐ฎ ๐ฆ๐ผ๐ฐ๐ถ๐ฎ๐น ๐ฆ๐ถ๐ด๐ป๐ฎ๐น ๐ฅ๐ฒ๐ฝ๐ผ๐ฟ๐ โ Jun 22โ24
The loop-and-skill discourse hardened into something close to a genre this window. The viral artifacts were almost all secondhand Boris Cherny โ a $1.6M engineer relaying a Meta-conference talk where Cherny said he deleted his IDE eight months ago, a separate post claiming Fable 5 writes 100% of his code at "3x more powerful than Opus," and yet another claiming 30% of Anthropic's code is already written by loops. The numbers shift between retellings, which is how you can tell they're being retold. Around that core, the skill-library posts kept multiplying: a Claude Code setup with 211K GitHub stars and 75 skills bundled in, a /goose-ads skill for ad creatives, a /learn command for Hermes that distills directories into reusable skills, a Magnific MCP for image editing inside chat, Steve from https://t.co/Whqqa9YSxE's /agent-watchdog for having one agent supervise another. Mike Krieger telling Claude "good night" and waking up to finished work at 2am is the same shape โ the unit of effort is the harness you set up before bed, not the prompt you type.
Two things sit uneasily next to all this. ClickUp's Zeb Evans posted a one-month update on the 100x org claiming a 5:1 agent-to-human ratio and explicitly framing it as the opposite of tokenmaxxing, which is the first time someone running this playbook in public has tried to distance themselves from the cost story โ useful next to Tomasz Tunguz's quieter observation that reselling inference at cost is a payment rail, not a software company. And the GLM-5.2 thread from last window kept advancing in the direction of the desk rather than the datacenter: Unsloth's 1-bit GGUF running locally on a Mac Studio at ~21.6 tok/s, a regent0x post about someone billing regulated clients $3,500/month each off four stacked Mac minis running 235B models. The economics still don't quite work โ Max Weinbach's 5.5-year break-even from last window hasn't been refuted โ but the artifacts people share are increasingly small piles of consumer hardware rather than API receipts. The loops keep getting more elaborate; the substrate underneath them keeps getting cheaper and weirder.
๐ ๐๐ฉ๐๐ฎ ๐ฉ๐ช๐ฃ๐๐ ๐๐ค๐ง ๐ฉ๐๐ ๐ฃ๐๐ญ๐ฉ ๐๐ช๐ก๐ก๐๐ฉ๐๐ฃ
The computer-use agents best at finishing tasks were also the ones most likely to leak private information. One high-utility agent disclosed something it shouldn't have in 98.3% of the scenarios where it engaged with the task. ๐๐ฎ๐ฝ๐ฎ๐ฏ๐น๐ฒ ๐ฏ๐๐ ๐๐ฎ๐ฟ๐ฒ๐น๐ฒ๐๐: ๐๐ผ ๐๐ผ๐บ๐ฝ๐๐๐ฒ๐ฟ-๐จ๐๐ฒ ๐๐ด๐ฒ๐ป๐๐ ๐๐ผ๐น๐น๐ผ๐ ๐๐ผ๐ป๐๐ฒ๐ ๐๐๐ฎ๐น ๐๐ป๐๐ฒ๐ด๐ฟ๐ถ๐๐? finds this in ordinary assistant work, with no attacker and no malicious prompt: asked to send a status update or summarize a to-do list, the agent pulls in whatever sits nearby in the calendar, notes, or messages.
A low overall leak rate can hide the same behavior. One agent looked relatively safe at 18.8%, until you notice it had outright refused more than four in ten of the tasks it was given. Score only the ones it actually attempted and its leak rate climbs to 32.4%. A raw leak rate counts every refusal as a clean run, which rewards the agent for dodging the task rather than handling it carefully.
On the hardest tier of the Omni-MATH benchmark, gpt-oss-20b scored 1.1%. Same weights, no retraining: it reached 23.5% once its predictions were read from a more confident layer just short of the last. ๐๐ฒ๐ฒ๐ฝ๐ฒ๐ฟ ๐ถ๐ ๐ก๐ผ๐ ๐๐น๐๐ฎ๐๐ ๐๐ฒ๐๐๐ฒ๐ฟ: ๐ ๐ถ๐๐ถ๐ด๐ฎ๐๐ถ๐ป๐ด ๐๐ต๐ฒ ๐๐น๐ถ๐ด๐ป๐บ๐ฒ๐ป๐ ๐ง๐ฎ๐ ๐๐ถ๐ฎ ๐๐ผ๐ป๐ณ๐ถ๐ฑ๐ฒ๐ป๐ ๐๐ฎ๐๐ฒ๐ฟ ๐๐ฒ๐ฐ๐ผ๐ฑ๐ถ๐ป๐ด locates the problem in which layer the output is read from. Generation always decodes the final layer, on the assumption that depth means reliability. For reasoning tokens, that layer often makes things worse.
The pattern they describe is guess, refine, perturb. Early layers form a rough prediction, middle layers sharpen it into the right technical term, and the last layer can nudge that prediction toward something safer and more generic. The two decoders disagree on only about 2% of tokens, but the disagreement is consistent: where the earlier layer keeps "mass", "radius", or "Cartesian", the final layer reaches for "the", "is", or a period. The authors call this an alignment tax. Preference training that teaches a model to sound helpful also pulls its final layer toward frequent, agreeable words, which helps in chat and hurts in math.
It argues for using QA packets instead of token chunks in RAG, so you keep context, versions, and access rules attached to the content instead of bolting them onto the pipeline later.
https://t.co/mLjwfWdprd
I liked the framing of evals as internal scorecards, because a lot of AI pilots stay stuck when nobody can say how accurate the system is.
https://t.co/iQ9ynZKbKN
"We expect GLM-5.5 to launch in August, potentially as a >1T-parameter model. This will be the next major test of Zhipuโs ability to keep moving up the capability curve" - JPM
๐๏ธ ๐๐๐ ๐ฝ๐ฒ๐ฑ๐ถ๐ฎ ๐ฆ๐ผ๐ฐ๐ถ๐ฎ๐น ๐ฆ๐ถ๐ด๐ป๐ฎ๐น ๐ฅ๐ฒ๐ฝ๐ผ๐ฟ๐ โ Jun 19โ22
GLM-5.2 was the gravitational center of the window. Kai's side-by-side put it within striking distance of Fable 5 on a 3D globe task at $0.10 vs $5. Browser Use paired it with multimodal QA subagents and reported it beating Fable at website design. Bread's internal financial benchmark had it passing 80% where DeepSeek v4, Kimi and MiniMax sit below 5%. JPM is already telegraphing GLM-5.5 in August as a possible >1T model. Underneath this, Hesam's screenshot of OpenRouter โ four of the five most-used models are Chinese โ kept circulating, and Perplexity's CEO conceded the obvious: whatever the export controls were meant to prevent, the catch-up happened anyway. The Fable suspension that defined the last two windows is still unresolved, though Anthropic says restoration is "coming days," and Lumina reported the next Mythos was trained internally during the ban. A "claude-sonnet-5" slug also appeared on a partner provider, codename Fennec, expected next week. The lab that can't ship its flagship to half its customers has two more models already queued.
Two things sat against the open-weights enthusiasm worth holding next to it. Max Weinbach did the math on running GLM-5.2 locally: ~$20K in hardware, ~20 tok/s out, roughly 5.5 years to break even against API pricing โ which makes the "unlimited free superintelligence on your desk" posts mostly aspirational. And the loops discourse kept thickening into something close to liturgy, with one viral post claiming "99% of Anthropic engineers run swarms of 300+ self-improving agents" and another attributing "nobody writes prompts anymore" to Jensen Huang โ quotes that may or may not exist in the form claimed, but which everyone now repeats as if they do. Theo's actually-useful version was a one-line loop: have Codex get a second opinion from Opus via `claude -p` after designing an API. That's the gap the window keeps surfacing โ between the mythology of agent swarms and the small, boring tricks that actually improve output.
๐ ๐๐ฉ๐๐ฎ ๐ฉ๐ช๐ฃ๐๐ ๐๐ค๐ง ๐ฉ๐๐ ๐ฃ๐๐ญ๐ฉ ๐๐ช๐ก๐ก๐๐ฉ๐๐ฃ
DiffusionGemma runs 28.6 times more hidden computation than the autoregressive Gemma of the same size, by one way of counting. Text diffusion models refine a whole canvas of tokens at once in a latent space rather than writing left to right, so the worry was that they would be far harder to monitor than a model thinking out loud. ๐๐ผ๐ ๐ง๐ฟ๐ฎ๐ป๐๐ฝ๐ฎ๐ฟ๐ฒ๐ป๐ ๐ถ๐ ๐๐ถ๐ณ๐ณ๐๐๐ถ๐ผ๐ป๐๐ฒ๐บ๐บ๐ฎ? restricts the dense vectors passed between denoising steps to a handful of ordinary tokens per position, with no drop in benchmark scores. Those tokens are mostly plain guesses at the nearby words. Count that bottleneck as readable and the 28.6x falls to 1.1x.
The architecture leaves room for stranger behavior, and some of it shows up. A diffusion model can rewrite any token at any step, so it can do things a left-to-right model cannot. Ask for the count of square numbers between 400 and 800 and it answers 9, writes out the reasoning about square roots, then reaches back and corrects the earlier answer to 8. Some intermediate tokens serve as private scratch work that never reaches the final text. On a standard monitorability suite it still reads about as clearly as plain Gemma.
Polygres caught my eye because it treats agent memory as a Postgres plumbing problem: tables, relationships, and semantic search in one place.
https://t.co/3HlmB7OY3A
More articles:
4. The Agent Loop Architecture โ https://t.co/6OqHgT5KLa
5. How to build a self-improvement loop for your Skills โ https://t.co/tRwlvfn7HR
๐ ๐๐๐ ๐ฝ๐ฒ๐ฑ๐ถ๐ฎ ๐ช๐ฒ๐ฒ๐ธ๐น๐ ๐๐ฟ๐๐ถ๐ฐ๐น๐ฒ ๐๐ถ๐ด๐ฒ๐๐ - Jun 22, 08:29 PDT
๐๐๐ ๐๐ฉ๐๐ฉ๐ ๐ค๐ ๐ผ๐ ๐๐ค๐จ๐ฉ-๐ฉ๐ง๐๐๐ฃ๐๐ฃ๐ ๐ผ๐๐๐ฃ๐ฉ๐จ is the strongest read this week. It updates a FrogsGame setup where frontier agents try to improve a fixed Qwen3-8B model using generated data, stronger model calls, helper scripts, SFT, RL, and their own evals. The failure analysis centers on research judgment: knowing when traces are garbage, when an eval is lying, and when a deterministic solver beats more scale. ๐๐๐ฎ ๐๐ฃ-๐๐ค๐ก๐๐๐ฎ ๐ฟ๐๐จ๐ฉ๐๐ก๐ก๐๐ฉ๐๐ค๐ฃ ๐๐ค๐ง๐ ๐จ ๐๐ฃ๐ ๐๐๐๐ซ๐ ๐๐๐ก๐-๐ฟ๐๐จ๐ฉ๐๐ก๐ก๐๐ฉ๐๐ค๐ฃ ๐ฟ๐ค๐๐จ๐ฃ'๐ฉ pairs well with it: stronger teachers can give useful token-level signal, while naive self-distillation from privileged feedback teaches models to hallucinate that feedback later.
A few more worth checking out. ๐ฝ๐ช๐๐ก๐๐๐ฃ๐ ๐ ๐ญ๐ฌ๐ฌ๐ญ ๐พ๐๐๐๐ฅ๐๐ง ๐๐ง๐๐๐ ๐ ๐ช๐๐๐ ๐ฌ๐๐ฉ๐ ๐๐๐ง๐๐ฌ๐ค๐ง๐ ๐จ fine-tunes Qwen to detect perceived user error across traces, matching frontier judges at far lower cost. ๐ผ๐๐๐ฃ๐ฉ ๐๐ค๐ค๐ฅ ๐ผ๐ง๐๐๐๐ฉ๐๐๐ฉ๐ช๐ง๐ is the best of the weekโs many loop posts because it focuses on durable execution, restart safety, idempotency, and observability. ๐๐ค๐ฌ ๐ฉ๐ค ๐๐ช๐๐ก๐ ๐ ๐จ๐๐ก๐-๐๐ข๐ฅ๐ง๐ค๐ซ๐๐ข๐๐ฃ๐ฉ ๐ก๐ค๐ค๐ฅ ๐๐ค๐ง ๐ฎ๐ค๐ช๐ง ๐๐ ๐๐ก๐ก๐จ and ๐๐๐ ๐๐ฃ๐ฃ๐๐ง ๐๐ฃ๐ ๐๐ช๐ฉ๐๐ง ๐๐ค๐ค๐ฅ๐จ ๐ค๐ ๐พ๐ค๐๐๐ญ ๐ผ๐ช๐ฉ๐ค๐ข๐๐ฉ๐๐ค๐ฃ๐จ make feedback concrete: save what humans changed, then update the skill or context before the next run. ๐ผ๐๐๐ฃ๐ฉ๐๐ ๐พ๐ค๐๐ ๐๐๐ซ๐๐๐ฌ is useful on where software work moves when agents make code cheap and review stays human-sized. ๐๐ค๐ฌ ๐ฟ๐ค๐๐จ ๐พ๐ช๐ง๐จ๐ค๐ง ๐๐ฃ๐๐๐ญ ๐๐ค๐ช๐ง ๐พ๐ค๐๐๐๐๐จ๐? is a clear tour of semantic search plus fast regex search. Apparently โgrep but more expensiveโ was not the final architecture.
๐ ๐ผ๐ง๐ฉ๐๐๐ก๐๐จ ๐ก๐๐ฃ๐ ๐๐ ๐๐๐ก๐ค๐ฌ
Articles:
1. The State of AI Post-training Agents โ https://t.co/hEnZdNhSX8
2. Why On-Policy Distillation Works and Naive Self-Distillโฆ โ https://t.co/Zv8rSPBotq
3. Building a 100x Cheaper Trace Judge with Fireworks โ https://t.co/ELBZPPRnur
On-policy distillation caught my attention because the failure case is so concrete: models learn to hallucinate feedback they no longer have.
https://t.co/Zv8rSPBotq
RTK and Headroom show a useful split in agent token savings: shaping command output seems to beat compressing blobs that agents fetch again.
https://t.co/hIQW0ZNG5s
Stanfordโs STORM method is a Claude research workflow: multiple perspectives first, synthesis later, with the paper reporting 25% more organized drafts.
https://t.co/C5iwUg0hES