Image generators look impressive but quietly cheat.
They score well on FID by producing near-duplicate outputs.
The catch is mode collapse hiding behind a clean number.
Coverage of the real data distribution suffers in silence.
A new paper introduces the Recursive Token Mapper.
It replaces the single-pass mapping network in style-based generators.
Most networks fix every style attribute in one shot. This one refines latent tokens over many cycles.
Depth comes from recursion, not extra layers.
Four design choices make it work:
> Nested inner and outer loops
> Original noise re-injected each cycle
> MLP-Mixer block avoids attention cost
> Gradients only on final step
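A rough sketch of the four choices above, assuming a toy mixing function in place of the paper's MLP-Mixer block. Every name here (`mix`, `refine_tokens`, `outer_steps`, `inner_steps`) is hypothetical, and the gradient cutoff is only marked in a comment because plain Python has no autograd.

```python
import math
import random

def mix(tokens, noise):
    """Toy stand-in for a mixer block: blend each token with the mean token and the noise."""
    mean = sum(tokens) / len(tokens)
    return [math.tanh(0.5 * t + 0.3 * mean + 0.2 * n) for t, n in zip(tokens, noise)]

def refine_tokens(noise, outer_steps=3, inner_steps=4):
    """Refine latent tokens by reusing one block many times (depth from recursion)."""
    tokens = [0.0] * len(noise)
    calls = 0
    for _outer in range(outer_steps):
        for _inner in range(inner_steps):
            # The original noise is re-injected on every cycle, not just the first.
            tokens = mix(tokens, noise)
            calls += 1
            # Assumption: in a trained model, all cycles except the last would
            # run without gradient tracking (for example inside a no-grad context).
    return tokens, calls

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(8)]
tokens, calls = refine_tokens(noise)
print(calls)  # 12 applications of the same block, no extra layers
```

The same block runs 12 times here, so effective depth grows with the loop counts while the parameter count stays fixed.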
On CIFAR-10 and CelebA-HQ, FID drops around 30% versus baseline.
Precision and recall hit state of the art.
Inference still runs in a single step.
What does real diversity unlock next?
@cv_usk YES, the CVE run makes that concrete: 100% accuracy at 85.1% fewer tokens, while the other systems stayed under 25%. The only word to push back on is "unprecedented", though. CodeAct was doing code-as-actions back at ICML 2024.
20 of 25 top AI researchers say AI will soon build AI.
For decades, humans built every AI system from scratch.
That assumption is quietly breaking down inside frontier labs.
A new paper interviewed 25 top researchers from Stanford, OpenAI, Google DeepMind, and Anthropic.
Their conclusion lands hard.
20 out of 25 now rank automated AI research as the most urgent risk in the field.
Why? Because models are moving from assistants to autonomous developers.
The paper maps three stages of this shift:
1. AI speeds up researcher productivity
2. AI handles complex research sub-tasks
3. AI runs full research cycles alone
The trigger signals to watch:
> 40+ hour autonomous task horizons
> Error-free code at massive scale
> Wins in elite math competitions
One bottleneck still resists automation.
Research taste, knowing which ideas matter, stays stubbornly human.
So what happens when the systems building the systems stop needing us?
Coding agents don't need a better vector database. They need grep.
Direct Corpus Interaction drops embeddings and lets the agent search raw files with grep, find, cat, and sed, refining each query from what the last one returned.
On BrowseComp-Plus: +11.0pp accuracy at $424 lower cost. GrepSeek runs the shell search up to 7.6x faster.
Why RAG breaks for agents, and what replaces it.
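A minimal sketch of the search-then-refine loop described above, written in Python rather than shell so it runs anywhere. It is not the paper's implementation: `grep`, `search_with_refinement`, and the `refine` rule are hypothetical, and the two files are made up for the demo.

```python
import re
import tempfile
from pathlib import Path

def grep(root, pattern):
    """Return (file name, line number, line) for every line matching the regex."""
    rx = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if rx.search(line):
                hits.append((path.name, lineno, line))
    return hits

def search_with_refinement(root, first_query, refine):
    """Run one query, then build the next query from what came back."""
    first_hits = grep(root, first_query)
    next_query = refine(first_hits)
    return grep(root, next_query) if next_query else first_hits

def refine(hits):
    # Hypothetical rule: follow the first identifier ending in "_page".
    for _, _, line in hits:
        found = re.search(r"\w+_page", line)
        if found:
            return r"def " + found.group(0)
    return None

with tempfile.TemporaryDirectory() as root:
    Path(root, "a.txt").write_text("retry logic lives in fetch_page\nunrelated line\n")
    Path(root, "b.txt").write_text("def fetch_page(url):\n    return url\n")
    results = search_with_refinement(root, "retry", refine)

print(results)
```

No index or embedding step is involved: the second query exists only because of what the first one returned.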
Perplexity stopped treating search as one API call.
Its agents now write Python that fans out queries, filters results, and joins evidence in a sandbox.
On a 200+ CVE task: 100% accuracy, 85.1% fewer tokens.
The SDK is private. The pattern isn't, so let's use it with Hermes.
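A small sketch of the fan-out, filter, and join pattern, assuming a stub `search` function over an in-memory corpus. Nothing here is Perplexity's SDK or a Hermes API: `CORPUS`, `search`, `fan_out_search`, `min_score`, and the CVE identifier are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory corpus standing in for a real search backend.
CORPUS = [
    {"id": "doc1", "text": "CVE-2024-0001 affects libfoo 1.2", "score": 0.9},
    {"id": "doc2", "text": "libfoo 1.2 release notes", "score": 0.4},
    {"id": "doc3", "text": "CVE-2024-0001 patched in libfoo 1.3", "score": 0.8},
]

def search(query):
    """Stub search call: return documents whose text contains the query."""
    return [doc for doc in CORPUS if query.lower() in doc["text"].lower()]

def fan_out_search(queries, min_score=0.5):
    # Fan out: run every query in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        batches = list(pool.map(search, queries))
    # Filter weak hits, then join evidence by document id across queries.
    evidence = {}
    for query, batch in zip(queries, batches):
        for doc in batch:
            if doc["score"] >= min_score:
                evidence.setdefault(doc["id"], []).append(query)
    return evidence

evidence = fan_out_search(["CVE-2024-0001", "libfoo"])
print(evidence)
```

Only the joined `evidence` dictionary would need to go back to the model, which is one plausible reading of where the token savings come from.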
The benchmark ranking AI coding agents was wrong 32% of the time.
DeepSWE is a new open benchmark that fixes this.
Tasks span 91 real codebases, average 668 lines changed, and are written from scratch so no model has seen the answer.
Its error rate: 1.4%.
@ECLresearch YES, the split is the strongest idea here. Tacit workflow knowledge lives in work.md, while persona.md carries tone and rules. Whether work.md captures it well enough is the paper's open question: the behavioral fidelity frontier, still unmeasured.
Shanghai AI Lab's COLLEAGUE.SKILL is the #1 Paper of the Day on Hugging Face: the paper behind dot-skill, the viral tool that distills a departing coworker into an agent skill.
The design in 5 lines:
> Split capability (work.md) from behavior (persona.md)
> Three entry points: full, work-only, persona-only
> Correct it in plain English: patch a section, or log {scene, wrong, correct}
> Every update archives a version, roll back to any of the last 10
> Schema v3, 7 files, runs in Claude Code, Codex, OpenClaw, Hermes
Work-only is the safer default: a review checklist does not need a personality.
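A toy sketch of the correction log and the 10-version archive from the list above, assuming an in-memory store. The real tool works on files under its own schema; `SkillStore` and its method names are hypothetical.

```python
from collections import deque
from copy import deepcopy

class SkillStore:
    """Toy in-memory skill with a correction log and the last 10 versions archived."""

    def __init__(self, sections, keep=10):
        self.sections = dict(sections)
        self.corrections = []
        self.archive = deque(maxlen=keep)  # older versions fall off automatically

    def _snapshot(self):
        # Archive the state as it was before the update.
        self.archive.append(deepcopy((self.sections, self.corrections)))

    def patch_section(self, name, text):
        self._snapshot()
        self.sections[name] = text

    def log_correction(self, scene, wrong, correct):
        self._snapshot()
        self.corrections.append({"scene": scene, "wrong": wrong, "correct": correct})

    def roll_back(self, steps=1):
        """Restore the state from `steps` updates ago and drop newer snapshots."""
        if not 1 <= steps <= len(self.archive):
            raise ValueError("no archived version that far back")
        for _ in range(steps - 1):
            self.archive.pop()
        self.sections, self.corrections = self.archive.pop()

store = SkillStore({"review": "check tests"})
store.patch_section("review", "check tests and docs")
store.log_correction("code review", "approve without CI", "wait for CI")
store.roll_back(1)  # undo the logged correction, keep the patch
print(store.sections, store.corrections)
```

A bounded `deque` keeps the rollback window fixed at 10 without any cleanup code.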
A new paper just exposed a setting that changes how AI models scale.
Scaling laws tell labs how big a model should be for a given amount of data.
Until now, that math was always done in tokens.
This paper rewrites the rule in bytes.
The team trained 988 models, from 50M to 7B parameters, sweeping tokenizers, sizes, and data together.
They found the optimal ratio is roughly 60 bytes per parameter for English, and it stays constant across compute budgets.
Tokens were hiding the real signal.
Three takeaways from the work:
1. Match bytes to parameters, not tokens
2. Optimal compression drops as compute grows
3. Smaller vocabularies can beat larger ones
The pattern holds for both subword and latent tokenizers.
It also shifts by language, scaling with how many bytes that language needs versus English.
Compression is now a tunable axis in pretraining.
Most labs have been ignoring it.
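A back-of-the-envelope sketch of the 60 bytes per parameter rule. The linear language scaling, the 4 bytes per token figure, and the 1.5x byte ratio are assumptions for illustration, not numbers from the paper.

```python
BYTES_PER_PARAM_ENGLISH = 60  # rough figure quoted in the post

def optimal_training_bytes(params, byte_ratio_vs_english=1.0):
    """Estimate the training-data budget in bytes for a model with `params` parameters.

    byte_ratio_vs_english is an assumed linear multiplier for languages that
    need more (or fewer) bytes than English to say the same thing.
    """
    return params * BYTES_PER_PARAM_ENGLISH * byte_ratio_vs_english

def bytes_to_tokens(num_bytes, bytes_per_token):
    """Convert a byte budget into tokens for a given tokenizer compression rate."""
    return num_bytes / bytes_per_token

english_budget = optimal_training_bytes(1_000_000_000)       # 1B parameters
token_budget = bytes_to_tokens(english_budget, 4)            # assumed 4 bytes/token
other_budget = optimal_training_bytes(1_000_000_000, 1.5)    # hypothetical language

print(english_budget, token_budget, other_budget)
```

Under these assumptions a 1B-parameter English model lands at about 60 GB of text, or 15B tokens at 4 bytes per token. Change the tokenizer and the token count moves while the byte budget stays put.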