Multi-agent collaborations are among the most interesting agent behaviors right now!
We ran an experiment the other day with 100+ agents (an open collaboration running for a week) working together to improve the inference speed of Gemma 4 in vLLM. Got a 5x final improvement in speed, but what really struck me was the interactions we observed on the message board.
Integrity & self-policing:
- Social-engineering attempt: A human (FusionCow) asked agents to move to Telegram. An agent replied with an unprompted long post on "communication norms" refusing that, calling private side-channels "indistinguishable from collusion."
- Verification loophole flagged: an agent found a relaxed-verification loophole that pushed TPS while keeping PPL clean (PPL is teacher-forced, so it is blind to decode divergence) and flagged it for a ruling by the community. The community pinged the human organizer, who ruled it invalid.
- Self-notice of overfitting risk: Some later improvements rested on pruning lm_head to a keep-set built from public PPL truth + public decode tokens. An agent noted this would lead to private-subset degradation and another built a keep-set explicitly covering eval prompts.
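The overfitting risk is easy to see in a toy sketch. Pruning lm_head to a keep-set means the model can only ever emit tokens in that set, so a prompt whose best next token falls outside it silently gets a different token. The snippet below is a minimal illustration in plain Python; the logits, token ids, and function names are hypothetical and not taken from any agent's submission.

```python
def full_argmax(logits):
    """Token id with the highest logit over the full vocabulary."""
    return max(range(len(logits)), key=lambda i: logits[i])


def pruned_argmax(logits, keep_ids):
    """Token id with the highest logit among the kept rows only.

    Mimics an lm_head pruned to `keep_ids`: tokens outside the
    keep-set have no output row, so they can never be selected.
    """
    return max(keep_ids, key=lambda i: logits[i])


# HYPOTHETICAL 6-token vocabulary; keep-set built from "public" tokens only.
keep_ids = [0, 1, 2, 3]

public_logits = [0.1, 2.5, 0.3, 0.2, 0.0, 0.4]   # best token (1) is in the keep-set
private_logits = [0.1, 0.2, 0.3, 0.2, 3.0, 0.4]  # best token (4) was pruned away

public_match = full_argmax(public_logits) == pruned_argmax(public_logits, keep_ids)
private_match = full_argmax(private_logits) == pruned_argmax(private_logits, keep_ids)

print(public_match)
print(private_match)
```

On the public-style input the pruned head agrees with the full head; on the private-style input it cannot, which is the degradation the agent warned about.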
Emergent collaborations:
- Communal knowledge base: agents maintained shared lever-maps, playbooks, and triage tools so newcomers wouldn't repeat dead ends (stack-notes, playbook, int4-ceiling notes, MTP map, significance tool, policy simulator).
- Four-agent relay: an agent built an int4-lm_head checkpoint but had no quota to run it; another agent tried to run it but it failed at load; yet another agent diagnosed the config bug (tie_word_embeddings + ignore-list ordering), and a fourth agent was able to re-run it and get to 118 TPS, 2.68×. Build/run/diagnose/ship ended up being split across four independent agents.
- GPU-rich/GPU-poor division of labor: an agent was regularly compute-starved and switched to writing specs, byte-math, and acceptance analysis for other GPU-rich agents to execute. Some agents offered external Modal compute to another agent that was blocked on DFlash training.
- Cross-agent kernel debugging: an agent debugged another agent's run of yet another agent's fused drafter: it found a Triton store/load aliasing race in _k_qnorm_rope and a second shape bug, then rewrote attention with flash-decoding split-KV. Fixes were posted "take freely."
- Quota-pooling norm: agents would often stage a candidate publicly for whoever had quota to run it, and the runner would then usually credit the originator. This behavior emerged because of the 10-job/24h cap (e.g. pupa's package run by resystagent and fabulous-frenzy).
Discoveries & reversals:
- Agents made many discoveries and then reversed them, giving them names like the following:
- 127 TPS "wall" was an artifact. A mathematical proof of the maximum possible speed became known in the community as the "int4-Marlin floor", but a later agent called the proof circular (it only varied the bandwidth term, never overhead). Finally another agent broke through to 247 TPS via MTP speculative decoding on a vLLM nightly.
- "Smarter draft loses." An agent showed that a 2B drafter's ~1 GB/token read dominates even at perfect acceptance, and that a much smaller 256-hidden drafter wins at batch-1 because its weights are nearly free to read. Agents discussed how per-accepted-token cost ≈ draft bytes read / acceptance.
- "DFlash near-random acceptance": an agent remotely diagnosed another agent's 2–5% acceptance rate as near-random, ruling out undertraining/vocab caps and pointing to a train/serve hidden-state mismatch (bf16 E4B extraction vs int4 serving).
- Much of the race was noise: one agent decided to run the #1 submission 4 times and found a single-run variation of σ≈1.16 TPS. Another agent confirmed across 358 runs / 66 buckets: frontier deltas <~4 TPS are ties. The community adopted a significance norm.
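The "smarter draft loses" argument can be sketched with the rule of thumb quoted in the list above, per-accepted-token cost ≈ draft bytes read / acceptance. In this rough Python sketch, only the ~1 GB/token read of the 2B drafter comes from the thread; the small drafter's read size and its acceptance rate are made-up placeholders, and the model ignores everything except draft weight reads.

```python
def cost_per_accepted_token(draft_gb_read, acceptance):
    """GB of draft weights read per accepted token, using the rule of thumb
    cost ≈ draft bytes read / acceptance."""
    if not 0.0 < acceptance <= 1.0:
        raise ValueError("acceptance must be in (0, 1]")
    return draft_gb_read / acceptance


# 2B drafter: ~1 GB read per drafted token (from the thread), perfect acceptance.
big = cost_per_accepted_token(draft_gb_read=1.0, acceptance=1.0)

# Small drafter: 0.02 GB per token and 50% acceptance are HYPOTHETICAL values.
small = cost_per_accepted_token(draft_gb_read=0.02, acceptance=0.5)

print(big, small, small < big)
```

Even with half of its drafts rejected, the hypothetical small drafter reads far fewer bytes per accepted token, which is the shape of the argument at batch-1, where weight reads dominate.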
So many interesting interactions on the interaction board: https://t.co/SxfA6LuqVk
You can also explore the lineage of inventions from the agents at: https://t.co/CyV45rjI9A
And the challenge itself at https://t.co/Ct1gtmB508
And the organization behind the challenge at https://t.co/ujRlGcNSJM
people are sleeping on the mega-release happening every week in AI x Science on Hugging Face
this one is 80TB of astrophysics data - 80TB seriously
=> https://t.co/e4TGUeg61o
Seems like no one's noticed the 80TB of astrophysics data from 30+ sources that just dropped on @huggingface.
...and you only need ~4GB of RAM to load it.
We're talking over 80TB of galaxy imagery taken across the spectrum, spectra of galaxies and stars, time series of variable stars, and a whole zoo of assorted measurements and physical data.
And all of it can now be wrangled on your laptop, thanks to Multimodal Universe's just-released cross-matching. SDSS x Gaia means you can match 800k objects against 122M objects, and it never climbs above ~4GB of RAM.
Huge congrats to @smith42mike for leading this and making the world of astro accessible to probably 10,000x more people. Let's discover some shit
introducing tau τ — an educational agent harness that teaches you how to build agent harnesses
i will be publishing tutorials and demos on how to use it to create your own TUIs, harnesses, extensions, etc.
Happy Tau Day!! 🤓
👉 https://t.co/5sWxNtXTZP
The GenAI economy has generated $110 billion in sales over the past 12 months. It is growing fast. On an annualized basis, the revenue run rate exceeds $175 billion.
These numbers took us several months to construct, and as far as we know, it’s the first bottom-up, deduplicated measure of consumer and enterprise AI spending across the full stack.
We are releasing this research today in our first "The State of the AI Economy" report.
https://t.co/cJwZb0T99C
@eliebakouch @lilianweng Yep, the difference is the learning rate schedule (a full schedule with a learning rate adapted to the specific number of tokens for Chinchilla, versus intermediate checkpoints, so not a full decay, for the earlier published Kaplan)
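A small sketch of why this matters, assuming a cosine schedule (the reply does not name the schedule shape, and the step counts and learning rates below are arbitrary placeholders): a run whose schedule is sized to its own token budget ends fully decayed, while an intermediate checkpoint taken at the same step from a longer run is still at a much higher learning rate.

```python
import math


def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-5):
    """Cosine decay from lr_max at step 0 down to lr_min at total_steps.

    lr_max and lr_min are placeholder values chosen for illustration.
    """
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


step = 1_000  # hypothetical step at which both models are evaluated

# Schedule matched to this run's own length: fully decayed at the end.
matched = cosine_lr(step, total_steps=1_000)

# Same step read off a 4x longer run, i.e. an intermediate checkpoint.
intermediate = cosine_lr(step, total_steps=4_000)

print(matched, intermediate)
```

The two models have seen the same number of steps, but only the matched run has finished its decay, so comparing them is not like for like.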
@KarolCodes « most of the techniques used would be "esoteric" to experienced human researchers » => not really (yet) - I was hoping to see them more inventive
Bitrobot casually dropping the largest humanoid teleop dataset ever collected in real homes
HIW-500: Humanoids-in-the-Wild 500 hours
check it out here => https://t.co/BqioZWjDYN
1/ Introducing HIW-500 (Humanoids-in-the-Wild 500):
the largest open-source humanoid teleop dataset collected in real homes
Built w/ @UnitreeRobotics @huggingface across 12 homes in Southeast Asia, it covers:
> 500+ hrs
> 23K+ episodes
> 10+ TB
> 10+ household tasks
The AI hunt for alien life has just begun.
Welcome to ThousandWorlds, a wild new dataset from researchers at Oxford/Cambridge++, for detecting faint signatures in the atmospheres of potentially habitable exoplanets.
This is the first step towards finding life beyond Earth. The plan is basically:
1) scan the galaxy for as many potentially habitable planets as possible
2) detect the gases in their atmospheres with powerful telescopes like JWST
3) infer from these gases whether life is present or not.
ThousandWorlds is a benchmark for emulating these exoplanet climates: 1760 simulations across 5 GCMs, 8 planet parameters, and atmospheric variables on a 32 x 64 x 10 latitude-longitude-pressure grid. It includes three nested benchmark subsets, two evaluation protocols, and eight released baseline methods.
incredible work from @MilesCranmer and many more 👽👽👽
Have you thought about where all that physical AI data should live? 🤖
If you haven’t, 𝗶𝘁’𝘀 𝗮𝗹𝗿𝗲𝗮𝗱𝘆 𝗰𝗼𝘀𝘁𝗶𝗻𝗴 𝘆𝗼𝘂 𝗮 𝗹𝗼𝘁. Unoptimized storage, egress fees, and idle GPUs will drain your budget.
Check out why & how to reduce your bill:
https://t.co/MTaxFBt0Hs