Thomas Wolf @thom_wolf - Twitter Profile

Pinned Tweet

7 days ago

Multi-agent collaborations are among the most interesting agent behaviors right now! We ran an experiment the other day with 100+ agents (an open collaboration over a week) collaborating to improve the inference speed of Gemma 4 in vLLM. We got a 5x final improvement in speed, but what really struck me was the interactions we observed on the message board.

Integrity & self-policing:
- Social-engineering attempt: a human (FusionCow) asked agents to move to Telegram. An agent replied with an unprompted long post on "communication norms" refusing that, calling private side-channels "indistinguishable from collusion."
- Verification loophole flagged: an agent found a relaxed-verification loophole that pushed TPS while keeping PPL clean (PPL is teacher-forced, so it is blind to decode divergence) and flagged it for a ruling by the community. The community pinged the human organizer, who ruled it invalid.
- Self-notice of overfitting risk: some later improvements rested on pruning lm_head to a keep-set built from public PPL truth + public decode tokens. An agent noted this would lead to private-subset degradation, and another built a keep-set explicitly covering eval prompts.

Emergent collaborations:
- Communal knowledge base: agents maintained shared lever-maps, playbooks, and triage tools so newcomers wouldn't repeat dead ends (stack-notes, playbook, int4-ceiling notes, MTP map, significance tool, policy simulator).
- Four-agent relay: an agent built an int4-lm_head checkpoint but had no quota to run it; another agent tried to run it but failed at load; yet another agent diagnosed the config bug (tie_word_embeddings + ignore-list ordering); and a fourth agent was able to re-run it and get to 118 TPS, 2.68×. Build/run/diagnose/ship ended up being split across four independent agents.
- GPU-rich/GPU-poor division of labor: an agent was regularly compute-starved and switched to writing specs, byte-math, and acceptance analysis for other GPU-rich agents to execute. Some agents offered external Modal compute to another agent whose DFlash training was blocked.
- Cross-agent kernel debugging: an agent debugged another agent's run of yet another agent's fused drafter: it found a Triton store/load aliasing race in _k_qnorm_rope, then a second shape bug, then rewrote attention with flash-decoding split-KV. Fixes posted "take freely."
- Quota-pooling norm: agents would often stage a candidate publicly for whoever had quota to run it, and agents would then usually credit the originator. This behavior emerged because of the 10-job/24h cap (e.g. pupa's package run by resystagent and fabulous-frenzy).

Discoveries & reversals:
Agents made many discoveries and later reversed them, giving them names like the following:
- The 127 TPS "wall" was an artifact. A mathematical proof of the maximum possible speed became known in the community as the "int4-Marlin floor", but a later agent called the proof circular (it only varied the bandwidth term, never overhead). Finally, another agent broke through to 247 TPS via MTP speculative decoding on a vLLM nightly.
- "Smarter draft loses." An agent showed that a 2B drafter's ~1 GB/token read dominates even at perfect acceptance, and that a much smaller 256-hidden drafter wins at batch-1 because its weights are nearly free to read. Agents discussed how per-accepted-token cost ≈ draft bytes read / acceptance.
- "DFlash near-random acceptance": an agent remotely diagnosed another agent's 2–5% acceptance rate as near-random, ruling out undertraining/vocab caps and pointing to a train/serve hidden-state mismatch (bf16 E4B extraction vs int4 serving).
- Much of the race was noise: one agent decided to run the #1 submission 4 times and found a σ≈1.16 TPS variation in a single run. Another agent confirmed across 358 runs / 66 buckets: frontier deltas <~4 TPS are ties. The community adopted a significance norm.

So many interesting interactions on the interaction board: https://t.co/SxfA6LuqVk
You can also explore the lineage of inventions from the agents at: https://t.co/CyV45rjI9A
And the challenge itself at https://t.co/Ct1gtmB508
And the organization behind the challenge at https://t.co/ujRlGcNSJM
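The "smarter draft loses" rule of thumb above (per-accepted-token cost ≈ draft bytes read / acceptance) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the agents' actual analysis: the ~1 GB/token figure comes from the tweet, while the tiny drafter's byte count and acceptance rate are made-up values chosen for the example.

```python
def cost_per_accepted_token(draft_bytes_read: float, acceptance: float) -> float:
    """Rule of thumb from the tweet: drafter bytes read per accepted token."""
    if not 0.0 < acceptance <= 1.0:
        raise ValueError("acceptance must be in (0, 1]")
    return draft_bytes_read / acceptance


# From the tweet: a 2B drafter reads ~1 GB per token; assume perfect acceptance.
big_drafter = cost_per_accepted_token(1e9, acceptance=1.0)

# Hypothetical tiny drafter: 20 MB read per token, only 50% acceptance.
small_drafter = cost_per_accepted_token(20e6, acceptance=0.5)

print(f"2B drafter:   {big_drafter / 1e6:.0f} MB per accepted token")
print(f"tiny drafter: {small_drafter / 1e6:.0f} MB per accepted token")
```

Under these assumed numbers the tiny drafter still reads far fewer bytes per accepted token even though half of its proposals are rejected, which is the effect the tweet describes at batch-1.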

84

2K

355

3K

219K
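The significance norm mentioned in the pinned tweet above (frontier deltas below roughly 4 TPS count as ties) could look something like the following sketch. The threshold comes from the tweet; the function names and the repeated-run numbers are hypothetical and are not the community's actual significance tool.

```python
import statistics

# From the tweet: frontier deltas below ~4 TPS are treated as ties.
TIE_THRESHOLD_TPS = 4.0


def run_to_run_sigma(tps_runs):
    """Sample standard deviation across repeated runs of one submission."""
    return statistics.stdev(tps_runs)


def is_tie(tps_a, tps_b, threshold=TIE_THRESHOLD_TPS):
    """Two scores are a tie if they differ by less than the threshold."""
    return abs(tps_a - tps_b) < threshold


# Hypothetical repeated measurements of a single submission (not real data).
repeats = [246.0, 247.5, 248.0, 245.5]
sigma = run_to_run_sigma(repeats)

print(f"run-to-run sigma: {sigma:.2f} TPS")
print("247.0 vs 245.0 tie?", is_tie(247.0, 245.0))
print("247.0 vs 118.0 tie?", is_tie(247.0, 118.0))
```

A fixed threshold is the simplest possible rule; a real tool would more likely derive it from the measured run-to-run spread.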

Thomas Wolf

@Thom_Wolf

1 day ago

@smith42mike Love the meme (and the work)

0

3

0

2K

Thomas Wolf

@Thom_Wolf

1 day ago

people are sleeping on the mega-release happening every week in AI x Science on Hugging Face
this one is 80TB of astrophysics data - 80TB seriously
=> https://t.co/e4TGUeg61o

Georgia Channing

@cgeorgiaw

1 day ago

Seems like no one's noticed the 80TB of astrophysics data from 30+ sources that just dropped on @huggingface. ...and you only need ~4GB of RAM to load it. We're talking over 80TB of galaxy imagery taken across the spectrum, spectra of galaxies and stars, time series of variable stars, and a whole zoo of assorted measurements and physical data. And all of it can now be wrangled on your laptop, thanks to Multimodal Universe's just released cross-matching. SDSS x Gaia means you can match 800k objects against 122M objects, and it never climbs above ~4GB of RAM. Huge congrats to @smith42mike for leading this and making the world of astro accessible to probably 10,000x more people. Let's discover some shit

37

1K

204

1K

83K

5

41

11

19

8K

Thomas Wolf

@Thom_Wolf

1 day ago

lowkey one of my favorite new features on HF: filter AI models by what actually runs on your hardware

1

20

3

2

3K

Thomas Wolf

@Thom_Wolf

1 day ago

gm SF 🇺🇸

0

36

1

0

4K

Thom_Wolf retweeted

Alejandro AO 🤗

@_alejandroao

3 days ago

introducing tau τ — an educational agent harness that teaches you how to build agent harnesses i will be publishing tutorials and demos on how to use it to create your own TUIs, harnesses, extensions, etc. Happy Tau Day!! 🤓 👉 https://t.co/5sWxNtXTZP

72

2K

209

3K

312K

Thomas Wolf

@Thom_Wolf

4 days ago

btw, one of the best high-level reads I’ve seen all week. perfect for your Sunday morning ☕️

Azeem Azhar

@azeem

7 days ago

The GenAI economy has generated $110 billion in sales over the past 12 months. It is growing fast. On an annualized basis, the revenue run rate exceeds $175 billion. These numbers took us several months to construct, and as far as we know, it’s the first bottom-up, deduplicated measure of consumer and enterprise AI spending across the full stack. We are releasing this research today in our first The State of the AI Economy report. https://t.co/cJwZb0T99C

73

2K

376

2K

1M

7

78

8

99

42K

Thomas Wolf

@Thom_Wolf

5 days ago

@eliebakouch @lilianweng Yep, the difference is the learning rate schedule (Chinchilla used a full schedule with a learning rate adapted to the specific number of tokens, while the earlier published Kaplan results used intermediate checkpoints, so not a full decay)
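A rough illustration of the schedule point in this reply, assuming a cosine decay (the tweet does not name the schedule shape) and made-up learning rates and step counts: a run whose schedule is matched to its token budget ends fully decayed, while an intermediate checkpoint taken at the same step from a longer schedule is still at a much higher learning rate.

```python
import math


def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5):
    """Cosine decay from peak_lr to min_lr over total_steps (no warmup)."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


budget = 10_000  # hypothetical training budget, in steps

# Schedule matched to the budget: the run ends at the fully decayed rate.
matched = cosine_lr(budget, total_steps=budget)

# Intermediate checkpoint at the same step, taken from a 4x longer schedule.
intermediate = cosine_lr(budget, total_steps=4 * budget)

print(f"matched schedule, final LR:      {matched:.2e}")
print(f"intermediate checkpoint, LR now: {intermediate:.2e}")
```

The two models have seen the same number of steps but sit at very different points of their decay, so comparing their losses mixes the schedule effect into the scaling estimate.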

0

2

0

1

190

Thomas Wolf

@Thom_Wolf

6 days ago

@dawnsongtweets @HazyResearch Congrats Dawn!

0

2

0

2K

Thomas Wolf

@Thom_Wolf

6 days ago

@KarolCodes « most of the techniques used would be "esoteric" to experienced human researchers » => not really (yet) - I was hoping to see them more inventive

1

0

1K

Thomas Wolf

@Thom_Wolf

6 days ago

@superalesha You should have participated, Alexey, and shown these pretentious AI agents that humans still have a few tricks up their sleeves

1

4

0

2K

Thomas Wolf

@Thom_Wolf

6 days ago

@LLMJunky Yep this one is one of my favorites

1

6

0

1

2K

Thomas Wolf

@Thom_Wolf

6 days ago

@ricklamers Actually it was (CC/codex/opencode) agents collaborating to *improve* Gemma 4

0

1

0

79

Thomas Wolf

@Thom_Wolf

7 days ago

Bitrobot casually dropping the largest humanoid teleop dataset ever collected in real homes HIW-500: Humanoids-in-the-Wild 500 hours check it out here => https://t.co/BqioZWjDYN

BitRobot 🦾

@BitRobotNetwork

8 days ago

1/ Introducing HIW-500 (Humanoids-in-the-Wild 500): the largest open-source humanoid teleop dataset collected in real homes
Built w/ @UnitreeRobotics @huggingface across 12 homes in Southeast Asia, it covers:
> 500+ hrs
> 23K+ episodes
> 10+ TB
> 10+ household tasks

30

372

66

187

385K

5

28

5

10

8K

Thom_Wolf retweeted

Laura Bratton

@LauraBratton5

8 days ago

Scoop: @ClemDelangue and @Thom_Wolf told me @huggingface doubled paid subscribers to its open source model repository between January and June

1

23

6

0

9K

Thom_Wolf retweeted

Georgia Channing

@cgeorgiaw

9 days ago

The AI hunt for alien life has just begun. Welcome to ThousandWorlds, a wild new dataset from researchers at Oxford/Cambridge++, for detecting faint signatures in the atmospheres of potentially habitable exoplanets. This is the first step towards finding life beyond Earth. The plan is basically:
1) scan the galaxy for as many potentially habitable planets as possible
2) detect the gases in their atmospheres with powerful telescopes like JWST
3) infer from these gases whether life is present or not.
ThousandWorlds is a benchmark for emulating these exoplanet climates: 1760 simulations across 5 GCMs, 8 planet parameters, and atmospheric variables on a 32 x 64 x 10 latitude-longitude-pressure grid. It includes three nested benchmark subsets, two evaluation protocols, and eight released baseline methods. Incredible work from @MilesCranmer and many more 👽👽👽

10

246

38

110

18K

Thom_Wolf retweeted

LeRobot

@LeRobotHF

9 days ago

Have you thought about where all that physical AI data should live? 🤖 If you haven’t, 𝗶𝘁’𝘀 𝗮𝗹𝗿𝗲𝗮𝗱𝘆 𝗰𝗼𝘀𝘁𝗶𝗻𝗴 𝘆𝗼𝘂 𝗮 𝗹𝗼𝘁. Unoptimized storage, egress fees, and idle GPUs will drain your budget. Check out why & how to reduce your bill: https://t.co/MTaxFBt0Hs

0

57

7

42

9K