Xin Liu

5 months ago

A common heuristic in LLM agent design—"more agents is better"—might be wrong. Across 180 configurations, we find multi-agent coordination is task-contingent: +81% on parallelizable tasks (finance), but -70% on sequential ones (planning). Architecture-task alignment matters more than agent count.




6 months ago

Thanks to VentureBeat for covering our work!

VentureBeat

@VentureBeat

6 months ago

The AI industry hype says "more agents is all you need," but new data shows that strictly sequential tasks and tool-heavy integrations fail at scale. https://t.co/vLVUcMr2aU




6 months ago

Check out our latest work on scaling agents! Led by the amazing Yubin Kim!

elvis

@omarsar0

6 months ago

Major new research from Google and MIT.

"More agents is all you need" has become a mantra for AI developers. We know multi-agent systems can be effective, but we do this mostly based on heuristics.

The default approach to building complex AI systems today remains adding more agents, more coordination, more communication.

It would be helpful to have a more principled way to scale agentic systems.

This new research introduces the first quantitative scaling principles for agent systems, testing 180 configurations across three LLM families (OpenAI, Google, Anthropic) and four agentic benchmarks spanning financial reasoning, web navigation, game planning, and workflow execution.

The findings:

Multi-agent systems show an overall mean MAS improvement of -3.5% across all benchmarks, with massive variance ranging from +81% improvement to -70% degradation depending on task structure and architecture.

Three dominant effects emerge from the data:

The tool-coordination trade-off: tool-heavy tasks suffer disproportionately from multi-agent overhead. The efficiency penalty compounds as environmental complexity increases.

A task with 16 tools makes even the most efficient multi-agent architecture paradoxically less effective than a single agent.

The capability ceiling: once single-agent baselines exceed approximately 45% accuracy, coordination yields diminishing or negative returns. This is quantified as a statistically significant effect. Additional agents simply cannot overcome the coordination tax when baseline performance is already reasonable.

Architecture-dependent error amplification: independent multi-agent systems amplify errors 17.2x through unchecked propagation. Centralized coordination contains this to 4.4x via validation bottlenecks (these catch errors before propagation).

The presence or absence of inter-agent verification determines whether collaboration corrects or catastrophically compounds mistakes.

The performance heterogeneity is also interesting to look at:

- On parallelizable financial reasoning tasks, centralized multi-agent coordination achieves +80.9% improvement.

- On sequential planning tasks requiring constraint satisfaction, every multi-agent variant tested degraded performance by 39-70%.

- Decentralized coordination excels on dynamic web navigation (+9.2%) but provides essentially no benefit elsewhere.

The researchers derive a predictive model achieving cross-validated R^2 = 0.513 that correctly predicts the optimal architecture for 87% of held-out configurations. This model contains no dataset-specific parameters, enabling generalization to unseen task domains.

Overall, architecture-task alignment, not the number of agents, determines collaborative success. The research replaces heuristic guidance with quantitative principles: measure task decomposability, tool complexity, and baseline difficulty, then select a coordination structure accordingly.

Paper: https://t.co/6QY8rT15Pd
Learn to build effective AI agents in my academy: https://t.co/JBU5beIoD0
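
The closing advice in the thread (measure task decomposability, tool complexity, and baseline difficulty, then pick a coordination structure) can be sketched as a simple rule of thumb. This is only an illustration, not the paper's predictive model: the `TaskProfile` fields and the `choose_architecture` function are hypothetical names, and treating the quoted figures (about 45% baseline accuracy, 16 tools) as hard cutoffs is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical description of a task; the field names are illustrative."""
    parallelizable: bool          # can subtasks be solved independently?
    dynamic_environment: bool     # e.g. web navigation with changing state
    tool_count: int               # number of tools the agent must juggle
    single_agent_accuracy: float  # single-agent baseline accuracy in [0, 1]


# Figures quoted in the thread; using them as hard thresholds is an assumption.
BASELINE_CEILING = 0.45
TOOL_HEAVY = 16


def choose_architecture(task: TaskProfile) -> str:
    """Return a coarse architecture suggestion based on the thread's findings."""
    # Capability ceiling: a strong single-agent baseline leaves little to gain.
    if task.single_agent_accuracy > BASELINE_CEILING:
        return "single-agent"
    # Tool-coordination trade-off: tool-heavy tasks suffer from coordination overhead.
    if task.tool_count >= TOOL_HEAVY:
        return "single-agent"
    # Parallelizable work (e.g. financial reasoning) benefited from centralized coordination.
    if task.parallelizable:
        return "centralized"
    # Dynamic web navigation saw a modest gain from decentralized coordination.
    if task.dynamic_environment:
        return "decentralized"
    # Sequential planning degraded under every multi-agent variant tested.
    return "single-agent"
```

For example, `choose_architecture(TaskProfile(True, False, 4, 0.30))` suggests a centralized setup, while a sequential task with the same baseline falls back to a single agent.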



xliucs retweeted

Akshay Paruchuri

@Yahskapar

7 months ago

Our work is a start toward better analysis and benchmarking of LLMs on real-world health information-seeking conversational dynamics, not just static Q&A or synthetic, simulated conversational datasets. This was an exciting collaboration with @MonicaNAgrawal at @DukeU alongside the rest of the Duke team consisting of Maryam Aziz, Rohit Vartak, Ayman Ali, and Best Uchehara, as well as @xliucs and Ishan Chatterjee at @UW. 👉 Check out the full paper for more details! 📜 Paper: https://t.co/iO7MDuD1Ul 💾 Dataset: https://t.co/tknqha0YGx 🧑🏿‍💻 Code: https://t.co/QZPnBMp1JX


xliucs retweeted

Ken Gu @kenqgu

8 months ago

True intelligence = reasoning about new information, not memorized facts. How can we scalably create benchmarks that are completely novel yet have known answers? Meet SynthWorlds, an eval & data-gen framework to disentangle reasoning and knowledge⬇️🧵 📄https://t.co/ITwP4YdtDG



9 months ago

✨ I’m thrilled to share our latest research on building personal health agents at Google! I believe it will pave the way for the future of AI-driven personal health. Please check it out! 🚀

9 months ago

Learn about our research prototype LLM-powered personal health agent that analyzes various data modalities, including data from wearable devices, to offer evidence-based health insights and to provide a personalized coaching experience. Read more →https://t.co/WcHpasRidz


xliucs retweeted

Yossi Matias

@ymatias

10 months ago

New paper in Nature Medicine introduces PH-LLM, a Gemini-based model for personalized health. It integrates wearable device data to provide insights and coaching for sleep and fitness. The model exceeded human expert scores on professional exams and performed on par with experts on real-world case studies. https://t.co/GYfYcD6RQ4


xliucs retweeted

11 months ago

Let your wearable data "speak" for itself! Introducing SensorLM, a family of sensor-language foundation models trained on ~60 million hours of data, enabling robust wearable data understanding with natural language. → https://t.co/1vL6df5pMa


11 months ago

We just released a paper on our second-generation Large Sensor Model! Check it out!

11 months ago

Introducing LSM-2, our newest foundation model for wearable sensor data. LSM-2 uses Adaptive & Inherited Masking, a novel self-supervised framework, to learn from incomplete data & achieve strong performance without requiring explicit imputation. More → https://t.co/jeMvzVupZg



12 months ago

Check out our latest work on learning the language of wearable sensors.

Yuzhe Yang

@yang_yuzhe

12 months ago

🚨 Let your wearable data "speak" for themselves! ⌚️🗣️ Introducing *SensorLM*, a family of sensor-language foundation models, trained on ~60 million hours of data from >103K people, enabling robust wearable sensor data understanding with natural language. 🧵



about 1 year ago

Check out the latest work led by my amazing intern @kenqgu. We systematically studied whether LLMs are truly ready for autonomous data science tasks.

Ken Gu @kenqgu

about 1 year ago

🚨Are LLMs truly ready for autonomous data science? Real-world data is messy—missing values, outliers, inconsistencies—and if not handled properly, can lead to wrong conclusions. 🌟We introduce RADAR, a benchmark evaluating whether LLMs can handle imperfect tabular data. 🧵



xliucs retweeted

Allen School @uwcse

about 1 year ago

Congratulations to @UW #UWAllen Ph.D. grads @sharma_ashish_2 & @sewon__min, @TheOfficialACM Doctoral Dissertation Award honorees! Sharma won for #AI tools for mental health; Min received honorable mention for efficient, flexible language models. #ThisIsUW https://t.co/R2b1r3wxUP


over 1 year ago

Check out the latest agent planning work!

Mihir Parmar

@Mihir3009

over 1 year ago

🎉 Excited to share that our new paper, "PlanGEN", is now out! 🎉 💡PlanGEN is a model-agnostic and easily scalable multi-agent framework utilizing inference-time algorithms designed to generate natural planning and reasoning trajectories to solve complex tasks. 📊PlanGEN shows improvement on challenging benchmarks including NATURAL PLAN, OlympiadBench, DocFinQA, and GPQA. Please check out our full paper @ https://t.co/s75BOkUVGw #NLProc #LLMs #Planning #Reasoning #Agents #AI (1/5)



over 1 year ago

Check out our latest benchmark!

Mehran Kazemi @kazemi_sm

over 1 year ago

Is BIG-Bench Hard too easy for your LLM? We just unleashed BIG-Bench EXTRA Hard (BBEH)! 😈 Every task, harder! Every model, humbled! (Poem Credit: Gemini 2.0 Flash) Massive headroom for progress across various areas in general reasoning 🤯


