Minlan Yu @MinlanYu - Twitter Profile

2 months ago

If you use multi-agent workflows, try out Orla: it's easy to integrate with your current system and will help you save cost and latency.

Rana Shahout @rana_shahout

2 months ago

#agentic systems have complex workflows, not just a single LLM call. That’s why we built #Orla. You define the workflow (e.g., with LangChain) and your providers (e.g., closed models or your hosted models). #Orla separates workflow management from prompt and query flows.

minlanyu retweeted

Chutes @chutes_ai

3 months ago

We have some big news to share today. Chutes is partnering with a research team from Harvard University to push the boundaries of AI inference efficiency. The team at Harvard, led by Professor Juncheng Yang @1a1a11a, is developing a new prefix caching algorithm designed to significantly accelerate inference while reducing hardware usage.

minlanyu retweeted

Chutes @chutes_ai

3 months ago

if you missed it, we're running a research collab with Harvard right now. you can opt in and get 25% off your inference costs.

all you have to do is switch your endpoint: https://t.co/wGa4hYvmTm → https://t.co/SQZ09aZyR8

same models, same API, nothing else changes. you just pay less.

your data goes to Harvard's team to help build a caching algorithm that'll make inference faster and cheaper across the whole platform once it ships.

just know that your prompts and responses are recorded on this endpoint, so keep anything sensitive on https://t.co/wGa4hYvmTm like normal.

it's live now and already working.

Minlan Yu @minlanyu

4 months ago

RT @ZhentingQi: New CCA version + SWE task runner released! Paper: https://t.co/pvFXeD1XJj Code: https://t.co/q3aOTWRKaC SWE-Bench-Pro re…

minlanyu retweeted

Zhenting Qi @ZhentingQi

5 months ago

Agent scaffolding matters as much as, or even more than, raw model capability for hard agentic tasks. In our latest research with @Meta, we show that carefully designed scaffolding achieves 54.3% (Claude Opus) and 52.7% (Claude Sonnet) on SWE-Bench-Pro, compared to Claude Opus's 52.0% result under a proprietary scaffold @claudeai.

Minlan Yu @minlanyu

7 months ago

Congratulations to @laochonlam on receiving a Google PhD Fellowship this year! https://t.co/YXLXDWgiNF @hseas

minlanyu retweeted

Rana Shahout @rana_shahout

9 months ago

New paper at #Neurips2025! Compound AI systems don’t rely on a single model: they connect LLMs with tools, plugins, and APIs. But this creates chaos:
- A “short” request can stall forever if the API is slow.
- A “long” request might finish fast if the tool answers instantly.

minlanyu retweeted

Francis Y. Yan @FrancisYan_

12 months ago

🚀 [OSDI ’25, Tue 11:10am] How do you “divide and conquer” large-scale resource allocation problems like GPU cluster scheduling or WAN traffic engineering? Our answer: “decouple and decompose” the underlying optimization using DeDe. (1/3)

minlanyu retweeted

Princeton Computer Science @PrincetonCS

about 1 year ago

Congrats to Kai Li on being named a member of the American Academy of Arts & Sciences! 🎉 Li joined @Princeton in 1986 and has made important contributions to several research areas in computer science. https://t.co/gYVjWnHX8J

minlanyu retweeted

Ayush Noori @ayushnoori

about 1 year ago

We are presenting “Prefix and output length-aware scheduling for efficient online LLM inference” at the ICLR 2025 (@iclr_conf) Sparsity in LLMs workshop (@sparseLLMs).

🪫 Challenge: LLM inference in data centers benefits from data parallelism. How can we exploit patterns in requests – like shared prefixes and variable decode length – to optimally assign requests to GPU workers?

💡 Idea: both prefix and output length-aware scheduling!

We build on Preble (ICML 2025, @vikranth22446, @yiying__zhang), which was the first distributed LLM serving system to exploit prompt sharing (see https://t.co/BpHl8PbbfQ).

In our proof-of-concept work, we carefully benchmark Preble vs. prefix-unaware schedulers to identify opportunities for performance improvement.

⭐️ By adding output length-aware scheduling to Preble, we reduce latency by 14.31% at 64 RPS and 28.89% at 128 RPS. ⭐️

📖 Full paper here: https://t.co/1iW6hlfK7g

Thank you to co-authors @InakiArango, @YepHuang, @rana_shahout, and @minlanyu at @hseas. Thank you also to the Preble authors for their groundbreaking work!
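To make the idea in this tweet concrete, here is a minimal sketch of assigning requests to workers using both a shared-prefix signal and a predicted output length. It is an illustration under stated assumptions, not the paper's or Preble's algorithm: the `Worker` class, the `assign` function, the cost formula (uncached prompt tokens plus pending predicted decode tokens, weighted equally), and the predicted lengths are all hypothetical.

```python
from dataclasses import dataclass, field


def shared_prefix_len(a, b):
    """Number of leading tokens that two token lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Worker:
    # Hypothetical model of a GPU worker; not taken from any real system.
    name: str
    cached_prompts: list = field(default_factory=list)  # prompts seen so far
    pending_decode_tokens: int = 0  # sum of predicted output lengths assigned

    def best_prefix_hit(self, prompt):
        """Longest prefix of `prompt` shared with any previously seen prompt."""
        return max(
            (shared_prefix_len(prompt, p) for p in self.cached_prompts),
            default=0,
        )


def assign(workers, prompt, predicted_output_len):
    """Pick the worker with the lowest estimated cost and record the request.

    Assumed cost: prompt tokens not covered by a cached prefix, plus the
    predicted decode work already assigned to that worker.
    """
    def cost(w):
        uncached = len(prompt) - w.best_prefix_hit(prompt)
        return uncached + w.pending_decode_tokens

    chosen = min(workers, key=cost)  # ties go to the earlier worker
    chosen.cached_prompts.append(list(prompt))
    chosen.pending_decode_tokens += predicted_output_len
    return chosen.name
```

In this sketch a request follows its cached prefix while the target worker is lightly loaded, and moves elsewhere once that worker has accumulated a lot of predicted decode work.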

Minlan Yu @minlanyu

about 1 year ago

For data center operators, this course will help explore scheduling strategies that allow faster integration into the grid, as well as faster construction of data centers.

Minlan Yu @minlanyu

about 1 year ago

This course has two parts: power systems and AI data center systems. For data center device vendors, this course will help them understand and increase the value of their design knobs for data centers in the energy market.

Minlan Yu @minlanyu

about 1 year ago

Excited to co-lead with @Le_Xie_Energy a new short course at Harvard @hseas on May 21: Power Systems and AI: An Introduction. We'll explore how AI + systems thinking can drive more sustainable, efficient datacenter and grid operations. Join us in Allston. https://t.co/v2DV6PGhJ6

minlanyu retweeted

Alan Liu 🏍️ @Alan_Lau

over 1 year ago

📢 If your autonomous systems use OctoMap for 3D mapping, stay tuned for OctoCache, which accelerates OctoMap by up to 3.0× w/o a GPU. It was developed by my student @Wilhelm_Chen and an awesome team (Minhao, Zishen, Yushun), w/ @minlanyu and @profvjreddi, and will appear at ASPLOS'25!

minlanyu retweeted

Rana Shahout @rana_shahout

over 1 year ago

New paper at #ICLR2025! Fast LLM inference = smart scheduling 🕒 but size-based scheduling (prioritizing short requests over long ones) requires knowing request sizes, a challenging task in LLM systems. So, how can we predict request sizes accurately? 🔗 https://t.co/VldoHSnozC
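As a rough illustration of the size-based scheduling mentioned in this tweet (and not the paper's method), the sketch below serves requests in order of predicted output length, shortest first. The `SizeBasedQueue` class, the `drain` helper, and the predicted token counts are hypothetical; a real system would need a predictor to supply those sizes, which is the hard part the tweet points to.

```python
import heapq
import itertools


class SizeBasedQueue:
    """Serve requests in order of predicted output length, shortest first."""

    def __init__(self):
        self._heap = []
        # Arrival counter breaks ties so equal-size requests stay FIFO.
        self._arrival = itertools.count()

    def push(self, request_id, predicted_tokens):
        entry = (predicted_tokens, next(self._arrival), request_id)
        heapq.heappush(self._heap, entry)

    def pop(self):
        _, _, request_id = heapq.heappop(self._heap)
        return request_id

    def __len__(self):
        return len(self._heap)


def drain(requests):
    """Enqueue (request_id, predicted_tokens) pairs and return service order."""
    queue = SizeBasedQueue()
    for request_id, predicted_tokens in requests:
        queue.push(request_id, predicted_tokens)
    return [queue.pop() for _ in range(len(queue))]
```

If the predictions are wrong, a long request labeled as short can still block the queue, which is why prediction accuracy matters for this kind of policy.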

minlanyu retweeted

Yang Zhou @yangzhouy

over 1 year ago

It is graduate application season (and the deadline is only one week away)! Come join me at UC Davis CS to research ML systems and networked systems, and enjoy the beautiful landscapes around the Davis campus (Lake Tahoe, Napa)! Learn more at https://t.co/CaapdoF0lQ! @UCDavisCOE @ucdavis

Minlan Yu @minlanyu

over 1 year ago

Harvard CS is hiring this year: https://t.co/x66lUJII2s

minlanyu retweeted

Gianni Antichi @get_gianni_up

almost 2 years ago

"F3: Fast and Flexible Network Telemetry with an FPGA coprocessor" got into ACM CoNEXT! We show how an FPGA placed alongside the switching ASIC enables flexible network monitoring through partial reconfiguration! Work with folks at Harvard, Alibaba, UCL, Purdue and Meta.

minlanyu retweeted

Muhammad Shahbaz @msbaz2013

about 2 years ago

We at the ACE Center (@ace_computing) shared our vision with the Computer Architecture community. Please give it a read! https://t.co/vhTprgBFwV

minlanyu retweeted

Francis Y. Yan @FrancisYan_

about 2 years ago

Grace Liu, @yangzhouy, and I are co-chairing the SIGCOMM 2024 Artifact Evaluation. We are looking for PhD students, postdocs, and early-career researchers to join the committee! Underrepresented groups are strongly encouraged to apply.
- Application deadline: June 12, 2024 (AoE)
- Self-nomination form: https://t.co/dFFLZsj6DL
