Stephan Xie @stephofx - Twitter Profile

Pinned Tweet

about 1 month ago

How well do AI systems (LLMs, VLMs, time series FMs) answer questions about time series data📈? On ARFBench, the best models achieve ~63% accuracy on real incident data. But models and human experts fail in different areas: combining them achieves 87% accuracy. 🧵1/

https://t.co/pNN2MLphyF

1

41

15

5

4K

Stephan Xie @stephofx

about 1 month ago

Check out our blog post on ARFBench here!

ML@CMU @mlcmublog

about 1 month ago

https://t.co/GZO4Z8jhFn How good are AI systems at time-series Q&A? On ARFBench, top models hit ~63% on real incident data. But they miss different things than humans; combine both and accuracy jumps to 87%. Read more in our latest blog post!

1

12

4

2

2K

0

9

0

2

979

Stephan Xie @stephofx

about 1 month ago

10/
Blog post: https://t.co/D8qv9iAv1S
x-listed: https://t.co/6OyAXNrkS8
Paper: https://t.co/5TuLbo5RMC
Dataset+Model+Leaderboard: https://t.co/4GRO92mmdI

0

2

0

1

98

Stephan Xie @stephofx

about 1 month ago

9/ This was a very fun and insightful collaboration between Datadog AI Research and collaborators at CMU, including Ben, @MononitoGoswami, @JunhongShen1, Emaad, Chenghao, David, @ThisIsOthmane, and my advisor @atalwalkar.

1

3

0

170

stephofx retweeted

maxwell jones @maxwell54650346

3 months ago

Video Editing is great - but what if you want to apply an effect to your input video described by another video?? Introducing RefVFX, the first method that takes in both an input video and a reference effect video for generative video editing!

6

116

23

69

21K

stephofx retweeted

Fahim Tajwar @FahimTajwar10

4 months ago

Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective. Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings. Paper + code + project website: https://t.co/j9BCBF7K3R 🧵 1/n

14

806

161

728

208K

stephofx retweeted

Sang Michael Xie

@sangmichaelxie

4 months ago

Excited to release PrefixRL, where we achieved what I thought to be a contradiction - learning from off-policy data with purely on-policy updates. This avoids all the instabilities of off-policy RL. I think this will let us reuse previous RL and sampling FLOPs much more efficiently in the future - just check out PrefixRL’s 2x compute efficiency gain and huge plateau increase over SFT then RL. https://t.co/MYdhEEpx61

2

192

26

175

23K

stephofx retweeted

Yuda Song @yus167

4 months ago

RL on LLMs inefficiently uses one scalar per rollout. But users regularly give much richer feedback: "make it formal," "step 3 is wrong." Can we train LLMs on this human-AI interaction? We introduce RL from Text Feedback, with 1) Self-Distillation; 2) Feedback Modeling (1/n) 🧵

https://t.co/i8ncPFKq70

14

598

102

495

107K

stephofx retweeted

Valerie Chen

@valeriechen_

7 months ago

Understanding how humans fit into agent workflows is essential, but we still lack concrete ways to measure collaboration. Our Collaborative Effort Scaling framework introduces metrics grounded in real-world studies and simulations. More details below👇

1

24

4

3

4K

stephofx retweeted

Yuda Song @yus167

8 months ago

🤖 Robots rarely see the true world's state—they operate on partial, noisy visual observations. How should we design algorithms under this partial observability? Should we decide (end-to-end RL) or distill (from a privileged expert)? We study this trade-off in locomotion. 🧵(1/n)

https://t.co/IEWVGrPsOx

2

141

40

66

31K

stephofx retweeted

Emily Byun

@yewonbyun_

8 months ago

💡Can we trust synthetic data for statistical inference? We show that synthetic data (e.g. LLM simulations) can significantly improve the performance of inference tasks. The key intuition lies in the interactions between the moments of synthetic data and those of real data

https://t.co/WiHkcR0GmO

2

144

36

84

31K

Stephan Xie @stephofx

about 1 year ago

@steph_milani @JHUCompSci @NYU_Courant Huge congrats Steph!! Super exciting!!

1

0

95

Stephan Xie @stephofx

about 1 year ago

Hard to overstate how important observability data is in forecasting! The complex nature of the data led to huge challenges in even evaluating time series models but also helped us make Toto super capable. Excited to share this work led by Ben and Emaad at Datadog AI Research!

Ameet Talwalkar

@atalwalkar

about 1 year ago

I’m excited to share new work from Datadog AI Research! We just released Toto, a new SOTA (by a wide margin!) time series foundation model, and BOOM, the largest benchmark of observability metrics. Both are available under the Apache 2.0 license. 🧵

https://t.co/vrDSadHdQz

5

242

52

213

38K

0

14

4

1

1K
