Ryan Shar @RyanShar01 - Twitter Profile

about 1 month ago

I will be presenting EDIT-Bench as an Oral at ICLR on Friday 4/23! Session 4D starts at 3:15 and the talk is at 3:39. We will also be at poster session 3 in the morning. See you all there!

0

32

8

0

4K

RyanShar01 retweeted

Wayne Chi

@iamwaynechi

4 months ago

New preprint alert 🚨 Can LLM agents develop video games? We release GameDevBench, the first benchmark evaluating agentic game development in a game engine, Godot. We also present two simple multimodal feedback mechanisms that lead to immediate performance gains. /🧵

19

254

27

179

26K

RyanShar01 retweeted

Wayne Chi

@iamwaynechi

7 months ago

Tired of evaluating LLMs on made-up problems that look nothing like real tasks? Introducing EDIT-Bench, a code editing benchmark built from in-the-wild user interactions in VSCode. Real-world edits are challenging: 𝗼𝗻𝗹𝘆 𝟭/𝟰𝟬 𝗺𝗼𝗱𝗲𝗹𝘀 𝘀𝗰𝗼𝗿𝗲 > 𝟲𝟬% 𝗽𝗮𝘀𝘀@𝟭.

https://t.co/QIF6NtE3Lt

2

42

12

10

15K

RyanShar01 retweeted

Ameet Talwalkar

@atalwalkar

about 1 year ago

I’m excited to share new work from Datadog AI Research! We just released Toto, a new SOTA (by a wide margin!) time series foundation model, and BOOM, the largest benchmark of observability metrics. Both are available under the Apache 2.0 license. 🧵

https://t.co/vrDSadHdQz

5

242

52

213

38K

RyanShar01 retweeted

Valerie Chen

@valeriechen_

about 1 year ago

Blog post on @CopilotArena out now!

0

15

2

0

509

RyanShar01 retweeted

Wayne Chi

@iamwaynechi

over 1 year ago

What do developers 𝘳𝘦𝘢𝘭𝘭𝘺 think of AI coding assistants? In October, we launched @CopilotArena to collect user preferences on real dev workflows. After months of live service, we’re here to share our findings in our recent preprint. Here's what we have learned /🧵

https://t.co/mZIsMOY8Fe

3

160

32

125

71K

RyanShar01 retweeted

Jane Pan @JanePan_

over 1 year ago

When benchmarks talk, do LLMs listen? Our new paper shows that evaluating code LLMs with interactive feedback significantly affects model performance compared to standard static benchmarks! Work w/ @RyanShar01, @jacob_pfau, @atalwalkar, @hhexiy, and @valeriechen_! [1/6]

https://t.co/OYtuGYYpiq

2

54

15

14

11K

RyanShar01 retweeted

Misha Khodak @khodakmoments

over 1 year ago

🧵 on surprising revelations from our study of specialized foundation models (FMs beyond vision/text): after evaluating dozens of scientific & time series FMs we found that most weren’t even competitive with simple supervised models, some with as little as 513 parameters. 1/n

https://t.co/qIfYFBqyCF

3

243

62

186

43K

RyanShar01 retweeted

Arena.ai

@arena

over 1 year ago

Which model is best for coding? @CopilotArena leaderboard is out! Our code completions leaderboard contains data collected over the last month, with >100K completions served and >10K votes! Let’s discuss our findings so far🧵

https://t.co/gBJ8qXiTIy

17

531

77

192

136K
