I will be presenting EDIT-Bench as an Oral at ICLR on Friday 4/23! Session 4D starts at 3:15 and the talk is at 3:39.
We will also be at poster session 3 in the morning.
See you all there!
New preprint alert 🚨
Can LLM agents develop video games?
We release GameDevBench, the first benchmark evaluating agentic game development in a game engine, Godot.
We also present two simple multimodal feedback mechanisms that lead to immediate performance gains.
/🧵
Tired of evaluating LLMs on made-up problems that look nothing like real tasks?
Introducing EDIT-Bench, a code editing benchmark built from in-the-wild user interactions in VSCode.
Real-world edits are challenging: only 1/40 models score > 60% pass@1.
I’m excited to share new work from Datadog AI Research! We just released Toto, a new SOTA (by a wide margin!) time series foundation model, and BOOM, the largest benchmark of observability metrics. Both are available under the Apache 2.0 license. 🧵
What do developers really think of AI coding assistants?
In October, we launched @CopilotArena to collect user preferences on real dev workflows. After months of live service, we’re here to share our findings in our recent preprint.
Here's what we have learned /🧵
When benchmarks talk, do LLMs listen?
Our new paper shows that evaluating code LLMs with interactive feedback significantly affects model performance compared to standard static benchmarks!
Work w/ @RyanShar01, @jacob_pfau, @atalwalkar, @hhexiy, and @valeriechen_!
[1/6]
🧵 on surprising revelations from our study of specialized foundation models (FMs beyond vision/text): after evaluating dozens of scientific & time series FMs we found that most weren’t even competitive with simple supervised models, some with as little as 513 parameters.
1/n
Which model is best for coding? @CopilotArena leaderboard is out!
Our code completions leaderboard contains data collected over the last month, with >100K completions served and >10K votes!
Let’s discuss our findings so far 🧵