Ryan Smith

about 2 months ago

It's pretty wild to think that a year ago CLI-first code tools didn't really exist, and now here we are measuring the entire E2E effectiveness of LLMs in software development

about 2 months ago

The software factory is already here. We're seeing bots write code, bots review it, and humans reduced to dispatching the next tool in the chain. Using 500k+ PRs from Code Review Bench, we looked at one question: can the human leave the loop yet?

about 2 months ago

@Narmeen29013644 Incredible work Narmeen!

about 2 months ago

I've been so excited to see Mech Interp techniques actually generalize to improving real-world applications, and it feels like we're finally at that point!

about 2 months ago

We’re open-sourcing a new tool to control how LLMs behave: k-steering. In just 10 lines of code, you can control multiple aspects of LLM behavior at the same time without any fine-tuning or prompt engineering. Here's how 👇

rnsmith49 retweeted

CodeRabbit

@coderabbitai

2 months ago

Open Source Models you have been sleeping on! https://t.co/xmEDm9bFbq

2 months ago

@withmartian Humans in this benchmark are built better than me, because I am definitely losing vs a crocodile

2 months ago

I unironically love the Lotka-Volterra equations, so yes I may spend my Wednesday morning diving into the code of an April Fools’ post

2 months ago

Traditional Precision-Recall curves tell you how your code review tool performs on static benchmarks. They don't tell you how it performs against a Hawk. Introducing Fight Index (FI).

3 months ago

@mwilliammyers Find out next time... on CRB v0.4!

3 months ago

Personally, I love the shift in the industry of optimizing for two separate workflows - fast and iterative, vs slow and offline. And I'm always a sucker for seeing data back up intuition 😅

Ashley Zhang @AshleyZhang110

3 months ago

We've been tracking AI code review tools across OSS, and a new category is emerging. We're calling it "Deep Review": → Standard AI review: PR-level, fast, human in the loop → Deep Review: repo-wide context, runs autonomously in the background 🧵👇

rnsmith49 retweeted

Fazl Barez @FazlBarez

3 months ago

If this policy is not revoked, I won’t be reviewing/ACing for #NeurIPS. Science requires open exchange of ideas! When participation gets shaped by geopolitics, it ends up reflecting power structures, not merit. That narrows what science can be and gives powerful nations full control!

rnsmith49 retweeted

3 months ago

I've been playing around with eval-ing AI code review tools at work. We track 22 different ones. @greptile V4 had the single biggest improvement I've ever measured. Recall increased 47% in relative terms, from 38.7% → 56.9%
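The 47% figure is a relative gain; in absolute terms the jump is 18.2 percentage points. A minimal sketch of that arithmetic, using only the two recall values quoted in the tweet (the function and variable names are illustrative, not from any tracker codebase):

```python
def absolute_gain(before: float, after: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Percent change measured against the starting value."""
    return (after - before) / before * 100.0


# Recall values quoted in the tweet, in percent.
recall_before = 38.7
recall_after = 56.9

abs_pp = absolute_gain(recall_before, recall_after)
rel_pct = relative_gain(recall_before, recall_after)

print(f"absolute: {abs_pp:.1f} points, relative: {rel_pct:.0f}%")
```

Quoting both numbers avoids the common confusion between "47% better" and "47 points better".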

rnsmith49 retweeted

Shriyash Upadhyay

@shriyashku

3 months ago

Our code review tracker caught the release of Claude Code Review before @AnthropicAI announced it. Greptile v4 hit #1 on CRB. The tracker caught it before their announcement. The data is predicting something new from @Devin Review in the near future. Here's how. 🧵

3 months ago

Coding tools are getting way better, but also way more convincing to humans (or me at least). We definitely need to continue building robust evals for code review so we know we are keeping the "LGTM effect" in check

3 months ago

How good is Claude Code Review really, and is it worth $25+ per review? We scraped every OSS repo on GitHub that's using it to figure out how devs actually use it. Here's how it stacks up against 22 other tools: https://t.co/iAZDURyqol Featuring: @augmentcode @baz_scm @CodeAntAI @coderabbitai @cognition @cubic_dev_ @cursor_ai @GeminiApp @greptile @kilocode @kodustech @mesa_dot_dev @QodoAI

3 months ago

@joshgreaves_ml frfr f(or)r(ust) when?

3 months ago

Verification is easier than Generation - the same is true for Code, and I am really excited to see the pipeline of: Code Review getting more robust evals → better code review tools → better coding tools

3 months ago

Introducing Code Review Bench v0: https://t.co/iAZDURyqol The first independent code review benchmark. 200,000+ PRs. Unbiased. Fully OSS. Updated daily. Tool performance highlights 🧵👇 Featuring: @augmentcode @baz_scm @claudeai @coderabbitai @cursor @GeminiApp @github @graphite @greptile @kilocode @OpenAIDevs @propelcode @QodoAI

rnsmith49 retweeted

4 months ago

A new ARES tutorial from @Narmeen29013644: Getting started in long-horizon interp. When do agents fail to accurately model their environment? How do we fix them? And how can you run these experiments on your own machine?

4 months ago

@alexML @withmartian My favorite meetings are the ones where by step 2 you can tell you're allowed to tune out though

Narmeen Oozeer @Narmeen29013644

4 months ago

This is one of the biggest takeaways for me - model internals change over steps after interacting with the environment! I love this other figure @Narmeen29013644 made that shows this too - a matrix of cosine similarity between optimal steering vectors at each step:

[Image: matrix of cosine similarity between optimal steering vectors at each step] https://t.co/fzSbePlMjY
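For anyone who wants to build that kind of figure themselves, here is a rough sketch of a pairwise cosine similarity matrix over per-step vectors. The three toy vectors are made up for illustration and are not data from the tutorial; real steering vectors would come from model activations and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Entry [i][j] is the cosine similarity between vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical per-step vectors (assumption: one vector per agent step).
step_vectors = [
    [1.0, 0.0, 0.0],  # step 0
    [0.8, 0.6, 0.0],  # step 1
    [0.0, 1.0, 0.0],  # step 2
]

matrix = similarity_matrix(step_vectors)
for row in matrix:
    print(" ".join(f"{value:5.2f}" for value in row))
```

Off-diagonal values that fall as the step gap grows are what "the direction drifts over the episode" looks like in this view.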

4 months ago

But you can't just compute one steering vector and reuse it for the whole episode. The representation of "valid vs. invalid" drifts as the conversation goes on; per-step vectors outperform a single static one. PCA on the vectors at different time steps shows they point in genuinely different directions.

