Shriyash Upadhyay

about 2 months ago

@coderabbitai @withmartian Had a lot of fun on the podcast!

about 2 months ago

Awesome work by @AshleyZhang110 on how we can use code review to measure the progress of the software factory. Short version: what kinds of repos are seeing automatic generation and review of code? Just individuals? Teams working on production software? Can humans step away?

about 2 months ago

The software factory is already here. We're seeing bots write code, bots review it, and humans reduced to dispatching the next tool in the chain. Using 500k+ PRs from Code Review Bench, we looked at one question: can the human leave the loop yet?

about 2 months ago

Based on a cool paper: https://t.co/CqBrl3d8s5

about 2 months ago

We’re open-sourcing a new tool to control how LLMs behave: k-steering. In just 10 lines of code, you can control multiple aspects of LLM behavior at the same time without any fine-tuning or prompt engineering. Here's how 👇

2 months ago

OOD Evals

2 months ago

Traditional Precision-Recall curves tell you how your code review tool performs on static benchmarks. They don't tell you how it performs against a Hawk. Introducing Fight Index (FI).

shriyashku retweeted

Cognition @cognition

3 months ago

We're happy to announce our collaboration with @withmartian on Code Review Bench v0.3, with a focus on the tradeoffs between precision and latency.

3 months ago

@alexML @withmartian I like to think of myself as a *very deep* review agent

3 months ago

First came codegen, now code review. Every product category will have background agents. Tools in most fields talk about augmenting humans, but that’s a bad design pattern: it encourages humans to be the bottleneck. Things will just happen in the background, automatically.

3 months ago

We've been tracking AI code review tools across OSS, and a new category is emerging. We're calling it "Deep Review": → Standard AI review: PR-level, fast, human in the loop → Deep Review: repo-wide context, runs autonomously in the background 🧵👇

3 months ago

@ryan_tech_lab @withmartian @augmentcode @baz_scm Right now it's "an LLM reads through and determines the severity once we know there's an issue". But that's where the calibration comes in

3 months ago

@Piyushkumar420 @withmartian @augmentcode @baz_scm The original blog post is also good: https://t.co/RXtu9vpXcG

3 months ago

@Piyushkumar420 @withmartian @augmentcode @baz_scm You should read the methodology here: https://t.co/QsqrG4M9dc

Ashley Zhang @AshleyZhang110

3 months ago

@ryan_tech_lab @withmartian @augmentcode @baz_scm This is a great point! We actually do have severity labels on the data too, which you can play around with. Here are the plots for critical bugs: https://t.co/0GxSI0I6mW We still want to calibrate our severity classifier more carefully though, so take this as directional.

shriyashku retweeted

3 months ago

I've been playing around with evaluating AI code review tools at work. We track 22 different ones. @greptile V4 had the single biggest improvement I've ever measured. Recall increased 47% in relative terms, from 38.7% to 56.9%.
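A quick check of the arithmetic in the post above, as a minimal sketch: the two recall values are taken from the post, while the helper name `relative_increase` is hypothetical and introduced only for illustration. It shows that the 47% figure is a relative increase; the absolute gain is about 18 percentage points.

```python
def relative_increase(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# Recall values quoted in the post, in percent.
recall_before = 38.7
recall_after = 56.9

# Absolute gain is measured in percentage points; relative gain in percent.
absolute_gain = recall_after - recall_before
relative_gain = relative_increase(recall_before, recall_after)

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative gain: {relative_gain:.0f}%")
```

Keeping the two quantities separate avoids reading "47%" as a 47-point jump in recall.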

3 months ago

@alexML @AnthropicAI @devin Damn, should have thought of that. I'm a @Kalshi man myself tho

3 months ago

Our code review tracker caught the release of Claude Code Review before @AnthropicAI announced it. Greptile v4 hit #1 on CRB. The tracker caught it before their announcement. The data is predicting something new from @Devin Review in the near future. Here's how. 🧵

3 months ago

How good is Claude Code Review really, and is it worth $25+ per review? We scraped every OSS repo on GitHub that's using it to figure out how devs actually use it. Here's how it stacks up against 22 other tools: https://t.co/iAZDURyqol Featuring: @augmentcode @baz_scm @CodeAntAI @coderabbitai @cognition @cubic_dev_ @cursor_ai @GeminiApp @greptile @kilocode @kodustech @mesa_dot_dev @QodoAI

3 months ago

Check out the tracker here: https://t.co/Jhh9CH6ots
