Safety is being subsumed to capitalism because when there is misalignment between the two, capitalism wins.
The only way to make sure AI is safe is to make a strong capitalist case for the technologies that will make AI safe: creating an economic incentive to understand models.
Awesome work by @AshleyZhang110 on how we can use code review to measure the progress of the software factory.
Short version: what kinds of repos are seeing automatic generation and review of code? Just individuals? Teams working on production software?
Can humans step away?
The software factory is already here.
We're seeing bots write code, bots review it, and humans reduced to dispatching the next tool in the chain.
Using 500k+ PRs from Code Review Bench, we looked at one question: can the human leave the loop yet?
Weβre open-sourcing a new tool to control how LLMs behave: k-steering. In just 10 lines of code, you can control multiple aspects of LLM behavior at the same time without any fine-tuning or prompt engineering.
Here's how π
Traditional Precision-Recall curves tell you how your code review tool performs on static benchmarks.
They don't tell you how it performs against a Hawk.
Introducing Fight Index (FI).
First was codegen, now code review.
Every product category will have background agents.
Tools in most fields talk about augmenting humans, but thatβs a bad design pattern. It encourages humans to be the bottleneck. Things will just happen in the background, automatically
We've been tracking AI code review tools across OSS, and a new category is emerging. We're calling it "Deep Review":
β Standard AI review: PR-level, fast, human in the loop
β Deep Review: repo-wide context, runs autonomously in the background
π§΅π
@ryan_tech_lab@withmartian@augmentcode@baz_scm Right now it's "an LLM reads through and determines the severity once we know there's an issue". But that's where the calibration comes in
@ryan_tech_lab@withmartian@augmentcode@baz_scm This is a great point! We actually do have severity labels on the data too, which you can play around with.
Here's the plots for critical bugs: https://t.co/0GxSI0I6mW
We still want to calibrate our severity classifier more carefully though, so take this as directional
I've been playing around with eval-ing AI code review tools at work. We track 22 different ones. @greptile V4 had the single biggest improvement I've ever measured.
Recall increased 47% from 38.7 β 56.9%
Our code review tracker caught the release of Claude Code Review before @AnthropicAI announced it.
Greptile v4 hit #1 on CRB. The tracker caught it before their announcement.
The data is predicting something new from @Devin Review in the near future.
Here's how. π§΅
@Cognition's Devin Review is the fastest improving tool on the tracker. Since they released earlier this year, the tool has seen a 26 point increase in F0.5 score.
Theyβre now a top 3 tool in recall. Expect great things soon π
Every AI code review benchmark published so far has one thing in common: they were all made by vendors.
And somehow, their own tool always wins.
That just changed with the first independent benchmark.
Heres how we performed on real OSS PR's! π