We just launched a new version of @Macroscope code review that finds significantly more bugs, including 71% more critical issues, 27% more high-severity issues, 50% more medium-severity issues, and 31% more valid issues overall in our benchmarks, at the expense of a modest increase in false positives and review latency.
imo one of the most underrated @Macroscope features is Approvability, which auto-approves PRs that are bug-free and have a low blast radius. Across ALL of our customers, Macroscope auto-approves 34% of all PRs (up from <5% in January).
I was honestly blown away when we first pulled these numbers. That’s a huge % of PRs that devs don’t need to waste time reviewing/stamping.
does anybody know what happened to poco dolce chocolates? best chocolate I’ve had in sf (h/t @jack) but they’ve been sold out every time I’ve checked in the last like 4 years.
also accepting recs for anything on par
You can totally point it to Agents.md and have the actual policy details live there if you want! Our front matter is still helpful because you can configure deterministic gates (e.g. wait for certain other Checks to finish first before making the approvability determination) and enable other tools the agent should have access to (e.g. launchdarkly, mcps, etc).
still dreaming of the perfect email client.
what i want: the performance and reliability of Mimestream, with the power-user shortcuts of Superhuman, with the AI smarts of Notion Mail (rip), with the mobile composition and voice capabilities of Avec.
This is a love letter to @temporalio. Not because of the product (which powers a big chunk of @Macroscope's infrastructure and we love), but because of a tiny purple tardigrade.
A few months ago I spoke at Temporal’s Replay conference in SF. Afterwards, I grabbed a swag bag on the way out. Inside was a plush tardigrade. I’m embarrassed to say I didn’t know what a tardigrade even was before this. Thought it was mythical? Wtf does it even have to do with Temporal. I still don’t know.
That plush somehow became my almost-2-year-old son’s most prized possession. He carries it everywhere. Sleeps with it. Calls it “tar-da-grade,” which is remarkable considering that’s about 25% of his vocabulary.
A few months later, disaster struck. One of our dogs chewed the tardigrade to pieces. Our son was devastated.
“Where’s tardigrade?”
“He’s… at the doctor.”
I tried superglue surgery before realizing maybe glue fumes and toddler bedtime weren’t an ideal combination. I searched everywhere online but couldn’t find another one.
So Sunday night I sent a slightly ridiculous message to our Temporal AE, Zac Bischoff, asking if there was any chance he could help a desperate dad.
The next day, not one but TWO replacement tardigrades showed up at my office. Durable execution ftw.
Thank you Zac, and thank you Temporal. You made a 2yo happy (albeit somewhat oblivious) and a father even happier.
Aside from the obvious value of improving bug detection, there’s an interesting shift afoot. When we first shipped @Macroscope's code review feature, *all* of our customers cared deeply about minimizing false positives during code review. This made total sense. When humans are reviewing every PR comment in GitHub, the cost and annoyance of a false positive is very high because a human dev is spending time invalidating an issue.
But our most bleeding-edge customers have changed how they work: coding agents are increasingly the first-line reviewers of PR comments, not humans. This changes the bias around the trade-off between precision and recall for code review because the effort of validating/invalidating a review comment is much lower.
This, combined with flexible controls like the Detection Mode setting we shipped (lets customers choose whether they want a precision- or recall-oriented code review), gave us new latitude to build a much more aggressive variant of our agentic pipeline, optimizing for much higher bug detection. And the results are palpable: 71% more critical issues, 27% more high-severity issues, and 31% more valid issues overall.
90% of the customers in our early beta strongly preferred this new version of our pipeline, citing the increased coverage in bug detection as well worth the modest increase in false positives. For customers who still prefer a precision-oriented experience, it’s still a simple setting away!
really proud of our code review team for eking out such meaningful performance gains. our code review pipeline is already extremely optimized with best-in-class performance across precision/recall/latency, so driving this level of improvement is non-trivial and takes real engineering.
to give customers control over these tradeoffs, we added a new setting that lets you choose whether Macroscope should prefer coverage or precision when reviewing code.
prefer coverage: catches the most bugs, but with more comments and more false positives (85% precision in our benchmarks). we recommend this if you’re using coding agents to validate PR comments.
prefer precision: prioritizes high-confidence findings (97% precision in our benchmarks). you’ll catch fewer bugs than coverage mode, but it’s still best-in-class versus other code review tools.
you can configure this per repo or per dev. so your repo can default to prefer precision, while an individual dev opts into prefer coverage for their own PRs.
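for anyone who thinks in code, here's a rough sketch of how the repo default plus per-dev override could resolve, and what the two benchmark precision numbers mean in false positives per batch of comments. to be clear, this is not Macroscope's actual config format or API: `RepoSettings`, `resolve_mode`, `expected_false_positives`, and the dev names are all hypothetical, made up for illustration. only the 85% and 97% precision figures come from the benchmarks above.

```python
from dataclasses import dataclass, field
from typing import Dict

# Precision figures quoted from the benchmarks above.
# Everything else in this sketch is a hypothetical illustration.
BENCHMARK_PRECISION = {"coverage": 0.85, "precision": 0.97}


@dataclass
class RepoSettings:
    """Hypothetical per-repo settings: a default mode plus per-dev overrides."""
    default_mode: str = "precision"
    dev_overrides: Dict[str, str] = field(default_factory=dict)


def resolve_mode(settings: RepoSettings, dev: str) -> str:
    """A dev's own override wins; otherwise fall back to the repo default."""
    return settings.dev_overrides.get(dev, settings.default_mode)


def expected_false_positives(mode: str, comments: int) -> int:
    """Expected false positives among `comments` review comments,
    assuming the benchmark precision holds (a big assumption in practice)."""
    return round(comments * (1 - BENCHMARK_PRECISION[mode]))


# Repo defaults to prefer precision; one dev opts into prefer coverage.
settings = RepoSettings(default_mode="precision",
                        dev_overrides={"alice": "coverage"})

for dev in ("alice", "bob"):
    mode = resolve_mode(settings, dev)
    print(dev, mode, expected_false_positives(mode, 100))
```

at benchmark precision that works out to roughly 15 false positives per 100 comments in coverage mode versus roughly 3 in precision mode, which is why coverage mode pairs best with a coding agent doing the first pass on comments.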