In that benchmark comparison, do you even have the sample size to compare two models, or are you making decisions based on statistical noise?
arXiv:2605.30315v1 offers a simple test
Outrider added it to our fork of lm-evaluation-harness
Install Outrider: https://t.co/apCnaQRZAW
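To make the sample-size question concrete, here is a minimal sketch of one standard check: a normal-approximation confidence interval on the paired per-item score difference between two models. This is not the test from the paper above and not Outrider's implementation. The function names `paired_diff_ci` and `is_distinguishable` are hypothetical, and the sketch assumes both models were scored on the same items in the same order.

```python
import math
from statistics import mean, stdev


def paired_diff_ci(scores_a, scores_b, z=1.96):
    """Return (mean_diff, lower, upper) for the per-item difference A - B.

    Uses a normal approximation, so it is only a rough guide on small samples.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same items")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two items")
    d = mean(diffs)
    # Standard error of the mean difference shrinks as 1/sqrt(n).
    se = stdev(diffs) / math.sqrt(n)
    return d, d - z * se, d + z * se


def is_distinguishable(scores_a, scores_b, z=1.96):
    """True when the interval for the mean difference excludes zero."""
    _, lower, upper = paired_diff_ci(scores_a, scores_b, z)
    return lower > 0 or upper < 0
```

If the interval straddles zero, the benchmark as sampled cannot separate the two models, whatever the point estimates say.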
@artists_voyage Especially insidious for segment-level error analysis
You slice into subgroups to improve over a baseline, the CIs widen with each cut, and decisions made off the point estimates end up chasing noise.
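A small sketch of why the intervals widen with each cut, using the Wilson score interval for a proportion. The helper names `wilson_interval` and `interval_width` are hypothetical, and the subgroup sizes in the usage note are illustrative, not taken from any real evaluation.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def interval_width(successes, n, z=1.96):
    """Width of the Wilson interval; grows as the slice gets smaller."""
    lower, upper = wilson_interval(successes, n, z)
    return upper - lower
```

Holding accuracy at 70%, comparing `interval_width(700, 1000)`, `interval_width(175, 250)`, and `interval_width(35, 50)` shows the width growing as the slice shrinks, so a gain that looks clear on the full set can vanish inside a subgroup's interval.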
Now @remyxai Outrider is on @github Marketplace!
Schedule Claude to implement core methods from the most relevant papers for your repo using GitHub Actions!
https://t.co/JqE1DKegF5
'CrossView Suite' introduces CrossViewBench. Its focus on explicit alignment mechanisms and object-level consistency across views offers a strong framework for evaluating the fidelity of VQASynth's synthetic data and improving the robustness of the generated spatial questions.
@remyxai @AnthropicAI Coming soon: evals on every PR. Your benchmark suite and datasets run against the diff, with results linked on the PR before it is marked ready for review.
Design partner pilot opens soon. DM if interested.
Under the hood: arXiv → @RemyxAI ranks weekly against your team's commit history → @AnthropicAI Claude Code drafts the integration → Outrider opens a draft PR with tests
When code evolves, developers signal an implicit preference for the new over the old.
Scale that analysis across many repos and patterns emerge.
Taste is learnable too, even if OAI hasn't figured out selection yet.
Uses a Gaussian Process to learn contributor preferences implicitly from repo merge history
Next, I'm applying the GP to synthesize a larger volume of preference data to help fine-tune an open-weight coding model with DPO and LoRA.
https://t.co/8bqDhtqdTx
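A rough sketch of how merged changes could be turned into preference records, treating the post-merge code as preferred over the code it replaced. The `Revision` type, the `to_preference_pairs` helper, and the prompt wording are all hypothetical. The `prompt` / `chosen` / `rejected` keys follow a layout commonly used for DPO datasets, and the GP step described above is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Revision:
    """One merged change to a file (hypothetical container)."""
    path: str
    before: str
    after: str


def to_preference_pairs(revisions):
    """Build DPO-style records: merged code is 'chosen', old code is 'rejected'."""
    pairs = []
    for rev in revisions:
        # Skip whitespace-only changes; they carry no preference signal.
        if rev.before.strip() == rev.after.strip():
            continue
        pairs.append({
            "prompt": f"Improve the code in {rev.path}.",
            "chosen": rev.after,
            "rejected": rev.before,
        })
    return pairs
```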
The space of possible improvements to your AI model is large, while evaluation is costly.
LILO learns efficiently from a decision maker's preferences, balancing exploration and exploitation in a principled way w/ Bayesian Optimization
https://t.co/rpW3323Col