Introducing Repo2RLEnv
Turn any repository into runnable, verifiable coding environments built from real PRs and commits for coding-agent evaluation or RL training
> uv pip install repo2rlenv
@JoshPurtell @carrynointerest FWIW, VCs have historically been one of the biggest champions of OSS as well, for many reasons including cost (see work posted by a16z people, for example). VC strategy is not a monolith.
Building autonomous agents for scientific discovery? 🧬🤖
@GoogleDeepMind Science Skills is now available on GitHub. We've open-sourced this specialized toolkit to accelerate your agentic workflows with scientific grounding and higher token efficiency.
Download now ↓
https://t.co/cwp1HOeKvo
Did you know that whether your benchmark dataset is private or public has little bearing on how fast it saturates? In our ICML 2026 paper, we test that hypothesis (and more) and provide a comprehensive analysis of why benchmarks saturate.
Read the paper! 👇
🚨 As AI models improve, many benchmarks are becoming saturated and losing their ability to distinguish between models. 🚨
Check out our new @icmlconf paper: “When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation”
Hugging Face is the home for AI & ML across every domain, including biomedical!
The @NIH just added the @huggingface Hub to its official list of Generalist Repositories for data sharing.
NIH-funded? You can point to the Hub in your data sharing plan 🤗
Weekend mini project! Since commentary on AI is inherently interdisciplinary, we connected the observations in the @Pontifex's encyclical with decades of scholarship in Responsible AI and Ethics research and created an interactive space with these annotations!
Work with Ian Reynolds, @YJernite, and @mmitchell_ai
Lots to unpack. We started with 105 annotations. Please submit pull requests for more that we may have missed!
https://t.co/1aq3rCfdGQ
@trydotworks @natolambert This makes no sense? There are several companies that have never open-sourced a model, even when they were first starting out. It’s fine to optimize for reach and profits; it’s just that open science doesn’t have those same motivations for the people pushing it 🤷‍♂️
@AndrewCurran_ Sometimes it's funny to find edge cases -- Claude 4.7 Opus struggled hard with a duckdb js typecasting issue that GPT 5.4 oneshotted, for example. The spiky intelligence theory continues to hold
I feel like the current state of LLM evals overindexes on prompts/datasets and underindexes on metrics. You used to be able to measure something abstract like “fairness” in a great many ways using the same overused COMPAS dataset, often with highly contextualized, beautifully designed metrics. Now it’s all exact matches and pass@k (basically different flavors of accuracy) and that’s mostly it. I do see cost and time measurements, but those I would argue are properties of the system itself rather than of the interaction being measured.
Of course these evals will get saturated: you’re never measuring anything novel!
Bring back measurement science to actually measure behavior, instead of engineering new prompts and then immediately discarding good data on the first sign of saturation :)
I said this to @citrini last night, but in the future, will we really need storage?
I take a ton of photos of my kids, and they are on my phone and in a cloud. But in the future, won't I just tell a model "generate a photo from my son's 7th birthday" and it'll be just as good?
Some of these new job descriptions I see are frankly insane? “hard science”, “AI-pilled”, etc. What happened to humility? leadership? willingness to learn?
In the modern workplace, AI agents are putting humans into their own “productivity” islands disconnected from each other 🫠
@LChoshen @_lewtun @andrewwhite01 We had this! But yes, could be cool? I just really don’t know why people keep doing chart crimes despite all the tweets about them. Feels like a social experiment at this point
https://t.co/6gJbbAZEdU