Can coding agents stay coherent over a 1 billion token budget?
Can they build Slack from scratch?
Rewrite a JAX codebase in PyTorch?
Build a C compiler in Rust?
Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.
skills are great at fetching the right context at the right time. but not all context is good for you 😈
come watch @mbrg0 @ blackhat this summer to see what we found
excited to speak about our agent detonation chamber this summer at #BHUSA!
how do you 'scan' txt for 'security badness'? not w wishful analysis by an llm judge
what we really want is: what will this thing cause my agent to *DO*?
ft/ francesco montorsi @lana__salameh @roeybc
📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇
https://t.co/MSPMwnbhVt
@AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows.
1/6🧵
We got an 8-figure acquisition offer 2 days after launch.
We said no, because the problem we're solving is worth way more than that.
It’s 2026, but teams are only getting lonelier, and context is still the problem.
The issue isn’t intelligence. Your team has plenty of that.
It’s shared memory and context, the thing that makes 10 A-players feel like 1.
That’s what we’ve solved with @playdotfast, while making work more fun.
We're killing traditional SaaS, and believe you me, we're leaving no holds barred.
🤗🤗🤗introducing Hugging Science -- the home of AI for science 🤗🤗🤗
open models and datasets are the powerhouse of science (see the PDB), but finding the models and data you actually need for your breakthrough is hard af
you shouldn't need to scrape arxiv, own your own wetlab, fight a custom HDF5 parser, build a fusion stellarator, and beg for compute before you've trained a single epoch
so we're changing that
we've put all the best science on @huggingface in one place:
- 78GB of genomics data
- 11TB of PDE simulations
- 100M cell profiles
- 9T DNA base pairs
- 13M molecular trajectories
- 400k medical QA pairs
and much more, all open, and all ready for training (+ you can also now filter and search by domain, task, and keyword)
we've put together all the biggest releases from our partners at NASA, Google, OpenAI, Meta FAIR, Arc Institute, Ginkgo, SandboxAQ, Proxima Fusion, NVIDIA, Ai2, OpenADMET, InstaDeep, Future House, Polymathic AI, LeMaterial, Earth Species Project, Merck, and Eve Bio
if you're not sure where you fit in -- work on open challenges for problems that matter: including fusion stellarator design, ADMET, antibody developability, multilingual medicine, catalysis and materials, and scientific reasoning.
we're already changing how science gets done:
a fusion startup needed a benchmark for stellarator plasma confinement that didn't exist. @proximafusion shipped ConStellaration on Hugging Science: a leaderboard, dataset, and eval metrics, all in one place.
a drug discovery team wanted to predict hPXR induction. OpenADMET put up a blind challenge: 11,000+ compounds assayed at Octant, 513 held out, two tracks (pEC50 + structure). Anyone in the world can train and submit.
an antibody team at @Ginkgo released GDPa1, a developability dataset for stability, manufacturability, and immunogenicity prediction, with a live leaderboard scoring every submission.
if you know a problem the ML community should be working on, let us know. make a challenge! this is about putting all the tools for solving science in one place. so we can hillclimb!
→ https://t.co/T4l4r1lDz0
A principle of security is that you should never assume you're the only one who can figure something out – and that therefore, it's best to be open about tools, findings, and methods.
SkillsBench is now cited by the HY-3 model card. Congrats to @TencentHunyuan on the launch and kudos to the SkillsBench team / community!
We've made a lot of improvements to the tasks, codebase, and tooling in the past month, based on feedback from users and lab partners. We will also share an updated leaderboard with more models and agent harnesses soon, stay tuned!
@MKelner @michael_chomsky @dexhorthy @Vtrivedy10 just fix the problems your users have as fast as possible. if you need to, then build a harness. if it works out of the box, use the existing code. it's just stacked while loops.
Almost got hacked this morning - here's a replay of what happened:
1. A VC whom I've met in person reached out for a catchup
2. She sent me a Microsoft Teams link a few min ahead of the meeting
3. When I joined, it asked me to download an update script
4. Got a funny feeling and ended the call immediately
5. Claude inspected the file and it was indeed malicious
Not sure if this person was just hacked or a bad actor, but I wanted to post this as a PSA. Stay safe.
asked claude to fix an nccl comms error between gpus. it replaced nccl with http. the gpus are now emailing each other their gradients. problem solved, technically.
i have never been more impressed.
This is the spirit of Silicon Valley. Let me tell you a story.
On 12/21/2025, Xiangyi called me with a pitch: let's gather a team and build a new benchmark, SkillsBench, following the community-contribution model of Terminal-Bench (a project we're both contributors on). We'd reuse the "harbor" infra so we wouldn't have to reinvent the wheel. He said skills had just been recognized by Anthropic and that the timing was perfect.
So I asked: what can you offer contributors in return?
"Authorship on an ICML 2026 paper."
I asked how many citations we could realistically expect. We looked at comparable work like MCPBench — only a handful of citations.
And honestly, at that point, Benchflow was nothing (bear with me, @xdotli). No successful project. No track record. This was the first paper Xiangyi had ever led — or ever written. No professor advising. No experience managing a large-scale open source community — and we all know how hard that is. People sign up and never contribute.
Deep down, I was ready to say no and spend my time on something with a safer payoff.
But then Xiangyi said something that stuck with me:
"If we somehow make it, I know how to make it go viral on X."
On paper, there was no reason to believe him. But it wasn't what he said — it was how he said it. There was something in his voice that night. No hesitation. No hedging. Just raw, almost irrational conviction that this was going to work. I'd talked to plenty of people with ideas before, but this was different — this was founder energy. The kind where someone has already decided the outcome and is just looking for people willing to run alongside them.
So I took a leap of faith. I decided to bet on the person, not the project.
That's why I joined SkillsBench and Benchflow.
And it did go viral. @garrytan and many others reposted us. We hit a few million views. The paper already has 27 citations. He personally gained 3k+ followers. And many more projects, like ClawsBench, are on the way.
Fast forward to today — Xiangyi is turning down multiple 10M+ acquisition offers and 1M+ personal compensation to keep pushing Benchflow's vision. From a guy with no paper, no track record, and nothing but conviction on a December phone call — to building something unicorn companies want to buy. In 3 months.
This is the story of SkillsBench and Benchflow. If you're determined enough, the world will rearrange itself around you. Go for it.