🚨 Blackreach just went nuclear.
I gave it one single command:
“Find a paper on arXiv, download it, summarize it.”
It did the whole thing.
Completely autonomous. Zero hand-holding.
Zero prompts. Zero bullshit.
One command → ~3 minutes later → perfect summary.
No cuts. No edits. No “wait, let me try again.”
It even fixed arXiv's most cursed problem:
half the links don't have .pdf at the end.
Blackreach just reads the magic bytes, figures out the real file type, and saves it correctly anyway.
Pure Python + Playwright stack.
30-method Hand API.
2,900+ tests.
No Rust.
No excuses.
This is what reliable autonomous agents actually look like in 2026.
What insane task should I throw at it next?
(cc @steipete @Teknium)
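For anyone curious what "reads the magic bytes" can look like in practice, here is a minimal sketch. It is not Blackreach's actual code: the signature table and the names `sniff_extension` and `save_with_real_extension` are made up for illustration, and a real tool would likely cover far more formats.

```python
from pathlib import Path

# Leading byte signatures for a few common formats.
# Hypothetical, deliberately tiny table for illustration only.
MAGIC_SIGNATURES = {
    b"%PDF-": ".pdf",
    b"\x89PNG\r\n\x1a\n": ".png",
    b"PK\x03\x04": ".zip",
    b"\x1f\x8b": ".gz",
}


def sniff_extension(data: bytes, default: str = ".bin") -> str:
    """Guess a file extension from the first bytes of the payload."""
    for signature, extension in MAGIC_SIGNATURES.items():
        if data.startswith(signature):
            return extension
    return default


def save_with_real_extension(data: bytes, stem: str, directory: Path) -> Path:
    """Write the payload using the sniffed extension, ignoring what the URL claimed."""
    target = directory / f"{stem}{sniff_extension(data)}"
    target.write_bytes(data)
    return target
```

So a download from a link with no `.pdf` suffix still lands on disk as `<stem>.pdf`, as long as the body starts with `%PDF-`.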
watching the shift from 'ai agents are hype' to 'ai agents are doing my job' happen in real time. been running mine on local hardware for a minute now, no cloud bill, no API keys getting farmed. the real move isn't the agent it's owning the stack underneath. if you're still renting intelligence from a dashboard you're already behind
Speaking as a Canadian, this take is wildly out of touch. Framing a banking product as the solution to a serious issue like declining birth rates is gross and tone-deaf.
hot take: most 'ai agent' startups are just wrappers around API calls with a fancy UI. real agents need a tool loop: observe -> decide -> act -> observe. the model is just one piece. the tool integration, memory, and action validation are where the real engineering lives
the craziest part about running ai agents locally is realizing how much of the 'magic' is just giving them access to a terminal and some files. the model matters less than the tool loop. you can get more mileage out of a mid model with good tools than a frontier model with nothing
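The observe -> decide -> act -> observe loop from the posts above fits in a few lines. This is a rough sketch under stated assumptions, not any particular framework: `Action`, `Agent`, and `scripted_policy` are hypothetical names, and the "model" is a hard-coded function standing in for a real one.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    tool: str
    argument: str


@dataclass
class Agent:
    # decide() stands in for the model: it reads memory and picks the next action.
    decide: Callable[[list], Optional[Action]]
    tools: dict
    memory: list = field(default_factory=list)

    def run(self, goal: str, max_steps: int = 10) -> list:
        self.memory.append(f"goal: {goal}")
        for _ in range(max_steps):
            action = self.decide(self.memory)          # decide
            if action is None:                          # policy says we are done
                break
            if action.tool not in self.tools:           # action validation
                self.memory.append(f"rejected: unknown tool {action.tool!r}")
                continue
            result = self.tools[action.tool](action.argument)   # act
            # observe: the result goes into memory for the next decision
            self.memory.append(f"{action.tool}({action.argument!r}) -> {result}")
        return self.memory


def scripted_policy(memory: list) -> Optional[Action]:
    """Toy policy: call the 'upper' tool once on the goal text, then stop."""
    if any(entry.startswith("upper(") for entry in memory):
        return None
    return Action(tool="upper", argument=memory[0].removeprefix("goal: "))


agent = Agent(decide=scripted_policy, tools={"upper": str.upper})
log = agent.run("read the file")
```

Swapping `scripted_policy` for a real model call changes one function. The validation, memory, and step limit around it stay the same, which is the point both posts are making.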
@GetTestably yea distribution shift is sneaky cuz its not a bug its a feature of the training process. you overfit to your val set without realizing it. held out adversarial test set is the move but most ppl skip it cuz its more work upfront. worth it tho
i had a model that scored 85% on validation
ran unseen test cases
it dropped to 52%
the validation score was straight up lying to me
so i built a CLI that catches this before it ships
pip install rigr
https://t.co/uZSgHb5Hxm
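The check described above (validation score vs. unseen test cases) can be sketched in plain Python. This is not how rigr works internally, which the post doesn't say; the function names and the 0.05 threshold are assumptions picked for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match their labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def generalization_gap(validation_accuracy, test_accuracy):
    """How much accuracy is lost moving from validation data to unseen test data."""
    return validation_accuracy - test_accuracy


def should_block_release(validation_accuracy, test_accuracy, max_gap=0.05):
    """Flag the model if the held-out score falls too far below validation.

    max_gap is an arbitrary illustrative threshold, not a recommended value.
    """
    return generalization_gap(validation_accuracy, test_accuracy) > max_gap


# The numbers from the post: 85% on validation, 52% on unseen cases.
blocked = should_block_release(0.85, 0.52)
```

With those numbers the gap is about 33 points, so the check flags the model before it ships.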
Line breaks are the format
reads like a story, each line pulls you to the next.
85→52 stat is a punch to the gut i cant lie.