When building ColBERT, I assumed it would pave the way for hypernetwork-based, pruning-capable retrieval indexes. Let me explain.
The big insight in ColBERT is that we can encode each document upfront *not* into a vector, but into a rich scoring function, f: query -> float, which simultaneously supports pruning, so you can skip most computation.
In v1/v2, the choice of function was "a matrix + MaxSim". It showed that at inference time, we could do a lot better than dot products.
But in the future, the function could also be a small DNN constructed out of each document! The encoder is then a hypernetwork producing functions f with the same query -> float signature, letting each document choose its own strategy for judging whether a query is relevant to it.
How to do this while allowing pruning (so that retrieval is sub-linear at scale) is a rich question you can steal if you're doing NLP systems or IR.
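To make the "document as a scoring function" framing concrete, here is a minimal pure-Python sketch. `maxsim_scorer` follows the v1/v2 recipe described above (a matrix of token vectors plus MaxSim). `tiny_mlp_scorer` is a purely hypothetical illustration of the hypernetwork idea: its weights, mean-pooling, and layer shape are assumptions made for the example, not anything ColBERT ships. Neither scorer attempts pruning, which is the open question.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
# Every document, however it is encoded, exposes the same signature: query -> float.
ScoreFn = Callable[[Sequence[Vector]], float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_scorer(doc_token_vecs: List[Vector]) -> ScoreFn:
    """v1/v2 style: the document is a matrix, the function is MaxSim."""
    def score(query_token_vecs: Sequence[Vector]) -> float:
        # For each query token, keep its best-matching document token, then sum.
        return sum(
            max(dot(q, d) for d in doc_token_vecs)
            for q in query_token_vecs
        )
    return score


def tiny_mlp_scorer(w_hidden: List[Vector], w_out: Vector) -> ScoreFn:
    """Hypothetical: per-document weights (as a hypernetwork might emit) define a small net."""
    def score(query_token_vecs: Sequence[Vector]) -> float:
        n = len(query_token_vecs)
        dim = len(query_token_vecs[0])
        # Assumption for illustration: mean-pool the query tokens into one vector.
        pooled = [sum(q[i] for q in query_token_vecs) / n for i in range(dim)]
        # One ReLU hidden layer followed by a linear readout.
        hidden = [max(0.0, dot(row, pooled)) for row in w_hidden]
        return dot(w_out, hidden)
    return score


# Toy data: two query tokens, two document tokens, 2-d embeddings.
query = [[1.0, 0.0], [0.5, 0.5]]
f_maxsim = maxsim_scorer([[1.0, 0.0], [0.0, 1.0]])
f_mlp = tiny_mlp_scorer(w_hidden=[[1.0, 1.0]], w_out=[2.0])

for f in (f_maxsim, f_mlp):
    print(f(query))
```

The retrieval loop only ever calls `f(query)`, so the two kinds of document can sit in the same index; what differs is how much structure each function exposes for skipping work.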
You are right @badlogicgames I copied codex exec_command and write_stdin into Pi Agent.
Then I compared its performance to the plain bash tool. The result surprised me. Async bash lost in almost every task.
@deedydas Would love to know if the results change using different agents. They only tried using mini-SWE-agent.
@lateinteraction - wonder if dspy.RLM could have a crack at this.
Introducing ml-intern, the agent that just automated the post-training team @huggingface
It's an open-source implementation of the real research loop that our ML researchers do every day. You give it a prompt, it researches papers, goes through citations, implements ideas in GPU sandboxes, iterates and builds deeply research-backed models for any use case. All built on the Hugging Face ecosystem.
It can pull off crazy things:
We made it train the best model for scientific reasoning. It went through citations from the official benchmark paper. Found OpenScience and NemoTron-CrossThink, added 7 difficulty-filtered dataset variants from ARC/SciQ/MMLU, and ran 12 SFT runs on Qwen3-1.7B. This pushed the score 10% → 32% on GPQA in under 10h. Claude Code's best: 22.99%.
In healthcare settings it inspected available datasets, concluded they were too low quality, and wrote a script to generate 1100 synthetic data points from scratch for emergencies, hedging, multilingual etc. Then upsampled 50x for training. Beat Codex on HealthBench by 60%.
For competitive mathematics, it wrote a full GRPO script, launched training with A100 GPUs on https://t.co/udm7xGpNzR, watched rewards climb and then collapse, and ran ablations until it succeeded. All fully backed by papers, autonomously.
How does it work?
ml-intern makes full use of the HF ecosystem:
- finds papers on arxiv and https://t.co/brvCC7fLPa, reads them fully, walks citation graphs, pulls datasets referenced in methodology sections and on https://t.co/hrJuRkRyzi
- browses the Hub, reads recent docs, inspects datasets and reformats them before training so it doesn't waste GPU hours on bad data
- launches training jobs on HF Jobs if no local GPUs are available, monitors runs, reads its own eval outputs, diagnoses failures, retrains
ml-intern deeply embodies how researchers work and think. It knows what data should look like and what good models feel like.
Releasing it today as a CLI and a web app you can use from your phone/desktop.
CLI: https://t.co/l3K1PslZ1n
Web + mobile: https://t.co/orko5srL4H
And the best part? We also provisioned $1k of GPU resources and Anthropic credits for the quickest among you to use.
The new generation of open state-of-the-art single and multi-vector retrieval models is here
It's time, DenseOn with the LateOn 🎶
@LightOnIO releases models that leap past existing ones, and everything you need to do the same!
Imagine every pixel on your screen, streamed live directly from a model. No HTML, no layout engine, no code. Just exactly what you want to see.
@eddiejiao_obj, @drewocarr and I built a prototype to see how this could actually work, and set out to make it real. We're calling it Flipbook. (1/5)
Our newest model, π0.7, has some interesting emergent capabilities: it can control a new robot to fold shirts for which we had no shirt folding data, figure out how to use an appliance with language-based coaching, and perform a wide range of dexterous tasks all in one model!
I'm not trying to misrepresent anyone, and perhaps my Googler friends are misinformed. But I strongly suspect that by my own notions of what constitutes advanced AI adoption--and indeed, what most of the industry would expect from Google right now--you are not doing great.
At Anthropic, which is basically the bar at this point, everyone is burning, I'd guess, 10M to 15M tokens a day. If Google can convince me that half their engineers are burning 4M tokens a day, then I'd be happy to post a retraction with an apology.
@DavidGFar This is awesome. How far can you take this?
Are we at a point where you could train on the Hermes agent traces (https://t.co/srVlfcSdyZ) to get a lightning-fast routing head for an agent to select the right tool?
@BatsouElef https://t.co/eHEkP8Bl4N
I built a newsfeed for Substack that shows only long-form posts from the last 24 hours.
Already discovering way better writers.
"It seems to me that there will quickly reach a point where we can treat computers in much the same manner as we treat fellow humans, without ever assuming that they are human or should be.
For instance, I think it not unreasonable to ask a computer to understand me (maybe someday in natural language), to cooperate with me, to take some initiative on its own, and to make life simpler for me. It is reasonable for the computer to not understand occasionally, and to need clarification, or even for it to screw up and do as I said, and not what I meant."
- The Mind's I - Jan 21 1983 usenet
@ThePrimeagen I'm building https://t.co/QNCN8mkFlA
A newsfeed for Substack that shows only long-form posts from the last 24 hours.
Already discovering way better writers.