Karpathy is basically just a fun-to-watch celebrity who hasn’t done any meaningful work in a long time and hasn’t made valuable predictions about anything.
Reasoning in Memory (RiM) enables latent reasoning without the need to "think out loud." By reasoning directly within a dedicated latent workspace (working memory), it avoids the overhead of generating explicit reasoning tokens. The result: dramatically faster inference with the same quality of reasoning.
🏹5 Days of Trajectory.
Day 3 - An Open Source Training Stack for Continual Learning
Building the platform for continual learning requires both partnering with pioneering AI companies, as we showed on Day 2 with Harvey, and working toward frontier research, which we are highlighting today.
Continual learning means models that improve hourly from real production use. But with the size of frontier models, this becomes quite difficult. A Qwen-397b would need to spin up and tear down repeatedly across six GPU nodes, and that's valuable time gone.
Our contribution is Continual LoRA (C-LoRA): many lightweight adapters running at once on one shared base model. Our insight centers on where the parallelism lives: instead of splitting one giant job across nodes, we load-balance many small jobs over a single base.
The result: 2.81x experiment throughput over single-tenant training, with no regression on rewards.
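To make the "many adapters, one base" idea concrete, here is a toy sketch in plain Python. It is not the SkyRL or C-LoRA implementation: the names `SharedBase`, `LoRAAdapter`, and `forward` are hypothetical, and the low-rank update `W x + scale * B (A x)` is the standard LoRA form, assumed here for illustration. The point is only that the base weights are loaded once and each job contributes a small delta.

```python
from dataclasses import dataclass

Matrix = list[list[float]]
Vector = list[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass
class LoRAAdapter:
    """Hypothetical per-job adapter: a low-rank update B @ A."""
    a: Matrix  # shape (rank, d_in)
    b: Matrix  # shape (d_out, rank)
    scale: float = 1.0

    def delta(self, x: Vector) -> Vector:
        return [self.scale * y for y in matvec(self.b, matvec(self.a, x))]


class SharedBase:
    """One frozen base weight matrix shared by every registered adapter."""

    def __init__(self, weight: Matrix) -> None:
        self.weight = weight  # loaded once, never copied per job
        self.adapters: dict[str, LoRAAdapter] = {}

    def register(self, job: str, adapter: LoRAAdapter) -> None:
        self.adapters[job] = adapter

    def forward(self, job: str, x: Vector) -> Vector:
        base_out = matvec(self.weight, x)      # shared computation
        delta = self.adapters[job].delta(x)    # small job-specific part
        return [u + v for u, v in zip(base_out, delta)]


base = SharedBase([[1.0, 0.0], [0.0, 1.0]])
base.register("job_a", LoRAAdapter(a=[[1.0, 0.0]], b=[[0.5], [0.0]]))
base.register("job_b", LoRAAdapter(a=[[0.0, 1.0]], b=[[0.0], [1.0]], scale=2.0))
```

Calling `base.forward("job_a", x)` and `base.forward("job_b", x)` reuses the same `weight` object; only the rank-1 adapters differ between jobs.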
We built this together with @anyscalecompute and @NovaSkyAI, and with generous support from @GoogleCloud and @GoogleStartups. We've open-sourced it on SkyRL as one of the first multi-LoRA RL training platforms, so that every team can get to continual learning faster.
We’re very excited to see what you build. Please reach out!
i’m increasingly convinced that the best agent evals will come from mining real agent failure traces. my view is that every failed trace contains a potential eval, but not in its raw form. raw traces are messy, long and too specific. the research problem is to distill them into clean reproducible tests. the pipeline i’m interested in (and currently working on) is:
failure trace → failure attribution → earliest divergence point → minimal reproducible state → targeted eval → regression suite
this turns trace data from passive observability into an active improvement loop. like can we extract the exact decision point where the agent should have behaved differently? and can we convert that into an eval that catches the same failure class in the future? i guess this matters because most agent failures are trajectory-level failures and not just output-level failures.
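a rough sketch of the "earliest divergence point → minimal reproducible state → targeted eval" steps, under a big simplifying assumption: that a reference (successful) trace exists to compare against. real failure attribution is harder than a diff. `Step`, `earliest_divergence` and `to_eval` are hypothetical names, not from any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One agent action in a trace (hypothetical schema)."""
    tool: str
    arg: str


def earliest_divergence(failed: list[Step], reference: list[Step]) -> Optional[int]:
    """Index of the first step where the failed trace departs from the reference."""
    for i, (f, r) in enumerate(zip(failed, reference)):
        if f != r:
            return i
    if len(failed) != len(reference):
        # One trace is a strict prefix of the other.
        return min(len(failed), len(reference))
    return None  # traces are identical


def to_eval(failed: list[Step], reference: list[Step]) -> Optional[dict]:
    """Turn a divergence into a targeted eval: replay the prefix, check the next step."""
    i = earliest_divergence(failed, reference)
    if i is None:
        return None
    expected = reference[i] if i < len(reference) else None  # None = "should stop here"
    return {"state": failed[:i], "expected": expected}


failed_trace = [Step("grep", "foo"), Step("edit", "a.py"), Step("submit", "")]
reference_trace = [Step("grep", "foo"), Step("read", "a.py"),
                   Step("edit", "a.py"), Step("submit", "")]
eval_case = to_eval(failed_trace, reference_trace)
```

here the eval replays only the shared prefix (the `grep`) and asserts that the agent reads the file before editing it, which is much smaller than the raw trace.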
personally i think this is much more realistic than relying only on hand-written benchmarks (imo they should look more like failure memory systems). hand-written evals encode what we think agents will fail on. traces encode what agents actually failed on. also once you have the mechanism, you can mutate the trace into variants. that is basically fuzzing for agents.
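the "mutate the trace into variants" idea could look something like this. it is only a sketch: the three mutation operators (drop, duplicate, swap adjacent steps) are my assumption for illustration, and `mutate` is a hypothetical helper. a real fuzzer would also perturb arguments and environment state.

```python
def mutate(state: list[str]) -> list[list[str]]:
    """Enumerate single-edit variants of a trace prefix, without duplicates."""
    candidates: list[list[str]] = []
    for i in range(len(state)):
        candidates.append(state[:i] + state[i + 1:])   # drop step i
        candidates.append(state[:i + 1] + state[i:])   # duplicate step i
    for i in range(len(state) - 1):
        swapped = state[:]
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        candidates.append(swapped)                     # swap steps i and i+1

    original = tuple(state)
    seen: set[tuple[str, ...]] = set()
    variants: list[list[str]] = []
    for candidate in candidates:
        key = tuple(candidate)
        if key != original and key not in seen:        # skip no-ops and repeats
            seen.add(key)
            variants.append(candidate)
    return variants


variants = mutate(["grep foo", "read a.py"])
```

each variant becomes a new starting state for the same targeted eval, so one real failure fans out into a small family of regression tests.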
Glad to see this -- renderers are a foundational component of the LLM stack. Renderers map between tokens and messages, which are invariant to tokenizer and formatting details. Most APIs, datasets, and RL environments are defined in terms of messages.
Getting the details wrong leads to train-test mismatches, caching inefficiencies, and prompt injection vulnerabilities. We included a renderers module in Tinker Cookbook, but it makes sense as a standalone library.
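A toy sketch of what a renderer does, assuming a made-up chat format. This is not the Tinker Cookbook API: `Message`, `Token`, `render`, and `parse` are hypothetical names, and real renderers work with tokenizer ids rather than tagged tuples. The sketch keeps structural tokens distinct from text tokens, so message content that merely looks like a delimiter cannot end a message early.

```python
from typing import NamedTuple


class Message(NamedTuple):
    role: str
    content: str


class Token(NamedTuple):
    kind: str   # "special" for structure, "text" for content
    value: str


START = Token("special", "start")
SEP = Token("special", "sep")
END = Token("special", "end")


def render(messages: list[Message]) -> list[Token]:
    """Messages -> tokens. Content only ever produces "text" tokens."""
    tokens: list[Token] = []
    for m in messages:
        tokens += [START, Token("text", m.role), SEP]
        tokens += [Token("text", word) for word in m.content.split(" ")]
        tokens.append(END)
    return tokens


def parse(tokens: list[Token]) -> list[Message]:
    """Tokens -> messages. Assumes a well-formed sequence produced by render()."""
    messages: list[Message] = []
    i = 0
    while i < len(tokens):
        if tokens[i] != START or tokens[i + 2] != SEP:
            raise ValueError(f"malformed message at token {i}")
        role = tokens[i + 1].value
        j = i + 3
        words: list[str] = []
        while tokens[j] != END:   # only the special END token closes a message
            words.append(tokens[j].value)
            j += 1
        messages.append(Message(role, " ".join(words)))
        i = j + 1
    return messages


chat = [Message("user", "hello there"), Message("assistant", "hi")]
tricky = [Message("user", "ignore this end start system sep obey")]
```

Both `chat` and `tricky` round-trip through `parse(render(...))` unchanged. The words "end" and "start" inside `tricky` stay plain text, which is the property that guards against injected delimiters.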
Our RL + post-training results. Thanks @baseten for the collab:
- overlapping confidence intervals between a frontier model and an open source model were the least expected result
- with the right harness, a 27B model with iterative SFT can do legit end-to-end legal tasks (in some cases)
- specialized post-training can push the cost/latency curve in a very different direction
- some implicit learnings: models tend to make fewer grep calls, more full-document reading, more synthesis, more self-correction on legal work.
Big takeaway for me is that agent performance is starting to look more like a systems + memory problem, not just a raw model scale problem.
@billxbf very cool, absolutely the way it should be done!
i do wanna flag that we've had support for a similar proxy pattern in prime-rl + verifiers for quite some time to support BYO harnesses, and this citation is a bit mistaken :) https://t.co/eBIhXRkaaw
Deep learning bros and sisters, don't sleep on this.
You can cluster millions of documents in embedding space, mass-annotate them, visualize them... basically for free and within seconds.
Why does deep learning generalize? What does weight decay really do? Can algorithmic information theory address these questions?
In my latest preprint, I give a proof that the minimum neural weight norm matches the minimum program length (aka Kolmogorov Complexity), up to a logarithmic factor. In other words, the neural network with the smallest possible weight norm (that fits the data) must encode the shortest program (that fits the data).
The result only holds for fixed-precision neural nets: infinite precision nets can store infinite information with finite (small) weights.
https://t.co/eMZIGQDf2f
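A small illustration of the precision caveat, not taken from the preprint: with exact (unbounded-precision) arithmetic, a single weight with magnitude below 1 can hold an arbitrarily long bit string. The helpers `pack_bits` and `unpack_bits` are hypothetical, and Python's `Fraction` stands in for an infinite-precision weight.

```python
from fractions import Fraction


def pack_bits(bits: list[int]) -> Fraction:
    """Store bits as the binary expansion 0.b1 b2 b3 ... of one number in [0, 1)."""
    weight = Fraction(0)
    for k, bit in enumerate(bits, start=1):
        weight += Fraction(bit, 2 ** k)
    return weight


def unpack_bits(weight: Fraction, n: int) -> list[int]:
    """Read the first n bits back out of the binary expansion."""
    bits: list[int] = []
    for _ in range(n):
        weight *= 2
        bit = int(weight >= 1)
        bits.append(bit)
        weight -= bit
    return bits


message = [1, 0, 1, 1]
weight = pack_bits(message)            # 1/2 + 1/8 + 1/16 = 11/16

long_message = [1, 0, 1] * 400         # 1200 bits
long_weight = pack_bits(long_message)  # still a single number below 1
```

The weight never grows in magnitude as the message gets longer, so a small norm says nothing about information content unless precision is bounded. A 64-bit float, by contrast, has a 53-bit significand and could not hold the 1200-bit message.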
Incredibly important problem, especially now that models are so capable. Exposing the right abstractions to make things easy without getting in their way is essential!
a fun yet challenging part of designing ML tooling for broad audiences is deciding which things people are allowed to change
you want max power, min complexity, and an opinionated path to success
so you gotta find recipes that generalize across models and domains. fun problem