📢 New paper on scaling test-time compute for document re-ranking
Do you want to know how to train compact 2-3B models that can reach the performance of 70B+ LLMs in reasoning-intensive ranking?
📄Check out the distillation + RL recipe in our paper: https://t.co/uGnLdOW0cw
@matospiso I don't get the hype about multi-vector "late interaction". Dense embeddings have a capacity bottleneck. Sparse models like SPLADE are pretty good when trained well. Why bother going from multi-vector back to sparse just to make it scalable? Just use sparse from the beginning.
@TheIshanGoswami @NagetInc will eat Exa alive, Ishan. At least you'd better figure out how to drop your per-query cost haha, companies that follow the old playbook don't have a future in this era
This baby is crawling 2 billion pages per month and hitting 1,400 tokens/s. The room stays at 30+°C (86°F) with a signature Founder Mode scent of ionized ozone and 'Eau de Silicon'.
#WebSearch #LLMs #GPUPoor #Naget
@tomaarsen @ExaAILabs no they won't, but stay tuned for @NagetInc, we will open source everything including our whole billion-scale index to run on your own hardware
@levelsio You have a point, but that's how you get worse, censored, well-behaved models. Keeping a company private gives more control to produce better products without worrying about the market reaction so much.
@garrytan @HappenstanceAI cool but very low-hanging fruit and no moat, soon with @NagetInc you will do reasoning-intensive search on the entire web on any type of entity you want, including your personal data, and you can even rely on local compute
@ruslanjabari building reasoning-native web search from the ground up @NagetInc, the end goal is to build the world's first always-on proactive discovery agent that partially runs on local hardware and brings hyper-targeted content to you before you even know you need it
@zephyr_z9 That's what happens with simple API wrappers. We are building an actual search engine @NagetInc that will finally bring some real competition in the search market.
@TheSeaMouse You can have 100s of papers but have no clue how to do research that can help solve the right problems and build something useful. Just being good at publishing many papers only adds PR value to a company.
🤖➡️📉 Post-training made LLMs better at chat and reasoning—but worse at distributional alignment, diversity, and sometimes even steering(!)
We measure this with our new resource (Spectrum Suite) and introduce Spectrum Tuning (method) to bring them back into our models! 🌈
1/🧵
Excited to release new repo: nanochat!
(it's among the most unhinged I've written).
Unlike my earlier similar repo nanoGPT which only covered pretraining, nanochat is a minimal, from scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single, dependency-minimal codebase. You boot up a cloud GPU box, run a single script and as little as 4 hours later you can talk to your own LLM in a ChatGPT-like web UI.
It weighs ~8,000 lines of imo quite clean code to:
- Train the tokenizer using a new Rust implementation
- Pretrain a Transformer LLM on FineWeb, evaluate CORE score across a number of metrics
- Midtrain on user-assistant conversations from SmolTalk, multiple choice questions, tool use.
- SFT, evaluate the chat model on world knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), code (HumanEval)
- RL the model optionally on GSM8K with "GRPO"
- Run efficient inference on the model in an Engine with KV cache, simple prefill/decode, and tool use (Python interpreter in a lightweight sandbox); talk to it over CLI or a ChatGPT-like WebUI.
- Write a single markdown report card, summarizing and gamifying the whole thing.
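For readers unfamiliar with the RL step in the list above: the core idea usually associated with GRPO is to score each sampled completion relative to the other completions drawn for the same prompt. A minimal sketch of that group-relative advantage follows. It is not taken from the nanochat repo, which may well use a simplified variant (the quotes around "GRPO" hint at that). The function name, the `eps` value and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each completion's reward against its own sampling group.

    rewards: one scalar reward per completion sampled for the same prompt.
    eps: small constant (assumed value) that avoids division by zero when
         every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers to one math prompt:
# 1.0 means the final answer was correct, 0.0 means it was wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers end up with positive advantages and wrong ones with negative advantages. A group where every answer gets the same reward contributes no learning signal, because all advantages are zero.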
Even for as low as ~$100 in cost (~4 hours on an 8XH100 node), you can train a little ChatGPT clone that you can kind of talk to, and which can write stories/poems and answer simple questions. About ~12 hours of training surpasses the GPT-2 CORE metric. As you further scale up towards ~$1000 (~41.6 hours of training), it quickly becomes a lot more coherent and can solve simple math/code problems and take multiple choice tests. E.g. a depth 30 model trained for 24 hours (this is about equal to the FLOPs of GPT-3 Small 125M and 1/1000th of GPT-3) gets into the 40s on MMLU, 70s on ARC-Easy, 20s on GSM8K, etc.
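A quick sanity check on the two price points quoted above: both imply roughly the same hourly rate for the 8XH100 node. The sketch below only divides the approximate figures from the post. The function and variable names are made up for illustration, and actual cloud pricing varies by provider.

```python
def implied_hourly_rate(total_cost_usd, hours):
    """Hourly node price implied by a quoted (cost, duration) pair."""
    return total_cost_usd / hours


# Figures as quoted in the post; all of them are approximate ("~").
speedrun_rate = implied_hourly_rate(100, 4)    # ~$100 for ~4 hours
scaled_rate = implied_hourly_rate(1000, 41.6)  # ~$1000 for ~41.6 hours

# Per-GPU rate, assuming the cost is split evenly across the node's 8 GPUs.
per_gpu_rate = scaled_rate / 8
```

Both pairs land at about $24 to $25 per node-hour, or about $3 per GPU-hour, so the two quoted tiers are consistent with each other.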
My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it. It is by no means finished, tuned or optimized (actually I think there's likely quite a bit of low-hanging fruit), but I think it's at a place where the overall skeleton is ok enough that it can go up on GitHub where all the parts of it can be improved.
Link to repo and a detailed walkthrough of the nanochat speedrun is in the reply.
@YichuanM @lateinteraction It's definitely not frequent, but at very large scale even a monthly update can be costly. HNSW also suffers from the same problem. At small scale, on slowly growing collections and on academic benchmarks, these obviously don't matter.
@lateinteraction @YichuanM With neural sparse representations, adding items doesn't have a big impact on recall; it mostly affects efficiency. Dense single- or multi-vector representations need the ANN structure rebuilt periodically, otherwise recall drops if e.g. the centroids are not recomputed.
@YichuanM @lateinteraction You can do incremental updates, but after a certain point you need to redo the clustering for the centroids, or whatever structure the ANN index is using. With sparse you can continue adding to the index without periodically rebuilding it.
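To make the "keep adding without rebuilding" point concrete, here is a toy inverted index over sparse term weights. It is a sketch under stated assumptions: the `SparseIndex` class, the document IDs and the term weights are all hypothetical, and scoring is an exhaustive dot product over posting lists rather than the pruned traversal a production engine would use.

```python
from collections import defaultdict


class SparseIndex:
    """Toy inverted index: each term maps to a posting list of (doc_id, weight)."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, doc_id, term_weights):
        # Adding a document only appends to the posting lists of its own terms.
        # There is no global structure (centroids, graph) that needs recomputing.
        for term, weight in term_weights.items():
            self.postings[term].append((doc_id, weight))

    def search(self, query_weights, k=3):
        # Exact sparse dot product, accumulated term by term.
        scores = defaultdict(float)
        for term, q_weight in query_weights.items():
            for doc_id, d_weight in self.postings.get(term, []):
                scores[doc_id] += q_weight * d_weight
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


index = SparseIndex()
index.add("d1", {"sparse": 1.2, "retrieval": 0.8})
index.add("d2", {"dense": 1.0, "retrieval": 0.5})
query = {"retrieval": 1.0, "sparse": 0.5}
before = index.search(query)

# A later insert is immediately searchable with exact scores, no rebuild step.
index.add("d3", {"sparse": 2.0, "retrieval": 1.0})
after = index.search(query)
```

Because scoring is exact over the posting lists, a newly added document is ranked correctly right away. What grows with the collection is posting-list length, which is an efficiency cost rather than a recall cost.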
@lateinteraction @Julian_a42f9a @orionweller Once I scaled the single embedding size to be similar to ColBERT multi-vector for the given expected document lengths, I didn't see much difference in practice. sqrt(N) optimization sounds intriguing though. Which paper is that? If the impact on retrieval perf is small it would be cool.
@Julian_a42f9a @orionweller I think it's all about the information capacity of the representations. To some extent, capacity can be offloaded to a learned scoring func, and the upper bound would still be the cross-attention model. Scaling reps with doc length is important, but def not token-level ColBERT-style.
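For context on the two scoring schemes being compared in this thread, here is a minimal sketch of single-vector dot-product scoring next to ColBERT-style late interaction (MaxSim). The toy 2-dimensional vectors are invented for illustration, and real systems would use learned, typically normalized, embeddings.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def single_vector_score(query_vec, doc_vec):
    # One embedding per query and per document: storage is fixed at `dim`
    # floats per document, regardless of document length.
    return dot(query_vec, doc_vec)


def maxsim_score(query_tokens, doc_tokens):
    # Late interaction: each query token picks its best-matching document
    # token, and the per-token maxima are summed. Storage grows with the
    # number of document tokens (len(doc_tokens) * dim floats).
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Hypothetical toy embeddings: two query tokens, three document tokens.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

late_interaction = maxsim_score(query_tokens, doc_tokens)
single = single_vector_score([0.5, 0.5], [0.5, 0.5])
```

The capacity argument in the thread is about the storage side of this comparison: the multi-vector representation scales with document length by construction, while a single vector only does so if its dimensionality is increased for longer documents.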