Chris Samarinas

@CSamarinas

CS PhD at CIIR @manningcics, founder of @NagetInc. Researcher in NLP & Information Retrieval. Search, nuggets, search.

United States

Joined March 2020

1.1K Following

384 Followers

219 Posts

Pinned Tweet

Chris Samarinas @CSamarinas

about 1 year ago

📢 New paper on scaling test-time compute for document re-ranking. Do you want to know how to train compact 2-3B models that can reach the performance of 70B+ LLMs in reasoning-intensive ranking? 📄 Check out the distillation + RL recipe in our paper: https://t.co/uGnLdOW0cw

Chris Samarinas @CSamarinas

18 days ago

@matospiso I don't get the hype about multi-vector "late interaction". Dense embeddings have a capacity bottleneck. Sparse models like SPLADE are pretty good when trained well. Why bother going from multi-vector back to sparse just to make it scalable? Just use sparse from the beginning.

105

Chris Samarinas @CSamarinas

19 days ago

@TheIshanGoswami @NagetInc will eat Exa alive, Ishan. At least you'd better figure out how to drop your per-query cost haha; companies that follow the old playbook don't have a future in this era.

Chris Samarinas @CSamarinas

about 2 months ago

This baby is crawling 2 billion pages per month and hitting 1,400 tokens/s. The room stays at 30+°C (86°F) with a signature Founder Mode scent of ionized ozone and 'Eau de Silicon'. #WebSearch #LLMs #GPUPoor #Naget

137

Chris Samarinas @CSamarinas

3 months ago

@ExaAILabs You guys need to move fast or sell out before it's too late, because @NagetInc will eat Exa for breakfast soon :)

Chris Samarinas @CSamarinas

4 months ago

@tomaarsen @ExaAILabs no they won't, but stay tuned for @NagetInc, we will open source everything including our whole billion-scale index to run on your own hardware

Chris Samarinas @CSamarinas

4 months ago

@levelsio You have a point, but that's how you get worse, censored, well-behaved models. Keeping a company private gives more control to produce better products without worrying so much about the market reaction.

Chris Samarinas @CSamarinas

4 months ago

@garrytan @HappenstanceAI cool, but very low-hanging fruit and no moat. Soon with @NagetInc you will do reasoning-intensive search on the entire web on any type of entity you want, including your personal data, and you can even rely on local compute.

Chris Samarinas @CSamarinas

5 months ago

@ruslanjabari building reasoning-native web search from the ground up @NagetInc. The end goal is to build the world's first always-on proactive discovery agent that partially runs on local hardware and brings hyper-targeted content to you before you even know you need it.

Chris Samarinas @CSamarinas

5 months ago

@vkhosla the timeline for these is very off, add a few more decades

113

Chris Samarinas @CSamarinas

7 months ago

@zephyr_z9 That's what happens with simple API wrappers. We are building an actual search engine @NagetInc that will finally bring some real competition in the search market.

Chris Samarinas @CSamarinas

8 months ago

@TheSeaMouse You can have 100s of papers but have no clue how to do research that can help solve the right problems and build something useful. Just being good at publishing many papers only adds PR value to a company.

749

CSamarinas retweeted

Taylor Sorensen @ma_tay_

8 months ago

🤖➡️📉 Post-training made LLMs better at chat and reasoning—but worse at distributional alignment, diversity, and sometimes even steering(!) We measure this with our new resource (Spectrum Suite) and introduce Spectrum Tuning (method) to bring them back into our models! 🌈 1/🧵

197

136

68K

CSamarinas retweeted

Andrej Karpathy

@karpathy

8 months ago

Excited to release new repo: nanochat! (it's among the most unhinged I've written).

Unlike my earlier similar repo nanoGPT which only covered pretraining, nanochat is a minimal, from scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single, dependency-minimal codebase. You boot up a cloud GPU box, run a single script and in as little as 4 hours later you can talk to your own LLM in a ChatGPT-like web UI.

It weighs ~8,000 lines of imo quite clean code to:

- Train the tokenizer using a new Rust implementation
- Pretrain a Transformer LLM on FineWeb, evaluate CORE score across a number of metrics
- Midtrain on user-assistant conversations from SmolTalk, multiple choice questions, tool use.
- SFT, evaluate the chat model on world knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), code (HumanEval)
- RL the model optionally on GSM8K with "GRPO"
- Efficient inference the model in an Engine with KV cache, simple prefill/decode, tool use (Python interpreter in a lightweight sandbox), talk to it over CLI or ChatGPT-like WebUI.
- Write a single markdown report card, summarizing and gamifying the whole thing.

Even for as low as ~$100 in cost (~4 hours on an 8XH100 node), you can train a little ChatGPT clone that you can kind of talk to, and which can write stories/poems, answer simple questions. About ~12 hours surpasses GPT-2 CORE metric. As you further scale up towards ~$1000 (~41.6 hours of training), it quickly becomes a lot more coherent and can solve simple math/code problems and take multiple choice tests. E.g. a depth 30 model trained for 24 hours (this is about equal to FLOPs of GPT-3 Small 125M and 1/1000th of GPT-3) gets into 40s on MMLU and 70s on ARC-Easy, 20s on GSM8K, etc.

My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it. It is by no means finished, tuned or optimized (actually I think there's likely quite a bit of low-hanging fruit), but I think it's at a place where the overall skeleton is ok enough that it can go up on GitHub where all the parts of it can be improved.

Link to repo and a detailed walkthrough of the nanochat speedrun is in the reply.

684

24K

18K

Chris Samarinas @CSamarinas

8 months ago

@YichuanM @lateinteraction It's definitely not frequent, but at very large scale even a monthly update can be costly. HNSW also suffers from the same problem. For small or slowly growing collections and academic benchmarks, these obviously don't matter.

Chris Samarinas @CSamarinas

8 months ago

@lateinteraction @YichuanM With neural sparse representations, adding items mostly affects efficiency and has little impact on recall. Dense single- or multi-vector representations need the ANN structure rebuilt periodically; otherwise recall drops if, e.g., the centroids are not recomputed.

Chris Samarinas @CSamarinas

8 months ago

@YichuanM @lateinteraction You can do incremental updates, but after a certain point you need to redo the clustering for the centroids, or whatever structure the ANN index is using. With sparse you can continue adding to the index without periodically rebuilding it.
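
Editor's sketch of the point made in this thread about sparse indexes: in an inverted index, adding a document only appends postings, so there is no clustering or graph structure to recompute. The `SparseInvertedIndex` class, its method names, and the example weights are hypothetical and for illustration only; this is not the implementation used by SPLADE, Naget, or any particular library, and real systems add compression and pruning on top.

```python
from collections import defaultdict


class SparseInvertedIndex:
    """Toy inverted index over sparse term-weight vectors (illustrative only)."""

    def __init__(self):
        # Maps each term to a list of (doc_id, weight) postings.
        self.postings = defaultdict(list)

    def add(self, doc_id, term_weights):
        # Appending postings is the entire update step; nothing is retrained.
        for term, weight in term_weights.items():
            self.postings[term].append((doc_id, weight))

    def search(self, query_weights, k=3):
        # Exact dot-product scoring over the postings of the query terms.
        scores = defaultdict(float)
        for term, q_weight in query_weights.items():
            for doc_id, d_weight in self.postings.get(term, []):
                scores[doc_id] += q_weight * d_weight
        # Sort by descending score, breaking ties by doc_id.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


index = SparseInvertedIndex()
index.add("d1", {"sparse": 1.0, "retrieval": 0.5})
index.add("d2", {"dense": 1.2, "retrieval": 0.4})
before = index.search({"retrieval": 1.0})

# A later addition is searchable immediately, and scoring stays exact.
index.add("d3", {"retrieval": 0.9})
after = index.search({"retrieval": 1.0})
```

Because scoring here is exact over the postings, a newly added document cannot lose recall due to stale centroids; the cost of growth shows up as longer posting lists, which is the efficiency impact mentioned in the thread.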

Chris Samarinas @CSamarinas

9 months ago

@lateinteraction Open web search is coming, we are working on this :)

Chris Samarinas @CSamarinas

9 months ago

@lateinteraction @Julian_a42f9a @orionweller Once I started scaling the single embedding size to be similar to ColBERT multi-vector for given expected document lengths, I didn't see much difference in practice. The sqrt(N) optimization sounds intriguing though. Which paper is that? If the impact on retrieval performance is small, it would be cool.

128

Chris Samarinas @CSamarinas

9 months ago

@Julian_a42f9a @orionweller I think it's all about the information capacity of the representations. To some extent, capacity can be offloaded to a learned scoring function, and the upper bound would still be the cross-attention model. Scaling representations with document length is important, but definitely not token-level ColBERT-style.
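
Editor's sketch for readers unfamiliar with the two scoring styles contrasted in this thread: a single-vector model scores with one dot product, while ColBERT-style late interaction keeps one vector per token and sums, for each query token, its best match among the document tokens (MaxSim). The function names and the tiny 2-dimensional vectors are made up for illustration; this is a minimal sketch, not any library's API.

```python
def dot(u, v):
    # Plain dot product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))


def single_vector_score(query_vec, doc_vec):
    # One vector per query and per document: a single dot product.
    return dot(query_vec, doc_vec)


def maxsim_score(query_tokens, doc_tokens):
    # Late interaction: each query token takes its best-matching
    # document token, and the per-token maxima are summed.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]

late_interaction = maxsim_score(query_tokens, doc_tokens)
single = single_vector_score([1.0, 1.0], [1.5, 2.5])
```

In the multi-vector case the stored representation grows with the number of document tokens, which is the storage and scalability cost the thread weighs against simply using a larger single embedding or a sparse representation.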
