gautham @capemox - Twitter Profile

Pinned Tweet

about 1 month ago

Pretty big release of bleve out! GPU accelerated vector search (I helped on this hehe), binary quantization and a lot more! https://t.co/VtcHZ2Ql0D

gautham

@capemox

about 3 hours ago

@derangineer @Robro612 I think a paper reading club would be amazing, I would definitely join!

gautham

@capemox

about 14 hours ago

@LakshyAAAgrawal Thank you! Big fan of your work btw

gautham

@capemox

about 21 hours ago

This is super cool! This could be a scalable way to automate hard negative generation for training embedding models as well

Lakshya A Agrawal

@LakshyAAAgrawal

1 day ago

Excited to see the use of GEPA-optimized LLM judges for data filtering in MAI-Thinking-1 model's pre-training pipeline!

gautham

@capemox

about 17 hours ago

@Robro612 @bclavie @mixedbreadai woah I just noticed you joined mxbai, congrats!

gautham

@capemox

about 18 hours ago

IR research imo tends to limit itself specifically behind index infra: you need vectors, and they're dense, sparse, or (more recently) multivec. You try to conform your new ideas to fall within these camps. But this doesn't have to be the case! The future is probably gonna look really different. If the method is good enough, the retrieval systems will build themselves around it. Late interaction crossed that hill: only last year there were thousands of articles about how "slow" it was for first stage retrieval, and now we have ColBERT models and fast indexes popping up everywhere

Ben Clavié

@bclavie

about 18 hours ago

I have a deeper note to make about this: we need to rethink how we approach retrieval research if we want to have an LLM moment. I think a problem we have as a sub-field is a lack of openness to early research that might be paving the way to what comes next, even if it's not all that good yet. Let me explain:
You might've noticed that we wrote both paper and blog in a way that almost doesn't care that the results are impressively good (essentially sparse SOTA for ~BERT-sized models). The reason for this is that I truly don't think that the retrieval performance matters here, beyond proving that the method contains signal.
There has been a lot of progress in (applied) retrieval world and embeddings in the last few years, but one thing is still pretty apparent: we understand very little about how things work, and why they work. We've developed better methods, but they are largely a result of more compute + more refined pipelines + more training, etc. It works, but it's brute-forced, and the results are improving but not revolutionising the world.
Between 2020 and now, our understanding of what makes a retrieval model "good" has progressed, but not to the extent that we know what *makes* it work. ColBERT's maxsim operator, perhaps unbeknownst to @lateinteraction at the time, is still one of the most informative tools, because it shows us what's possible when we go beyond expressivity-limited scoring operators, even if it's still incredibly naive.
One thing that I'm very proud of is that at @mixedbreadai, we made a bet that the way representations are expressed almost matters less than how they are used, which has justified a lot of our (very time-consuming) engineering and research decisions, but I think it's the right decision to actually understand how neural representations can lead to better retrieval. We've pushed late interaction pretty far, and we are very much working on the next steps of late interaction, one discovery at a time.
Information retrieval is a family of tools. Single-vector models, multi-vector models, SPLADE, etc... are just some of the tools in our toolkit. Making them iteratively better is not, IMO, how we get to the end goal. Understanding *what* makes a given method better and going all-in on figuring out what its representations can tell us about training and representation dynamics is, I think, the right way forward.
Back to my original point: we need to encourage more out-there, kooky ideas that are currently borderline useless but show great promise towards the future! One of my problems with some of the formal review cycles is that far too much importance is placed on what, I feel, should be an entirely separate, more engineering-focused paper: does this run in XXms? How does it perform on TREC-DL? What's the index size? These are valid questions. But they shouldn't be asked of work that is exploring concepts. To me, it feels like rejecting Attention is All You Need because it's pointless to rely on a quadratic method to convey information across tokens, it'll never scale.
I want to read more exploratory work whose limitation is basically "okay yeah, we can't really deploy this, but there is undeniably something going on here. It might take 3 more papers, but we need to understand it. We can make it scale later." I want more of these to spark discussions at conferences about the why, with the how staying at the conceptual level -- "how" at production-scale can come later. Honestly, I even want to read papers that aren't quite sure why something works, but that have some informed opinions about it and want to show that it does.
When writing the SAE+BM25 paper, I actually had started a whole section on efficiency, making the model more or less sparse, how it impacts performance and vocab distribution, etc... Then I decided to take it all out: pages are limited, and that's not the message I want the reader to get out of the paper.
The message is that these indexable sparse structures emerge from dense models. That's incredibly cool. It opens up the door to dozens of follow-up studies. Maybe it's one of the early signals that will lead to a major breakthrough in a paper or ten. An efficiency study would be spending time and space on things that are worth studying, but are an entirely orthogonal point that should be made separately.
I have the freedom to do this because I have the incredible luxury of working at a very strongly-minded industry lab. We get daily feedback from users and know what matters and what doesn't, and if the paper gets rejected, it will have absolutely no impact on my work or career. Many, many extremely talented researchers don't have that luxury, and spend precious human and GPU-weeks on optimising for the wrong problems, too early. And optimising for efficiency before discovering the true performance potential is one of the best ways to miss the big discovery in exchange for a smaller but more guaranteed payoff.
In LLM world, the massive breakthroughs came from similar freedom from worries. Scaling to GPT-2 made absolutely no sense from a publishable unit or GPU-rentability point of view, but it paved the way to understanding the generative potentials of transformers.
I think retrieval is key to the knowledge economy we're going to live in. The smartest agents will need knowledge from the world, no matter how genius they are. I'm very happy that we're doing this work, but I'd love to see many more people have the ability to join this gigantic effort. We'd benefit immensely as a field from supporting and celebrating exploratory research that lets us develop the new generation of tools that will power this agentic knowledge era.

capemox retweeted

Ben Clavié

@bclavie

about 18 hours ago

@capemox Exactly this. I really don’t think the availability of infra should have any bearing whatsoever on whether an idea makes it into the world. We need to encourage papers whose infra is in the future

gautham

@capemox

about 18 hours ago

Imagine if MoEs didn't happen because they're annoying to train and build infra around lmao throw deep

gautham

@capemox

about 21 hours ago

can definitely be used for filtering datasets for unsupervised contrastive pretraining, I'm gonna try this lol

gautham

@capemox

1 day ago

@lebrechts @soldni @kylelostat @HannaHajishirzi yall are very cool

gautham

@capemox

1 day ago

A good thing coming out of Ai2 researchers going to MS

elie

@eliebakouch

1 day ago

WOW microsoft new "MAI Thinking 1" model comes with a 109 page tech report that looks REALLY detailed, this is amazing

gautham

@capemox

1 day ago

@aaxsh18 vagueposting pro max

gautham

@capemox

1 day ago

I think a good RL rollout with great retrievers (not just grep) could help make a big dent in OBLIQ

Jasper Lu

@lu__jasper

2 days ago

Getting back around to this. OBLIQ is a really interesting benchmark, and feels like the right one for this space. It's almost gratuitously hard, but seems pretty well-aligned with interesting agent observability problems. Saturation on this set would probably solve a lot of more common real-world use cases along the way.

gautham

@capemox

1 day ago

@BenjDicken https://t.co/q2SOkpCrS2 This paper should give you a good idea of how attention helps for RNNs. It's a good precursor to the transformer paper. You can follow the citations backwards if you want more of an idea of how things were done pre-attention

gautham

@capemox

1 day ago

Open invitation for gpus btw, I'll open source everything

gautham

@capemox

1 day ago

@andreer @raphaelsrty I'd love to, but it depends on how much compute I have :/. The number of possible downstream models (multivec, dense, sparse) and domains is pretty large. Since this is mostly self-funded, I can't really do too much

gautham

@capemox

7 days ago

Update: pretrained the ettin 17m and 32m on 10 million of the DenseOn corpus. I specifically chose the QA & retrieval subsets. For 17m, got pretty mixed results for some BEIR benchmarks. It's an overall win, but some datasets suffered.

gautham

@capemox

2 days ago

@SilvioMartinico Thanks so much, this is definitely needed

gautham

@capemox

2 days ago

@andreer @raphaelsrty I did end up creating better checkpoints! I made a post about it: https://t.co/WWf1Zt8UcS

gautham

@capemox

4 days ago

New ettin-32m and ettin-17m pretrained models are out! These are much better starting points for embedding tasks. Some more details in the thread:

gautham

@capemox

3 days ago

@antoine_chaffin Will put up details on a blog soon, but it's pretty much what y'all do :P

gautham

@capemox

4 days ago

New ettin-32m and ettin-17m pretrained models are out! These are much better starting points for embedding tasks. Some more details in the thread:

gautham

@capemox

4 days ago

@LightOnIO Here’s the models and datasets on hf: https://t.co/Mtq9PHe1mB https://t.co/Zmqnl7Vn77 https://t.co/CoFrwqWu44

gautham

@capemox

4 days ago

@LightOnIO Stage 2 (S2) fine-tuning was done on the https://t.co/lsIqnAlLdh dataset. I decided to go with this because it’s a high quality dataset despite being small, and so wouldn’t be a pain to train.
