Anas Ait aomar

@anas_aito

I tweet #Visual materials about #Data, #NLP, and #Graphs.

Joined March 2019

582 Following

78 Followers

153 Posts

Pinned Tweet

Anas Ait aomar @anas_aito

over 4 years ago

Hello Community! This is my first #thread, and I want to dedicate it to everyone interested in #Data Science and #ML, or who is self-taught like me. I want to share with you some of the #mental models that guided my learning journey. 1/N

anas_aito retweeted

Ben Clavié

@bclavie

23 days ago

I have a deeper note to make about this: we need to rethink how we approach retrieval research if we want to have an LLM moment. I think a problem we have as a sub-field is a lack of openness to early research that might be paving the way to what comes next, even if it's not all that good yet. Let me explain:

You might've noticed that we wrote both paper & blog in a way that almost doesn't care that the results are impressively good (essentially sparse SOTA for ~BERT-sized models). The reason for this is that I truly don't think that the retrieval performance matters here, beyond proving that the method contains signal.

There has been a lot of progress in the (applied) retrieval world and embeddings in the last few years, but one thing is still pretty apparent: we understand very little about how things work, and why they work. We've developed better methods, but they are largely a result of more compute + more refined pipelines + more training, etc. It works, but it's brute-forced, and the results are improving but not revolutionising the world.

Between 2020 and now, our understanding of what makes a retrieval model "good" has progressed, but not to the extent that we know what *makes* it work. ColBERT's maxsim operator, perhaps unbeknownst to @lateinteraction at the time, is still one of the most informative tools, because it shows us what's possible when we go beyond expressivity-limited scoring operators, even if it's still incredibly naive.

One thing that I'm very proud of is that at @mixedbreadai, we made a bet that the way representations are expressed almost matters less than how they are used, which has justified a lot of our (very time-consuming) engineering and research decisions, but I think it's the right decision to actually understand how neural representations can lead to better retrieval. We've pushed late interaction pretty far, and we are very much working on the next steps of late interaction, one discovery at a time.

Information retrieval is a family of tools. Single-vector models, multi-vector models, SPLADE, etc... are just some of the tools in our toolkit. Making them iteratively better is not, IMO, how we get to the end goal. Understanding *what* makes a given method better and going all-in on figuring out what its representations can tell us about training and representation dynamics is, I think, the right way forward.

Back to my original point: we need to encourage more out-there, kooky ideas that are currently borderline useless but show great promise towards the future! One of my problems with some of the formal review cycles is that far too much importance is placed on what, I feel, should be an entirely separate, more engineering-focused paper: does this run in XXms? How does it perform on TREC-DL? What's the index size? These are valid questions. But they shouldn't be asked of work that is exploring concepts. To me, it feels like rejecting Attention is All You Need because it's pointless to rely on a quadratic method to convey information across tokens, it'll never scale.

I want to read more exploratory work whose limitation is basically "okay yeah, we can't really deploy this, but there is undeniably something going on here. It might take 3 more papers, but we need to understand it. We can make it scale later." I want more of these to spark discussions at conferences about the why, with the how staying at the conceptual level -- "how" at production-scale can come later. Honestly, I even want to read papers that aren't quite sure why something works, but that have some informed opinions about it and want to show that it does.

When writing the SAE+BM25 paper, I actually had started a whole section on efficiency, making the model more or less sparse, how it impacts performance and vocab distribution, etc... Then I decided to take it all out: pages are limited, and that's not the message I want the reader to get out of the paper. The message is that these indexable sparse structures emerge from dense models. That's incredibly cool. It opens up the door to dozens of follow-up studies. Maybe it's one of the early signals that will lead to a major breakthrough in a paper or ten. An efficiency study would be spending time and space on things that are worth studying, but are an entirely orthogonal point that should be made separately.

I have the freedom to do this because I have the incredible luxury of working at a very strongly-minded industry lab. We get daily feedback from users and know what matters and what doesn't, and if the paper gets rejected, it will have absolutely no impact on my work or career. Many, many extremely talented researchers don't have that luxury, and spend precious human and GPU-weeks on optimising for the wrong problems, too early. And optimising for efficiency before discovering the true performance potential is one of the best ways to miss the big discovery in exchange for a smaller but more guaranteed payoff.

In the LLM world, the massive breakthroughs came from similar freedom from worries. Scaling to GPT-2 made absolutely no sense from a publishable unit or GPU-rentability point of view, but it paved the way to understanding the generative potential of transformers.

I think retrieval is key to the knowledge economy we're going to live in. The smartest agents will need knowledge from the world, no matter how genius they are. I'm very happy that we're doing this work, but I'd love to see many more people have the ability to join this gigantic effort. We'd benefit immensely as a field from supporting and celebrating exploratory research that lets us develop the new generation of tools that will power this agentic knowledge era.
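
For readers unfamiliar with the maxsim operator mentioned in the thread above, here is a minimal sketch of the general idea: each query token embedding is matched to its best-scoring document token embedding, and those maxima are summed. The function names and the toy 2-d vectors are made up for illustration; this is not code from ColBERT or from Mixedbread.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for every query token, keep only its best
    match among the document tokens, then sum those best matches."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (hypothetical values, one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # has a good match for both query tokens
doc_b = [[1.0, 0.0]]              # only matches the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

A single-vector model would compress each side to one embedding before scoring; here the comparison stays at the token level, which is the extra expressivity the thread refers to.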

124

15K

anas_aito retweeted

Anas Moujahid

@Anas__Moujahid

23 days ago

Credit to @garrytan for putting numbers on a public bench and open-sourcing the eval. This was the trigger for us to evaluate Grove (our ai agents' proprietary brain) against the same standard.

Grove led LongMemEval at 99.36% R@5 and 99.79% R@10 (n=470, LongMemEval's own scorer unmodified).

Grove is a typed knowledge graph designed for AI agents, combining bitemporal state, identity resolution, and source-pinned provenance. Memory is part of its capabilities as it continuously ingests new activity from the systems where the work actually happens, fades stale facts, and refutes those that have been superseded, which makes it self-evolving and self-healing by design.

LongMemEval tests one slice of that: flat retrieval over chat history. The capabilities Grove was built to provide aren't on this bench yet:
– Multi-hop graph traversal across typed entities: which goals depend on which decisions
– Bitemporal state: what was true on a date, what has since been refuted
– Identity resolution: aliases ("the CTO", "Bob", "Martinez Bob") to one canonical node with a confidence score
– Provenance: every fact traces to a message id, commit hash, or meeting note

More about our methodology, comparator architectures, cost analysis here: https://t.co/CWHnaIzVAT

695

Anas Ait aomar @anas_aito

3 months ago

@marouane53 + both RaBitQ and TurboQuant are different flavours of the JL transform, a cheap dimensionality reduction that everyone is using. Idk why they did not mention the similarity. Simply rotate using a random basis and you get guarantees on distance preservation.
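
The distance-preservation claim in the reply above can be checked numerically. The sketch below is not RaBitQ or TurboQuant; it swaps in the textbook Gaussian random projection usually used to illustrate the Johnson-Lindenstrauss lemma, and the dimensions, seeds, and function names are arbitrary choices made for illustration.

```python
import math
import random


def random_projection_matrix(d_in, d_out, seed=0):
    """Gaussian JL matrix with entries drawn from N(0, 1/d_out), so squared
    lengths are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)]
            for _ in range(d_out)]


def project(matrix, x):
    """Multiply the projection matrix by the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Assumed toy setup: 5 random points in 512 dimensions, projected to 128.
rng = random.Random(1)
d_in, d_out = 512, 128
points = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(5)]

P = random_projection_matrix(d_in, d_out)
projected = [project(P, p) for p in points]

# Ratio of projected distance to original distance for every pair of points.
ratios = [dist(projected[i], projected[j]) / dist(points[i], points[j])
          for i in range(len(points)) for j in range(i + 1, len(points))]
print([round(r, 3) for r in ratios])
```

The ratios should cluster around 1, and the spread shrinks as `d_out` grows. A square orthogonal rotation would preserve distances exactly; the approximation only appears once the dimension is reduced or the coordinates are quantized.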

362


anas_aito retweeted

General Reasoning

@GenReasoning

3 months ago

Introducing OpenReward.

🌍 330+ RL environments through one API
⚡ Autoscaled sandbox compute
🍒 4.5M+ unique RL tasks
🚂 Works like magic with Tinker, Miles, Slime

Link and thread below. https://t.co/4fIlVKUkOF

192

244K

anas_aito retweeted

Daniel Hnyk @hnykda

3 months ago

LiteLLM HAS BEEN COMPROMISED, DO NOT UPDATE. We just discovered that LiteLLM PyPI release 1.82.8 has been compromised: it contains litellm_init.pth with base64-encoded instructions to send all the credentials it can find to a remote server and self-replicate. Link below.

307

anas_aito retweeted

Thibault Formal

@thibault_formal

3 months ago

New sparse retrieval model: introducing SPLARE, which extends SPLADE by replacing the vocabulary head with pretrained SAEs! paper: https://t.co/Un2zhX14KR (ICLR'26) also how we won the WSDM'26 Cup on multilingual retrieval: https://t.co/77QlgZsnls (model weights coming soon!)

anas_aito retweeted

Anas Moujahid

@Anas__Moujahid

4 months ago

thanks for sharing @henloitsjoyce! ccx-v2 is coming with memory layer, more security, and more autonomy! :))

anas_aito retweeted

Mixedbread @mixedbreadai

4 months ago

Introducing Mixedbread Wholembed v3, our new SOTA retrieval model across all modalities and 100+ languages.

Wholembed v3 brings best-in-class search to text, audio, images, PDFs, videos...

You can now get the best retrieval performance on your data, no matter its format. https://t.co/PYT3Ryerxm

949

119

757

203K

Anas Ait aomar @anas_aito

4 months ago

@matospiso Nice trick! Does this inherit the guarantees of the JL lemma, or is it something totally different? I could not connect it to JL beyond the random basis of sampled anchors. Other than that, really nice idea!

anas_aito retweeted

tokenbender

@tokenbender

10 months ago

have had many questions in DMs about starting out -> training/RL your own models. there are gaps even with the presence of nanogpt and post training notebooks out in the open. would appreciate if you can let me know of more. would do a write up to bridge the understanding.

312

367

42K

Anas Ait aomar @anas_aito

10 months ago

@marouane53 Only the paranoid survive

230

anas_aito retweeted

Hynek Kydlíček @HKydlicek

10 months ago

We are releasing 📄 FinePDFs:
the largest PDF dataset spanning over half a billion documents!

- Long context: Documents are 2x longer than web text
- 3T tokens from high-demand domains like legal and science.
- Heavily improves over SoTA when mixed with FW-EDU&DCLM web corpora. https://t.co/Ikyh2lx6NF

711

113

408

200K

anas_aito retweeted

Joseph Suarez 🐡

@jsuarez

10 months ago

https://t.co/mY5lwKvKD0

681

948

126K

anas_aito retweeted

Jack Morris

@jxmnop

10 months ago

first i thought scaling laws originated in OpenAI (2020)

then i thought they came from Baidu (2017)

now i am enlightened:
Scaling Laws were first explored at Bell Labs (1993) https://t.co/CAZPgrxGCX

165

778

295K

Anas Ait aomar @anas_aito

12 months ago

@marouane53 Video search is a hard problem! Congrats on the product, it looks very cool.

920

anas_aito retweeted

Jimmy Lin

@lintool

12 months ago

It’s been ~4 weeks since we launched @yupp_ai – a consumer-first approach to robust & trustworthy AI evaluation. We’re still early but have already gathered 2M+ high-quality human preference feedback datapoints on 500+ models across diverse use cases. 🧵 https://t.co/jmJK4lKJcl

33K

Anas Ait aomar @anas_aito

about 1 year ago

@doesdatmaksense @philipvollet And coverage guarantees!

210

Anas Ait aomar @anas_aito

about 1 year ago

@excalidraw Palette maybe ?

Anas Ait aomar @anas_aito

about 1 year ago

@marouane53 It was a reply that the media distributed the hell out of!

293

anas_aito retweeted

Ai2 @allen_ai

over 1 year ago

Can AI really help with literature reviews? 🧐

Meet Ai2 ScholarQA, an experimental solution that allows you to ask questions that require multiple scientific papers to answer. It gives more in-depth, detailed, and contextual answers with table comparisons, expandable sections for subtopics, and citations with paper excerpts for verification 💡

Try Ai2 ScholarQA now: https://t.co/Av9Pp3uliK

More in 🧵

219

133

42K
