Matt Weber

Engineer at @turbopuffer, Lucene committer

10 months ago

Talked to a VP of Engineering yesterday and his number one concern was lack of access to search engineering talent (US).

mrweber retweeted

Adrien Grand @jpountz

11 months ago

I spent some time looking at the Vespa source code to see how it compares with Lucene https://t.co/VsGV31qPuL

mrweber retweeted

11 months ago

Elastic’s license rug pull, meant to hurt Amazon, ended up making OpenSearch the default. Most orgs I talk to run OpenSearch over Elasticsearch. Probably the biggest strategic misfire in the software industry. Also a perfect example of why the freedom to fork matters.

Matt Weber @mrweber

11 months ago

@jobergum @jpountz Wow, join Meta? 😂

mrweber retweeted

11 months ago

New Vespa features covered in the June newsletter:
- Layered ranking: Rank chunks in documents.
- Elementwise bm25
- top, filter_subspaces, and cell_order tensor functions
- chunking support in indexing
- element-gap: Proximity over chunks
- filtering in grouping results
- allowDropAll in weakAnd
- relevance eval support in PyVespa
- Support for private HuggingFace models
- Azure zones general availability
- Choose query tokenization, composition and syntax separately
- Give query tokenization control to linguistics
- Multiple tokens support in Lucene linguistics
- Detection confidence in language detection
- initial-inflight-factor: Faster feed speed ramp-up
- Vectorized int8 instructions
- Hex tensor rendering option
- Case sensitive matching option
- prioritize-availability option for query routing over groups
- equiv query items can now be nested inside near and onear

... are we shipaholics? 😬

mrweber retweeted

Radu Gheorghe

@radu0gheorghe

11 months ago

June @vespaengine newsletter is out! Lots of cool new stuff (e.g. built-in chunking) and educational content (e.g. demo e-commerce apps with new ideas). Check it out and let us know if you have any feedback: https://t.co/qtl55FzQir

mrweber retweeted

11 months ago

Until now you've either had to
- index document chunks as separate documents, creating a billion documents with no context, or
- index entire documents with many chunks, preserving context but feeding too much noise to LLMs.

mrweber retweeted

Adrien Grand @jpountz

over 1 year ago

The change has been merged and nightly benchmarks just caught up: https://t.co/kFAEU59BcY. I don't remember many improvements of this magnitude to non-trivial workloads in Lucene's history.

mrweber retweeted

over 1 year ago

🚀 The Rise of Vision RAG! Launching a complete RAG app that you can deploy to production in minutes!
- Hybrid fusion of ColPali + BM25 with @vespaengine
- Gemini 1.5 Flash-8B
- FastHTML frontend
- Runs on Huggingface Spaces

Interpretable SERP with snippets + patch highlights! RAG with ColPali doesn't need to be sluggish. Huge s/o to the team that built it @thomas_thoresen @andreer @ldalves

mrweber retweeted

over 1 year ago

Let's be honest for a second: Building great retrieval systems is about people and knowledge (at least for a few years until AGI replaces us all, and we can retreat to our woodworking shops). https://t.co/WdCY0K2SPk

mrweber retweeted

over 1 year ago

If you have worked in search, you know how freaking hard it is to even get started with something close to this using traditional methods. Now, you can zero-shot it. https://t.co/EwkOFcbhIm

mrweber retweeted

over 1 year ago

Comparing Elasticsearch with Vespa. Complex workloads. It should be a good read.

mrweber retweeted

over 1 year ago

Announcing global significance in Vespa: https://t.co/BbIzDA4WhX

mrweber retweeted

over 1 year ago

I'm excited about this demo cookbook that drops soon.

ColQwen2 + @vespaengine + FastHTML

Hybrid retrieval and ranking that scales to billions of PDF pages. It combines extracted text and visual embeddings from the ColPali architecture. Fully Interpretable.

Notice the highlight of the term _non in the image of the page. Excellent work led by @thomas_thoresen! 🚀

mrweber retweeted

Daniel (dB.) Doubrovkine (parody of myself)

@dblockdotorg

over 1 year ago

Yesterday, @linuxfoundation announced the new OpenSearch Software Foundation, with Amazon transferring the 3½ year old @OpenSearchProj to LF. This is something I am personally very proud of, because I worked on the 6-page proposal to move OpenSearch to a neutral foundation at Amazon. https://t.co/WkM7uAjv2l

mrweber retweeted

almost 2 years ago

This is amazing! A fresh 33M MiniLM-based ColBERT checkpoint that beats the original checkpoint (110M) and many other single-vector models on BEIR. 96 dimensions per token vector. With Vespa binarization support, this means only 12 bytes per token vector. I will add an ONNX version soon so you can import it directly into @vespaengine https://t.co/ReNx387jcM I've always been a massive fan of MiniLM, and our work from 2021 used only MiniLM for an end-to-end retrieval and ranking pipeline (single-vector => multi-vector => cross-encoder). https://t.co/Mz6eVSg8Jd
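The 12-byte figure above follows from storing one bit per dimension: 96 bits / 8 = 12 bytes. A minimal sketch of that arithmetic, assuming sign-based thresholding (bit set when the value is positive) and most-significant-bit-first packing; the `binarize` helper is hypothetical and illustrates the idea only, not Vespa's actual implementation:

```python
import random


def binarize(vector):
    """Pack one bit per dimension into bytes (hypothetical helper).

    Assumption: a dimension maps to bit 1 when its value is > 0, else 0,
    and bits are packed most-significant-bit first.
    """
    packed = bytearray((len(vector) + 7) // 8)
    for i, value in enumerate(vector):
        if value > 0:
            packed[i // 8] |= 1 << (7 - i % 8)
    return bytes(packed)


random.seed(0)
# Stand-in for a 96-dimensional float token vector.
token_vector = [random.uniform(-1.0, 1.0) for _ in range(96)]
packed = binarize(token_vector)
print(len(packed))  # 12 bytes for 96 dimensions
```

Stored as 32-bit floats, the same vector would take 96 × 4 = 384 bytes, so one bit per dimension is a 32x reduction per token vector.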

mrweber retweeted