David Golchinfar

@DavidGFar

Deutschland

Joined November 2009

179 Following

270 Followers

184 Posts

DavidGFar retweeted

Manuel Faysse

@ManuelFaysse

2 months ago

ViDoRe V3 has been accepted to ACL 2026! Much more relevant, however, are the many practitioners I met in recent days who praised the benchmark for its quality and clean signal. It's not easy making a good retrieval benchmark (non-saturated, practical, wide domain coverage), but it helps having done 2 imperfect earlier iterations, understanding what is needed, and having 12k hours of human annotation (thanks @NVIDIAAI) + 6 months of hard work from notably @MaceQuent1, @antonio_loison and @antoine_edy, amongst others.

From a scientific perspective, I believe we converged to a nice recipe that scales but remains challenging by mixing careful taxonomy design (@omrani_bilel), synthetic pre-annotations, and humans in many parts of the loop. This is by no means a cheap project - and going forward, non-trivial benchmarks will probably be more and more expensive. Gone are the days where using an LLM to annotate data creates a sufficient proxy for model improvement (ViDoRe V1).

Some people asked me when ViDoRe V4 would be out. I would assume this should not come from us - we are reaching the limits of data annotation; what we need now are fully real queries and real documents from real users of VDR (and more generally IR) models. This comes with data privacy issues and is tough, but I believe it should be the north star - too many datasets in IR are completely toy and optimizing for them actually hurts the model.

In all cases, ViDoRe V3 has a lot more to offer than just VDR (agents, RAG, etc.) and should have at least a few months of non-saturation ahead of it!

DavidGFar retweeted

Nicolas Boizard @N1colAIs

2 months ago

🚀 New model family release with an OMNIMODAL version! After EuroBERT, I'm excited to introduce BidirLM, a family of 5 frontier bidirectional encoders including an OMNIMODAL encoder at just 2.5B parameters. 🧵👇 https://t.co/AZzOJ6ZhhN

David Golchinfar

@DavidGFar

2 months ago

@antoine_chaffin With ModernBERT you guys really created a powerful encoder architecture! Your SOTAs don't lie 😎

David Golchinfar

@DavidGFar

2 months ago

@jrosenfeld13 Thank you!

David Golchinfar

@DavidGFar

2 months ago

We taught a 1.3M parameter model to play DOOM. It outperforms LLMs up to 92,000x its size.

Happy Easter Monday! Here's our Easter egg release: SauerkrautLM-Doom-MultiVec-1.3M.

17.8 average points per episode.

We benchmarked our tiny model against GPT-4o-mini (via OpenAI API), Nemotron-120B, Qwen3.5-27B, and Gemini Flash Lite (via OpenRouter API) on VizDoom's defend_the_center:

- Our model: 17.8 avg points/episode, 31ms per decision, runs on CPU
- Gemini Flash Lite: 0.8 avg points/episode (920ms latency)
- Qwen3.5-27B: 0.67 avg points/episode (13.3s latency)
- Nemotron-120B: 0.6 avg points/episode (8.9s latency)
- GPT-4o-mini: 0.0 avg points/episode (just dodges, never engages)

The architecture: ModernBERT-Hash

We took hash embeddings (Svenstrup et al. 2017), previously only applied to the original BERT architecture (see @neumll's BERT-Hash models), and brought them to ModernBERT, adding rotary position embeddings, alternating local/global attention, Flash Attention 2 support, and learned depth embeddings from VizDoom's depth buffer.

The result is a 5-layer encoder with a 75-token character-level tokenizer (no BPE, every ASCII character is one token, preserving spatial structure), attention pooling, and a 4-action classification head. Total: 1,319,300 parameters, ~5MB on disk, 31ms inference on CPU.

Trained on 31K frames of a human playing DOOM for about 2 hours. That's it.

Fully open source. Everything you need to reproduce this:

Model weights: https://t.co/bBvtlYFq2l
Training data (31K frames): https://t.co/AyEXw4mwbp
Code, training scripts, benchmark framework: https://t.co/GmPnTbQgAL
Full paper with methodology included in the repo.

Why does this matter beyond the fun factor?

Small specialized models can decisively beat general-purpose LLMs at real-time control tasks. Not by a small margin, by 22x on average points per episode. At 1/400th the latency. On a CPU. For free.

This has real implications for robotics, autonomous systems, game AI, and any domain where you need sub-100ms decisions on edge hardware. The future of AI isn't exclusively large. It's appropriately sized.

Thank you to my co-authors Daryoush Vaziri (University of Applied Sciences Bonn-Rhein-Sieg) and Alexander Marquardt (Nara Institute of Science and Technology, CARE Laboratory) for their contributions to this work.

Built with VizDoom, PyTorch, HuggingFace Transformers, and the ModernBERT architecture by @benjamin_warner, @antoine_chaffin, @ClavierBenjamin et al. Hash embedding approach inspired by NeuML's BERT-Hash models.

#AI #DOOM #GameAI #SmallModels #OpenSource #ModernBERT #SauerkrautLM #VAGOSolutions #Easter #TinyML
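
For readers unfamiliar with the hash embeddings mentioned in the post above, here is a minimal pure-Python sketch in the spirit of Svenstrup et al. (2017). It is not the SauerkrautLM-Doom implementation: the class name, the universal-hash scheme, and all sizes (a 30,000-token vocabulary, 1,000 buckets, 64 dimensions) are assumptions chosen only to show how a shared pool of vectors plus per-token importance weights can stand in for a dense embedding table.

```python
import random


class HashEmbedding:
    """Toy hash embedding: each token mixes a few vectors from a shared pool."""

    PRIME = 2_147_483_647  # large prime for the (a * x + b) mod p hash family

    def __init__(self, vocab_size, num_buckets, dim, num_hashes=2, seed=0):
        rng = random.Random(seed)
        self.vocab_size = vocab_size
        self.num_buckets = num_buckets
        self.dim = dim
        self.num_hashes = num_hashes
        # Shared pool of component vectors (these would be trained in a real model).
        self.pool = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(num_buckets)]
        # Per-token importance weights, one per hash function (also trained).
        self.importance = [[rng.gauss(0.0, 1.0) for _ in range(num_hashes)]
                           for _ in range(vocab_size)]
        # Fixed random parameters (a, b) for each hash function.
        self.hash_params = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                            for _ in range(num_hashes)]

    def bucket(self, token_id, hash_index):
        """Map a token id to a pool row using the chosen hash function."""
        a, b = self.hash_params[hash_index]
        return ((a * token_id + b) % self.PRIME) % self.num_buckets

    def embed(self, token_id):
        """Importance-weighted sum of the pool vectors selected by each hash."""
        vector = [0.0] * self.dim
        for i, weight in enumerate(self.importance[token_id]):
            component = self.pool[self.bucket(token_id, i)]
            for d in range(self.dim):
                vector[d] += weight * component[d]
        return vector

    def num_parameters(self):
        return self.num_buckets * self.dim + self.vocab_size * self.num_hashes


# Hypothetical sizes, not the DOOM model's configuration.
hashed = HashEmbedding(vocab_size=30_000, num_buckets=1_000, dim=64)
dense_parameters = 30_000 * 64
print(len(hashed.embed(42)), hashed.num_parameters(), dense_parameters)
```

In this made-up configuration the hashed table holds 124,000 parameters against 1,920,000 for a dense table of the same width, which is the kind of saving that makes very small encoders feasible.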

David Golchinfar

@DavidGFar

2 months ago

@poshlain This sounds awesome. Would love to hear more!

David Golchinfar

@DavidGFar

2 months ago

@vivis_dev Thank you. There is more to come based on the ModernBERT-Hash architecture. Stay tuned 😉

David Golchinfar

@DavidGFar

2 months ago

@maximelabonne Thank you 😊

DavidGFar retweeted

Hugging Models

@HuggingModels

4 months ago

Meet SauerkrautLM-Translator-LFM2.5-1.2B: a lean, multilingual translation powerhouse. It's not just another language model. It's a specialized translator trained with DPO for high-quality, nuanced text conversion. This is exciting for devs who need fast, accurate translation without massive compute.

David Golchinfar

@DavidGFar

6 months ago

@maximelabonne Given the rapid development in the field of AI, our jobs are therefore "safe" for the next 3-6 months. Until the scale reaches saturation. Then, what felt like a moat will become a metric 😁

David Golchinfar

@DavidGFar

6 months ago

Proud to present our new benchmark.

LaptencyBench (VagoKart GP)
LaptencyBench is a 13-minute closed-track evaluation where each “model” is deployed into a human driver and scored on best single-lap time (lower is better).
Secondary signals: laps completed (throughput) and gap to SOTA (delta vs. the fastest lap).

SOTA: Senior GLiNER set the pace with a 1:04.149.
Golchin Hallo delivered the best throughput (10 laps) and landed P2 at 1:05.378 (+1.229s).
Dr. D. held P3 with 1:06.404 (+2.255s). After that, the field entered the “regression zone” with sizable deltas.
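
The "gap to SOTA" figures in the post above can be reproduced with a few lines of Python. This is only a sketch: the helper name `lap_to_ms` is hypothetical, and it assumes lap times are written as `m:ss.mmm` with exactly three millisecond digits, as in the post.

```python
def lap_to_ms(lap):
    """Convert an 'm:ss.mmm' lap time into integer milliseconds."""
    minutes, rest = lap.split(":")
    seconds, millis = rest.split(".")
    return int(minutes) * 60_000 + int(seconds) * 1_000 + int(millis)


# Best single-lap times quoted in the post.
best_laps = {
    "Senior GLiNER": "1:04.149",
    "Golchin Hallo": "1:05.378",
    "Dr. D.": "1:06.404",
}

# Lower is better, so the SOTA is the minimum lap time.
sota_ms = min(lap_to_ms(lap) for lap in best_laps.values())
gaps_ms = {driver: lap_to_ms(lap) - sota_ms for driver, lap in best_laps.items()}

for driver, gap in sorted(gaps_ms.items(), key=lambda item: item[1]):
    print(f"{driver}: +{gap / 1000:.3f}s")
```

Working in integer milliseconds avoids floating-point rounding when subtracting lap times; the computed gaps match the +1.229s and +2.255s quoted in the post.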

David Golchinfar

@DavidGFar

6 months ago

SauerkrautLM-ColPali v0.1: Multilingual Multi-Vector Vision Retrievers for Visual Document Retrieval

We’re releasing SauerkrautLM-ColPali, a family of late-interaction, multi-vector vision retrievers for Visual Document Retrieval (VDR): searching PDFs/scans/screenshots directly in the visual space.

What we built:
Our codebase is a fork/extension of the ColPali engine and provides implementations + processors for multiple VLM backbones: ColQwen3 (1.7B Turbo / 2B / 4B / 8B), ColLFM2 (~450M), and ColMinistral3. The suite targets compact 128-dim embeddings for efficient indexing at scale, and supports EN/DE/FR/ES/IT/PT.

ViDoRe Benchmark (128-dim) highlights:
• ColQwen3-8B: v1 = 91.08 (#1), v3 = 58.55 (#1)
• ColQwen3-2B: v1 = 90.24 (best 1–3B class)
• ColLFM2-450M: v1 = 83.56 (best <1B class)
• ColQwen3-1.7B Turbo: v1 = 88.89 despite heavy compression

Hard parts (and what we learned):

Dataset reality for “complex” VDR (ViDoRe v1/v2/v3):
Public VDR datasets are a strong baseline (ColPali training data, VisRAG retrieval, multilingual VDR data). But pushing performance on visually complex, real-world docs required two new in-house multilingual datasets focused on tougher layout/visual grounding.

Porting LiquidAI/LFM2-VL-450M into a late-interaction multi-vector retriever:
LFM2-VL-450M is lightweight and originally English-centric. Multilingual training was not plug-and-play: naïvely mixing all multilingual data often stalled around loss ≈ 0.69 (≈ ln(2)). In many binary/pairwise contrastive setups, ln(2) indicates a “50/50 collapse” (no separation between positives/negatives). We fixed this with curriculum learning + staged training: trained an English-strong and a multilingual variant, merged them, then blended in a small-weight mMARCO multilingual retrieval specialist (EN/DE/IT/FR/ES) to further boost multilingual retrieval and downstream VDR behavior.

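
To see why a loss stuck near 0.69 signals the “50/50 collapse” described above, consider a two-way softmax over one positive and one negative. The sketch below is illustrative only: the function name is hypothetical, and the pairwise softmax loss is an assumed stand-in, not necessarily the objective used to train these models.

```python
import math


def pairwise_contrastive_loss(positive_score, negative_score):
    """-log softmax of the positive over {positive, negative}.

    Algebraically this equals softplus(negative_score - positive_score).
    """
    margin = negative_score - positive_score
    # Numerically stable softplus: log(1 + exp(margin)).
    return max(margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Collapsed model: positive and negative get the same score.
collapsed = pairwise_contrastive_loss(0.37, 0.37)
# Healthy model: the positive is clearly ranked above the negative.
separated = pairwise_contrastive_loss(4.0, -1.0)

print(f"collapsed: {collapsed:.4f}  ln(2): {math.log(2):.4f}")
print(f"separated: {separated:.4f}")
```

Whenever the two scores are equal the loss is exactly ln(2) regardless of their value, so a training curve that flattens there suggests the model is not separating positives from negatives at all.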
“Turbo” pruning without breaking the model (ColQwen3-1.7B-Turbo):
Based on Qwen3-VL-2B-Instruct, we removed 6 layers and reduced intermediate size (≈ -23% params). To avoid incoherent behavior, we ran a recovery (“healing”) phase: more mMARCO epochs first, then the 2B training recipe. Result: despite >20% reduction, Turbo is only slightly behind our 2B on ViDoRe v1.

Links:
GitHub: https://t.co/p6S3peZevm
HF collection: https://t.co/v2RD132po2
Demo (with heatmaps): https://t.co/m3G7oADjjG

A more detailed technical report will follow shortly.

If you work on RAG over visually rich PDFs or multilingual enterprise documents, we’d love to hear what you’re building.

Thanks to @liquidai, @Alibaba_Qwen, @ManuelFaysse, @sibille_hugues and the rest of the ColPali team.
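
For context on the “late-interaction, multi-vector” retrieval this release is built around, here is a minimal sketch of ColBERT-style MaxSim scoring, the scoring scheme commonly associated with ColPali-family models. The function name and the toy 2-dimensional vectors are assumptions for illustration; the released models use 128-dimensional embeddings and their own batched implementation.

```python
def maxsim_score(query_vectors, document_vectors):
    """Late-interaction score: for each query vector, take its best-matching
    document vector (by dot product) and sum those maxima."""
    total = 0.0
    for q in query_vectors:
        total += max(
            sum(qi * di for qi, di in zip(q, d)) for d in document_vectors
        )
    return total


# Toy query with two token vectors and two candidate pages with patch vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "page_b": [[0.5, 0.5]],
}

# Rank pages by descending MaxSim score.
ranking = sorted(pages, key=lambda name: maxsim_score(query, pages[name]), reverse=True)
print(ranking)
```

Because every query vector only needs its single best match on the page, a document is scored by its most relevant regions, which is one reason this approach suits visually dense pages.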

DavidGFar retweeted

Pau Labarta Bajo

@paulabartabajo_

6 months ago

ColLFM2: 450M multimodal embedding model built on top of LFM2-ColBERT-350M Enjoy ↓ https://t.co/ugYRDjRsZJ

David Golchinfar

@DavidGFar

6 months ago

Yes, it is actually built on top of https://t.co/9214dcbTY6. Indeed, it would be interesting to use the LFM ColBERT variant as a base, or even merge it into the text decoder of LFM2-VL-450M. We actually did something similar here: we trained the VL model on text retrieval data first and used this text specialist as a "submodel" for merging.

David Golchinfar

@DavidGFar

6 months ago

@SebastianB929 I did not test different speakers. But in general the new version feels more "natural". The old version was a bit too emotional with the default settings imo.

David Golchinfar

@DavidGFar

7 months ago

We live and breathe #OpenSource, and we are committed to a #SovereignEurope 🇪🇺 with our new #SauerkrautLM-GliNER release.

Model card on Hugging Face: https://t.co/6gViX6dZ48
Test the model yourself in our demo Space on Hugging Face 👉 https://t.co/ocj7YDaGNr

Why this model release matters for Europe’s sovereignty:

💠 Sovereign enterprise AI depends on high-quality data and efficient AI-supported workflows; otherwise you risk an expensive “garbage in, garbage out” loop.
💠 True data quality goes far beyond vectorization: it requires breaking information down into entities and relations, often structured as graphs.
💠 Standard LLMs aren’t suitable for this: they’re too costly and too generic.
💠 Classic #NER models can extract entities, but only perform well with domain-specific training.
💠 This means every domain needs its own NER model, which is repetitive, resource-intensive, and hard to scale.
💠 GliNER models solve this by acting as generalist NER systems.
💠 SauerkrautLM-GliNER is our strongest multilingual model yet, outperforming GliNER_multi-v2.1 and GliNER_multi_pii-v1.

Trained on five European languages (DE 🇩🇪, IT 🇮🇹, EN 🏴󠁧󠁢󠁥󠁮󠁧󠁿, FR 🇫🇷, ES 🇪🇸), it delivers exceptional multilingual, #crossDomain entity extraction.

Big thanks to our colleague Michele Montebovi for leading and delivering the training!