🚀 Introducing SaulLM-141B and SaulLM-54B: The First Open Family of Legal Models.
After #SaulLM-7B the family is growing!
We are proud to unveil the latest innovations from our team: the SaulLM-141B and 54B generative AI models, specifically designed for the legal domain.
Awesome to see them build on the same tricks as BidirLM-Omni, like the decoder-to-encoder adaptation and cross-modality model merging. The synthetic data ablation is also a huge highlight, with a +15 gain on code retrieval domains.
Gemini Embedding 2 is out, and it's completely omnimodal 😎
Great to see the next chapter of encoders heading toward text, audio, and vision, unlocking so many use cases.
https://t.co/t1d3HqRcdl
👏 Congratulations to @cohere on Command A+ — a powerful new model optimized for NVIDIA Blackwell and trained using NVIDIA CUDA-X libraries.
Proud to be a part of it!
Learn more ⤵️
Releasing open-source under the Apache 2.0 license. We want to give developers direct access to enterprise-grade agentic capabilities from experimentation to production.
Sovereign AI. For all.
Download Command A+: https://t.co/USXpmpid01
Or learn more: https://t.co/mXb3WLHN85
Cohere launches open weights model Command A+ that achieves 37 on the Artificial Analysis Intelligence Index
The release of Command A+ places @Cohere in line with Claude 4.5 Haiku on the Intelligence Index, and just above NVIDIA Nemotron 3 Super and Gemini 3.1 Flash-Lite.
Key Takeaways:
➤ Command A+ ranks first on AA-Omniscience Non-Hallucination at 86%, ~3 percentage points ahead of the next-best model. Its AA-Omniscience Accuracy is 9%, so the headline AA-Omniscience score lands at -4, demonstrating a similar archetype to Claude 4.5 Haiku, where the model knows its limits
➤ On Cohere’s API, Command A+ (~281 output tokens per second) is faster than several comparable open-weights and small to mid-sized proprietary models (e.g., GPT-5.4 nano, Claude 4.5 Haiku, and Grok 4.3), but still slower than Gemini 3.1 Flash-Lite Preview, which outputs 304 tokens per second
➤ Command A+ trails its peer set on scientific reasoning (HLE ~11%, GPQA Diamond ~76%) and on coding (Terminal-Bench Hard ~25%, SciCode ~38%), consistent with gaps on the hardest science and agentic coding benchmarks
➤ It supports visual reasoning and scores 63% on MMMU-Pro (between Claude 4.5 Haiku at 59% and GPT-5.4 nano (xhigh) at 65%)
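The AA-Omniscience figures in the first takeaway can be reconciled with a little arithmetic. The sketch below rests on two assumptions that the post does not state: that the headline score is the percentage of correct answers minus the percentage of incorrect answers, and that the non-hallucination rate is the share of non-correct responses where the model declines to answer instead of answering wrongly. The function name `omniscience_score` is illustrative only.

```python
def omniscience_score(accuracy: float, non_hallucination: float) -> float:
    """Assumed formula: % correct minus % incorrect, on a -100..100 scale."""
    # Share of questions the model did not answer correctly.
    not_correct = 1.0 - accuracy
    # Assumption: of those, the hallucinated share is answered wrongly,
    # and the remainder is abstained on (no penalty).
    incorrect = not_correct * (1.0 - non_hallucination)
    return 100.0 * (accuracy - incorrect)


# Figures quoted in the takeaway: 9% accuracy, 86% non-hallucination.
score = omniscience_score(accuracy=0.09, non_hallucination=0.86)
print(round(score, 2))
```

Under these assumptions the result is about -3.7, which rounds to the -4 quoted above: a low accuracy paired with a high abstention rate keeps the score close to zero.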
Introducing: Cohere Command A+
We’ve created our most powerful LLM yet, optimized it to run on as little hardware as possible, and released it open-source for all.
🚨 Do LLMs need to store everything they read in memory?
To reduce KV cache size and improve decoding speeds, we propose Self-Pruned KV attention, a mechanism where the model learns to decide which KVs to write in the persistent KV cache, discarding all the rest! @AIatMeta🧵
@JinaAI_ has hopped on the omnimodal train🚂
They just dropped a collection of two Omni embedding models (0.9B & 2B). Similar to BidirLM, they seem to rely on the Qwen modality head for the larger one, while sticking with EuroBERT for the nano version 🥰
https://t.co/A8BQma6Zpn
BidirLM-Omni is on MTEB and Sentence Transformers!
https://t.co/JRqmipX5xl
🥇#1 Open-Source Model on MTEB (#15 overall)
🖼️#1 across all sizes on MIEB (Image)
🎧#1 sub-7B model on MAEB (Audio, #2 overall)
Small size, massive performance, fully open.
Model: https://t.co/AZzOJ6ZhhN
We are currently presenting 'Should We Still Pretrain Encoders with Masked Language Modeling?' Come see us in Hall 3 #1304 @iclr_conf
https://t.co/kaPLch0Qen
BERT-as-a-Judge
A robust alternative to rigid lexical matching for LLM evaluation. Matches the performance of LLM-as-a-Judge at a fraction of the computational cost.
Encoders are so much better for classification, why not use them for judging?
Awesome study from @N1colAIs - cool to see a 210m BERT model beating much larger Qwen and Gemma models.
Evaluation is underrated. If your eval signal is noisy, you're flying blind. BERT-as-a-Judge gives you a fast, cheap way to improve your signal-to-noise ratio without spinning up a full LLM judge. Exactly the kind of infra work that compounds. @gisship @N1colAIs congrats!
🎉 Second paper this month! Introducing BERT-as-a-Judge (x @gisship) ⚖️
Evaluating LLMs with rigid lexical methods often fails right answers due to bad formatting. While "LLM-as-a-Judge" solves this, it remains costly & slow. Our fix? A lightweight, encoder-driven approach.
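To make the formatting problem concrete, here is a minimal sketch of how rigid lexical scoring can reject correct answers. It is not the paper's evaluation code: the helper names, the normalization rules, and the sample outputs are all hypothetical, chosen only to illustrate the failure mode.

```python
import string


def exact_match(prediction: str, reference: str) -> bool:
    """Rigid lexical check: the two strings must be identical."""
    return prediction == reference


def normalized_match(prediction: str, reference: str) -> bool:
    """A common softening: lowercase, drop punctuation, collapse whitespace."""
    def norm(text: str) -> str:
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        return " ".join(text.split())
    return norm(prediction) == norm(reference)


reference = "42"
# Hypothetical model outputs that all convey the right answer.
outputs = ["42", "**42**", "The answer is 42.", "forty-two"]

results = [
    (out, exact_match(out, reference), normalized_match(out, reference))
    for out in outputs
]
for out, exact, normalized in results:
    print(f"{out!r:24} exact={exact} normalized={normalized}")
```

Exact match accepts only the first output, and normalization rescues just the markdown-wrapped one. The sentence-style and spelled-out answers are still marked wrong, which is the gap a learned judge is meant to close.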
There's a wave of omni embedding models (gemini, nemotron, bidirlm). Excited to support this trend with our multimodal mteb versions (mieb, maeb) - video coming soon🎥
Omni embeddings are becoming the new standard. Glad to see @N1colAIs @Muennighoff pushing multimodal eval forward with MIEB & MAEB. Can't wait for the video!